*Article* **Using Machine Learning in Predicting the Impact of Meteorological Parameters on Traffic Incidents**

**Aleksandar Aleksić, Milan Ranđelović and Dragan Ranđelović \***

Faculty of Diplomacy and Security, University Union-Nikola Tesla Belgrade, Travnicka 2, 11000 Belgrade, Serbia **\*** Correspondence: dragan.randjelovic@fdb.edu.rs

**Abstract:** The availability of large amounts of open, publicly accessible data is one of the main drivers of the development of the information society at the beginning of the 21st century. In this sense, acquiring knowledge from these data using different methods of machine learning is a prerequisite for solving complex problems in many spheres of human activity, from medicine to education and the economy, including traffic as an important economic branch today. With this in mind, this paper deals with the prediction of the risk of traffic incidents using both historical and real-time data for different atmospheric factors. The main goal is to construct an ensemble model based on several machine learning algorithms which has better prediction characteristics than any of the included algorithms applied individually. In general, the proposed model could be a multi-agent system, but in the considered case study, a two-agent system is used, so that one agent solves the prediction task by learning from historical data while the other agent uses real-time data. The authors evaluated the obtained model on a case study with data for the city of Niš in the Republic of Serbia and also described its implementation as a practical web application for citizens.

**Keywords:** machine learning; regression; classification; prediction; meteorological parameters; traffic incidents; multi-agent architecture

**MSC:** 68T05; 68T42

#### **1. Introduction**

The parameters affecting the occurrence of traffic incidents (TI) are usually divided into three main groups: human factors, the vehicle and the environment [1]. However, in [2], the authors divide the main factors into five groups, adding the roadway as well as occupants and other road users to the three listed, while in [3], the authors consider many types of parameters without any grouping; it can therefore be concluded that there is no single taxonomy. In this paper, the authors consider the influence of the third group, environmental factors, and within it just the meteorological subgroup, which belongs to the wider subgroup of atmospheric parameters in the environmental factors group (Dastoorpoor et al. [4]). Predicting the impact of atmospheric parameters on TI is an important task because TIs cause not only human losses but also economic damage. By 2030, TIs are predicted to become the sixth leading cause of death worldwide, overtaking cancer [5], or, by another estimate, the seventh leading cause, overtaking HIV/AIDS [6]. TIs are also economically important because they cause a yearly loss of 3% of gross domestic product globally, and roughly double that in lower-middle-income countries [6].

**Citation:** Aleksić, A.; Ranđelović, M.; Ranđelović, D. Using Machine Learning in Predicting the Impact of Meteorological Parameters on Traffic Incidents. *Mathematics* **2023**, *11*, 479. https://doi.org/10.3390/math11020479

Academic Editors: Snezhana Gocheva-Ilieva, Atanas Ivanov and Hristina Kulina

Received: 16 December 2022; Revised: 5 January 2023; Accepted: 7 January 2023; Published: 16 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Bearing in mind the above data on the significant negative consequences of TIs for human lives and the economy, it is obvious that predicting the influence of various factors, including meteorological ones, is an important task in preventing their occurrence. This process of predicting

the impact of different factors on the occurrence of a traffic accident has two main tasks: first, to help citizens themselves reduce the possibility of being involved in a traffic incident, and, second, to help the traffic police increase their control over known locations in specific weather conditions and in this way reduce the number of incidents. Realizing these two main tasks through the mentioned prediction can consequently reduce human casualties as well as economic damage. As already mentioned, the subject of this paper is the prediction of the impact of meteorological factors as a subgroup of the atmospheric parameters and of the wider group of environmental parameters. All these parameters, together with others such as sociological and geographical ones, are of different types and can be viewed in different ways. For example, Chan et al. [7] observe these parameters in groups of meteorological variables (temperature, pressure, humidity, wind speed, etc.), pollutant variables (PM, CO, O3, SO2, etc.) and auxiliary variables (geographical, time, sociological, economic, related to the type of road, etc.), while Jalilian et al. [8] consider each factor individually without grouping. It is important to note that many of these factors are recorded as variables whose values change through time, which enables knowledge to be learned later from their history. In this paper, the authors deal only with the first group, the meteorological factors, and their impact on the daily occurrence of traffic accidents. The authors point out that, in addition to the data available in real time from competent institutions, large amounts of historical data are collected in those institutions, from which machine learning (ML) can extract knowledge to predict the impact of meteorological factors on the occurrence of TI.

Furthermore, in the paper, the authors propose an ensemble aggregation method to solve the prediction task. For this purpose, they aggregate the two mentioned approaches, learning from historical data and using real-time data, into a methodology that can be effectively implemented as a multi-agent system using modern intelligent technology.

For the evaluation of the impact of these factors on TI, the literature offers several classic statistical methods, the most used being regression analysis, followed by factor and discriminant analysis, and, on the other hand, algorithms of data mining and artificial intelligence, among which the most used are artificial neural networks and different classification algorithms. In practice, choosing a subset from the set of parameters leads to the problem of feature selection before creating a prediction [9]. With these facts in mind, the main objective of this manuscript is to report on a research method and discuss the advantage of aggregating, in one prediction model, two methods for determining the impact of meteorological factors on TI. The two methods the authors propose are the most used ones from the two previously mentioned groups, combined in one ensemble ML prediction model: binary regression from the group of statistical methods and classification with feature selection from the group of ML methods. Moreover, an obligatory part of this research was the realization of the developed ensemble model as a multi-agent system (MAS) in the general case, concretely in this paper as a two-agent system in which agent 1 draws knowledge through ML from historically available data, while agent 2 deals with the same parameters in real time. Such a two-agent architecture enables decision making using an algorithm, such as a decision matrix from the group of decision-making methods, which will be proposed in the section dealing with the technical implementation of the proposed MAS, including one emergent intelligence technique that integrates the MAS into a collaborative whole based precisely on that algorithm.
Having in mind the complexity of the considered problem and the already-mentioned fact that the impact of meteorological factors on TI is only one small part of all the groups of factors (human factors, vehicle factors, other environmental parameters including air pollution, road factors, geographical factors, factors of economic development and so on), it is obvious that the implementation of this model as a MAS must enable its permanent, continuous upgrading. The authors could not find such a solution, based on the approach of using a MAS in the proposed form, in the literature.

In this paper, for the purpose of evaluating the proposed model, the authors used a case study of daily TI data for the city of Niš in the Republic of Serbia in the period 1999–2009, together with data for different meteorological factors for the same city and period. The study determines the individual influence of each considered meteorological factor on the occurrence of TIs using a model that aggregates different classification algorithms. The model is based on an algorithm whose input was pre-processed using different feature selection methods from ML and binary regression analysis from traditional statistics, combined in one ensemble ML method. In this way, the conditions are ensured for implementation in the already-described two-agent system for early warning of interested parties, above all ordinary citizens and the traffic police, including the use of today's most popular social media platforms, as can be found in the paper of Lu et al. [10].

Descriptions of the influence of different meteorological conditions on TI can be found in papers that apply different forms of regression models: linear and binary regression, the generalized linear model, combinations of artificial intelligence and regression, and different autoregressive methods. This type of statistical model is also often used for predicting the impact of the other mentioned groups of factors, for example environmental ones such as location, type of road, date and time, as well as individual factors from these groups and their combinations. A global review of the possibilities and characteristics of different types of regression methods can be found in Trencevski et al. [11] and Gupta et al. [12]. Many studies address the impacts of different meteorological parameters on traffic safety [13]. One of them, a meta-analysis of 34 studies dealing with the effect of precipitation, found an average increase in traffic accidents of 71% and 84% in the case of rain and snowfall, respectively [14]. However, in [15], the authors considered crash severity and concluded that there is a significant reduction under rainy conditions compared to fine weather. Additionally, the effect of precipitation on traffic accidents was considered on Finnish motorways, where it was concluded that it could differ for different types of accidents: in the case of snow, the relative risk for single accidents versus multiple accidents was 3.37 versus 1.98 [16]. In paper [17], we find an investigation of the influence of 17 meteorological factors on the number of crashes in the Netherlands during 2002. The impact of a combination of the meteorological factors of temperature and precipitation is given in [18].
Study [19] investigates the impact of weather elements and extreme snow or rain weather changes on seven crash types using five years of data collected for the City of Edmonton in Canada. In [20], the authors deal with different vehicle types, from high-sided trucks and buses to vans, which are the most affected by strong wind; we can also find in the literature that, in general, greater wind speeds increase the severity of traffic accidents involving single trucks [21].

The effect of sun glare on traffic accidents is the subject of a small number of studies, but the authors of paper [22] deal with this problem using data from signalized crossroads in the city of Tucson, USA.

They concluded that rear-end and sideswipe accidents occur more frequently during glare and that sun glare has no effect on crash severity. However, in paper [23], data on traffic accidents in Japan indicate that sun glare has a strong impact on pedestrian traffic accidents, crashes at crossroads and bicycle crashes, while there is no indication that the impact of sun glare increases with vehicle speed.

In [24], the authors deal with determining the impact of snowfall on TI. The different studies mentioned focus on the impacts of individual or grouped meteorological factors on specific traffic accident types, but these studies differ with respect to region, time period and methodology, which makes their results difficult to compare.

In [25], a study is presented in which correlation and linear regression analysis were conducted to estimate the influence of meteorological factors on road traffic injuries stratified by severity. The study of Khan et al. [26] shows that the occurrence of traffic accidents in hazardous weather conditions of wind, rainfall, snowfall and fog broadly follows the patterns of those weather parameters. A paper dealing with weather impacts on various types of road crashes through a quantitative analysis using the generalized additive model (GAM) can be found in [27], while paper [28] considers the same problem using a combined backward propagation–artificial neural network (BP–ANN) regression model. In [29], an integer autoregressive model was proposed for prediction in four traffic safety categories in Athens: vehicle accidents, vehicle fatalities, pedestrian accidents and pedestrian fatalities. In [30], we find the application of an autoregressive integrated moving average (ARIMA) model for determining the impact of weather factors on TI in France, and in [31] a comparison of the generalized linear model and ARIMA for the same problem in France, Greece and the Netherlands. In [10], we find the use of the regression model in a modern, complex system for predicting the influence of different weather factors on TI in China, with the help of data obtained from social media combined with physically sensed data.

On the other hand, using ML algorithms to determine the importance of the individual impact of each of the many meteorological factors on traffic accidents, as well as to build suitable prediction models, is today the other frequently used methodology. More and more papers in the existing literature use these two groups of methods to solve the problem discussed in this paper. These methods belong to different individual types of ML (classification, clustering, neural networks and other standard ML methods), to aggregations of these standard ML methods with each other or with classic statistical methods such as regression and, finally, to the newest ensemble methods, to which the solution proposed by the authors belongs.

Thus, in [32], Zheng et al. consider different groups of atmospheric factors: meteorological variables (temperature, pressure, humidity, wind speed, etc.), pollutant variables (PM, CO, O3, SO2, etc.) and auxiliary variables (geographical, time, sociological, related to the type of road, etc.), and give a comprehensive review and taxonomy of the different types of ML methods that can be applied in atmospheric environment studies of TI, which is compatible with the content presented in the already-cited reference [11] that deals in particular with regression models. A similar problem from the standpoint of sensor use for this purpose is presented in [33], while [34] gives a review of urban traffic flow prediction techniques with a special focus on the literature. In [35–38], we can also find similar comprehensive reviews of different artificial neural network (ANN) methods used for the same purpose. Using ML models based on ANN is a highly effective way to simulate the atmospheric environment, which is very important in time-limited applications [39], and in this group, deep learning has received special research attention [40–42]. Different models of ANN are available, for example, recurrent ANN [43]. Additionally, ANN predictions of meteorological impact factors on traffic accidents are available for geographically diverse areas across the globe: Switzerland [44], Bangladesh [45], Jordan [46], Iran [47], the USA [48] and Australia [49], and they are considered for developing countries in general [50].

Moreover, in the literature, we can find aggregations of the most applicable ML method, the ANN, with other ML methods: for example, with genetic algorithms in [51], with cluster algorithms in [52], with a logit model and factor analysis in [53], with the random forest in [54,55], and with the second most used ML method, the decision tree from the classification group, in the case study for data from Nigeria in [56]. It is also used in a case study of northern England using the UK Stats19 data set [57] and in a case study for data from the USA [58]. The application of different aggregated classification methods, such as J48, ID3, the classification and regression tree (CART), the decision tree and Naive Bayes, can be found in [59], and CHAID, the J48 decision tree and Naive Bayes in [60].

In the already-cited reference [32], it is noted that ensemble models are an increasingly used type of ML model for predicting the impact of atmospheric parameters on different fields of human life; this is the case in the field of traffic accident prediction as well. In [61], a systematic review of ML methods is given in which ensemble methods, as the most modern type, are considered in separate sections, and in [62–65] we find descriptions of different ensemble methods for predicting traffic accidents affected by meteorological parameters. Having in mind the model the authors propose in this paper, it is especially important to remark that ensemble methods that aggregate classification and regression trees in one ensemble algorithm for the considered problem of predicting the impact of meteorological factors on traffic accidents are very rare in the literature, although they can be found, for example, in [66].

In particular, there has been a trend of developing forecast models to predict future states in all types of traffic since the beginning of the 21st century. Different taxonomies of these models can be found: for example, the division into parametric and non-parametric models depending on the distribution of input values [67], or the division into deterministic models, in which the model outputs are fully determined by the input factor values, and probabilistic, i.e., stochastic, models [68]. ARIMA is one of the most-used parametric methodologies, with different subtypes such as the multivariate spatial-temporal autoregressive (MSTAR) model [69], which at the same time belongs to the probabilistic methods according to the second mentioned division, while time-series analysis and trends belong to the deterministic methodology, for example [70].

Bayesian deep learning approaches and convolutional neural networks are increasingly present in recently published literature for predicting the influence of uncertain environmental parameters, including the meteorological factors considered in this paper, on TI [71]. This is especially expressed in the subfield of so-called short-term predictions of trajectories in different types of public traffic, for example, in aircraft trajectory prediction [71,72], as well as in the prediction of road traffic in general and autonomous driving [73]. The authors set as the main goal of this paper answering the following two research questions:

(1) Is it possible to construct an ML ensemble method which aggregates ML classification methods and feature selection methods for attribute selection with the binary regression method, and which demonstrates better prediction characteristics than each of the methods included in the ensemble applied individually?

(2) Can this new ensemble method be implemented in a multi-agent-supported technological system?

To answer these two research questions and confirm the two corresponding hypotheses, the authors evaluated the proposed ensemble model on a case study for the city of Niš, Republic of Serbia, using its meteorological and TI data for ten years at the beginning of the 21st century.

In order to realize the set goal and show that the proposed ensemble model is an effective solution for the considered problem of predicting traffic accidents, the rest of this manuscript is organized as follows: after this first section, the Introduction, the second section, Materials and Methods, gives a description of the material used and a comprehensive review of the applied methodologies; the third section, Results and Findings, describes the results of applying the proposed model in the case study; the next section, Technological Implementation of the Proposed Ensemble Model, describes the implementation of the proposed model as a technical solution; and the fifth section, Conclusions, gives the contributions of this research and proposes future work on efficiently solving the problem discussed in this paper.

#### **2. Materials and Methods**

Having in mind the present development of computer-based solutions for predicting the impact of atmospheric parameters on the occurrence of traffic accidents, often using existing ML techniques, it can be said that the mentioned group of ensemble methods is a trend in solving such a complex problem, and the implementation of such solutions as multi-agent systems follows this trend directly. However, the literature still does not contain a large enough number of references that integrate several ML methods, different or of the same type, into ensemble prediction models, so additional research on such methods is needed; this was the authors' motivation to develop one such novel method.

In this paper, the authors describe not only the newly proposed model but also its implementation as one of the agents in a multi-agent system with an emergent intelligence technique (EIT) for the purpose of a citizen warning system. For the evaluation of the proposed model, the authors used the material from the case study for the City of Niš in the Republic of Serbia presented in this paper. In it, the analyzed material is classified so that all data in the considered period are divided into two classes: positive, when the daily number of traffic accidents is greater than the average value for this period, and negative in all other cases. In this way, it can be said that the positive class includes the instances when conditions significant enough for the occurrence of traffic accidents on that day are present in the atmosphere.

#### *2.1. Methods*

The problem of predicting the impact of meteorological parameters on TI considered in this paper belongs to the group of classification problems, for whose solving two main groups of methods are available: the classic statistical method of logistic regression and ML-based classification.

With the logistic regression model, we describe the relationship between predictors, which can be continuous, binary or categorical, and a categorical dependent variable. For example, the dependent variable can be binary: based on some predictors, we predict whether something will happen or not. We actually estimate the probabilities of belonging to each category for a given set of predictors. Depending on the type of dependent variable, we have:

Binary logistic regression—the dependent variable is binary (for example, a true or false answer to a question);

Nominal logistic regression—the dependent variable has three or more categories that cannot be compared in value (for example, colors: white, black, red, green, blue, etc.);

Ordinal logistic regression—the dependent variable has three or more categories that can naturally be compared, but the ranking does not necessarily mean that the "distances" between them are equal (for example, health status: stable, serious or critical).

Logistic regression is used when the dependent variable takes only a finite set of values.
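As a concrete illustration of binary logistic regression applied to the paper's setting, the sketch below fits a model with scikit-learn on synthetic data; the single predictor (daily precipitation), the labelling rule and all numbers are illustrative inventions, not data or thresholds from this study.

```python
# Minimal sketch: binary logistic regression on one synthetic weather
# predictor, estimating the probability of an "above-average incidents" day.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
precipitation = rng.uniform(0, 20, size=(200, 1))        # mm of rain per day
# Hypothetical rule: rainier days tend to have above-average incident counts
labels = (precipitation.ravel() + rng.normal(0, 3, 200) > 10).astype(int)

model = LogisticRegression().fit(precipitation, labels)
# Estimated probability of the "success" class for a dry and a rainy day
p_dry, p_wet = model.predict_proba([[2.0], [18.0]])[:, 1]
```

The fitted probabilities stay inside (0, 1) by construction, which is exactly what the linear regression approach discussed next cannot guarantee.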

We may wonder whether we can still use linear regression in classification problems. In the case of binary logistic regression, we consider the dependent variable to be a Bernoulli random variable, denoted Y, as shown in Equation (1). Then we have two categories, coded 0 for failure and 1 for success.

$$\mathbf{Y} = \begin{cases} 0, & \text{failure} \\ 1, & \text{success} \end{cases} \tag{1}$$

Therefore, the dependent variable is Bernoulli and not some continuous random variable, meaning that the errors cannot be normal. Additionally, if we ran a linear regression, we could get meaningless fitted values: values outside the interval [0, 1]. In the case of a binary dependent variable, one way to use linear regression for a classification problem is as follows: for a given set of predictors, if the fitted value from linear regression is greater than 0.5, we classify that observation as a success, and otherwise as a failure. This method is then equivalent to linear discriminant analysis, which we will discuss later. With this method, we only obtain a classification for each observation: "success" or "failure". If the fitted values from linear regression are close to 0.5, we are less confident in our decision. We also note that, if the dependent variable takes more than two values, linear regression cannot be used as just described, and linear discriminant analysis must be used instead.
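The thresholding idea above can be sketched in a few lines; the six data points are purely illustrative, chosen only to show ordinary least squares on a 0/1 outcome followed by the 0.5 cut-off.

```python
# Fit ordinary linear regression to a Bernoulli-coded outcome, then classify
# observations with fitted value > 0.5 as "success" (1), otherwise "failure" (0).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 0, 1, 1, 1])             # Bernoulli coding, Equation (1)

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta                            # fitted values may leave [0, 1]

predicted = (fitted > 0.5).astype(int)       # threshold rule at 0.5
```

Note that the fitted value at x = 5 here exceeds 1, illustrating the "meaningless fitted values" the text warns about, even though the thresholded classes come out sensibly.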

ML is a comprehensive discipline based on statistical analysis and artificial intelligence, and it is used for learning knowledge, i.e., concretely, learning rules, concepts, models, etc. that should be understandable and acceptable to people. In the ML process, it is obligatory to have some kind of evaluation of the validity of the knowledge learned, i.e., some evaluation of the obtained rules, concepts or models. For this purpose, evaluation methods are available based on a process in which the available data set is divided, in different ways, into a learning set and a test set.


One of the most important measures of the success of learned knowledge is predictive accuracy: the ratio of the number of successful classifications to the total number of classifications. Precision, recall, the F1 measure and the receiver operating characteristic curve, described later in this chapter, are also often used to measure the success of learned knowledge. The basic goal of any prediction process is to obtain a model for the dependent variable based on an exact, numerically determined combination of independent variables. It is important to remark that the choice of variables included in this process from a given data set affects the accuracy and other measures of the obtained prediction model; because of that, it is necessary to use techniques for variable selection in the data preparation phase, i.e., to apply some method of the so-called feature selection procedure.

An ensemble ML model is proposed in this paper for predicting the potential risk of traffic accidents caused by meteorological, i.e., atmospheric, parameters. As already mentioned, the proposed method is an aggregation that optimizes several different classification algorithms using attribute reduction and binary regression. An implementation of this algorithm could represent one agent in the two-agent system considered and described in this paper, in which the other agent computes alarms according to the values of meteorological parameters in real time. This two-agent system could be generalized to a wider and more complex multi-agent system, which could include other types of environmental parameters besides meteorological ones and which could be based on different possible forms of emergent intelligence for collective decision making. Such an implementation of an EIT solution as an emergency software tool could be realized as a web application accessible to all stakeholders, starting with citizens and other interested parties.

Because the proposed ensemble method aggregates logistic regression with ML methods of classification and feature selection, the subchapters that follow give a brief description of these methodologies.

#### 2.1.1. Classification Methodology

Classification algorithms belong to the supervised ML techniques and can be used for the task of predictive modelling. Using the classification methodology for this purpose implies the existence of labeled instances in each of two or more classes of objects, so that the value of an obligatorily categorical class attribute is predicted using the values of the remaining predictor attributes [74].

The selection of an appropriate classification algorithm for the concrete application is not only the first but also the most important step in the process of ML from big data. For solving the problem considered in this paper, in their proposed ensemble model, the authors use classification into two classes, positive and negative, each of which can be predicted correctly (true) or incorrectly (false). All possible outcomes of prediction are presented in the confusion matrix shown in Table 1.

**Table 1.** The confusion matrix for the two-class classifier.

| | Classified as Positive | Classified as Negative |
|---|---|---|
| Actual positive | true positive (TP) | false negative (FN) |
| Actual negative | false positive (FP) | true negative (TN) |
The number of members in the considered set shown in Table 1 is the sum of the positive and negative cases and is denoted N, i.e., TP + FN + FP + TN = N. From the results presented in Table 1, for the considered case of a two-class classifier, the most important measures of classification, accuracy, precision, recall and the F1 measure, are given by the following formulas:

$$\text{Accuracy} = (\text{TP} + \text{TN}) / \text{N} \tag{2}$$

$$\text{Precision} = \text{TP} / (\text{TP} + \text{FP}) \tag{3}$$

$$\text{Recall} = \text{TP} / (\text{TP} + \text{FN}) \tag{4}$$

$$\text{F1measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{5}$$
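Equations (2)–(5) can be checked on a small worked example; the counts TP, FN, FP and TN below are illustrative numbers, not results from the case study.

```python
# Worked example of Equations (2)-(5) on made-up confusion-matrix counts.
TP, FN, FP, TN = 40, 10, 5, 45
N = TP + FN + FP + TN                               # total classified instances

accuracy = (TP + TN) / N                            # Equation (2): 85/100
precision = TP / (TP + FP)                          # Equation (3): 40/45
recall = TP / (TP + FN)                             # Equation (4): 40/50
f1 = 2 * precision * recall / (precision + recall)  # Equation (5)
```

Here accuracy is 0.85 while F1, the harmonic mean of precision and recall, is slightly lower, which is why both are reported: accuracy alone can hide an imbalance between the two error types.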

In the evaluation of the prediction performance of any classifier, the receiver operating characteristic (ROC) curve is also often used; it plots the false positive rate on the OX axis and the true positive rate on the OY axis [75,76], so that, for example, point (0, 1) represents perfect prediction, where all the samples are classified correctly, and point (1, 0) represents a classification that classifies all samples incorrectly. It is important to know that the output in ROC space produced by neural networks or the Naive Bayes classifier is a probability, i.e., a numeric score, while discrete classifiers produce only a single point; in both cases, the output represents the degree to which a particular instance belongs to a certain class [77]. The area under the curve (AUC) is the most-used measure of the diagnostic accuracy of the model, and AUC values greater than 0.7 indicate a good classification model.
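The AUC can be computed directly from classifier scores via its rank-statistic interpretation: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The labels and scores in this sketch are invented for illustration.

```python
# AUC as the fraction of positive-negative pairs ranked correctly
# (ties between scores count as half a correct ranking).
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]   # higher score = "more positive"
area = auc(labels, scores)                 # 8 of the 9 pairs ranked correctly
```

With one positive instance scored below one negative instance, the result is 8/9 ≈ 0.89, which would count as a good classifier by the 0.7 rule of thumb above.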

Practically, classification is a task of ML, but it can also be a task of data mining; it separates the instances of a considered data set, based on the values of the input variables, into one of the pre-determined classes of the output variable [78].

The literature review shows that the most commonly applied classifiers include neural networks, Bayes networks, decision trees, K-nearest neighbors, etc. [79].

For the proposed model, the authors used some of the most widely applied classification algorithms, belonging to five different groups as they are organized in Weka [80], one of the most popular software tools for this purpose: Bayes, meta, trees, rules and functions. Accordingly, a short description of one selected algorithm from each of the mentioned Weka classifier groups is given below.

The Naive Bayes classifier [81,82] from the Bayes group of Weka belongs to the group of the oldest classification algorithms and generates a prediction model using Bayes' theorem. It is called "naive" because it simplifies the classification problem through two important assumptions: first, that the attributes used in the prediction procedure are conditionally independent given the class, and second, that there are no hidden attributes that could affect the prediction. These assumptions allow an efficient classification ML algorithm. For conditionally independent attributes A1, ... , Ak, the probability for class attribute A is calculated using the following rule:

$$P(\mathbf{A}\_1, \dots, \mathbf{A}\_k | \mathbf{A}) = \prod\_{i=1}^k P\left( \mathbf{A}\_i | \mathbf{A} \right) \tag{6}$$

The main advantages of the Naive Bayes classifier in relation to other classifiers are primarily its efficiency, simplicity, and suitability for small data sets.
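The classification rule can be illustrated with a toy example: the score of a class is its prior probability multiplied by the product from Equation (6). The attribute names and probability tables below are purely hypothetical, not estimates from the case-study data:

```python
from math import prod

# Hypothetical conditional probabilities P(attribute | class) for a binary
# target (1 = increased number of TI, 0 = otherwise)
cond = {
    1: {"rain": 0.7, "low_visibility": 0.6},
    0: {"rain": 0.2, "low_visibility": 0.3},
}
prior = {1: 0.12, 0: 0.88}   # hypothetical class priors

def score(cls, observed):
    # prior times the product of conditional probabilities from Equation (6)
    return prior[cls] * prod(cond[cls][a] for a in observed)

observed = ["rain", "low_visibility"]
scores = {c: score(c, observed) for c in (0, 1)}
predicted = max(scores, key=scores.get)
```

Note that with these numbers the majority class still wins (0.0528 versus 0.0504 for class 1), showing how a strong prior can outweigh moderately informative attributes.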

The LogitBoost classifier from the meta group of Weka is widely applied in practice because it has very good characteristics, primarily thanks to the boosting algorithm [83]. This classifier uses the principle that finding multiple simple rules can be more efficient than finding a single precise rule, and it therefore combines them into usually complex prediction rules. It represents, essentially, a general method for improving the accuracy of ML algorithms.

Decision trees [84] from the trees group of Weka are the most widely used classification technique, because they offer several possible ways of construction that are very convenient for interpretation. The trees can be used with all kinds of classification attributes (categorical or numerical). ID3 [85] and C4.5 [86] are the most-used algorithms from this group of classifiers, and among the trees in the Weka tool, one of the best known is the J48 tree.

The PART classifier from the rules group of Weka builds a partial decision tree, using the C4.5 decision tree classifier in each of its iterations and turning the best leaf into a suitable rule of the tree. This classifier does not belong to the group of often-used classifiers, but it is useful in binary classification, as applied in this paper.

SMO from the functions group of Weka refers to the sequential minimal optimization algorithm used inside implementations of the support vector machine (SVM) algorithm. Practically, it solves the quadratic programming problem which arises during the training of an SVM on classification tasks defined on sparse data sets. It is also not one of the often-used classifiers, but it is used in this paper because it is appropriate for binary classification with numerical and binary attribute types, which is the case in this paper.

#### 2.1.2. Logistic Regression

In ML, probabilistic classifiers, which return not only the label of the most likely class but also the probability of that class, are needed in many cases. Such a probabilistic classifier is well calibrated if the predicted probability matches the true probability of the event of interest. Calibration can be checked using a calibration plot, which demonstrates how good a classifier is on a given set of data with known outcomes; this holds for the binary classifiers considered in this paper (in the case of multi-class classifiers, a separate calibration plot is needed for each class).

The authors used the idea of calibration, as many other authors did, for example in [87]; as seen in [83], univariate calibration using logistic regression transforms classifier scores into probabilities of class membership for the two-class case.

The main goal of logistic regression is to obtain the best-fitting model describing the relationship between the dichotomous characteristic of interest, which is the dependent variable (response or outcome variable), and a set of independent variables (predictor or explanatory variables).

Logistic regression generates the coefficients of a formula to predict a logit transformation of the probability p of the presence of the characteristic of interest (with a determined standard error and significance level):

$$\text{logit}(\mathbf{p}) = b\_0 + b\_1 \mathbb{X}\_1 + b\_2 \mathbb{X}\_2 + \dots + b\_k \mathbb{X}\_k \tag{7}$$

The logit transformation is defined as the logged odds:

$$\text{odds} = \frac{\text{p}}{1 - \text{p}} = \frac{\text{probability of characteristic presence}}{\text{probability of characteristic absence}} \tag{8}$$

and

$$\text{logit}(\mathbf{p}) = \ln(\frac{\mathbf{p}}{1-\mathbf{p}}) \tag{9}$$

In ordinary regression, the parameters are chosen to minimize the sum of squared errors, while logistic regression chooses the parameters that maximize the likelihood of the observed sample values. The coefficients b0, b1, b2, ... , bk are the regression equation coefficients. A logistic regression coefficient shows an increase (when bi > 0) or a decrease (when bi < 0) in the predicted logged odds for a one-unit increase in the corresponding independent variable. In the case that the independent variables Xa and Xb are dichotomous, their impact on the dependent variable is simply determined by comparing their regression coefficients ba and bb. By taking the exponent of both sides of the regression equation shown above, the equation can be given in another form of logistic regression:

$$\text{odds} = \frac{\text{p}}{1 - \text{p}} = \text{e}^{\text{b}\_0} \text{e}^{\text{b}\_1 X\_1} \text{e}^{\text{b}\_2 X\_2} \dots \text{e}^{\text{b}\_k X\_k} \tag{10}$$

From the given formula, it is evident that when a variable Xi increases by 1 unit, with all other parameters unchanged, the odds increase by the factor ebi:

$$\frac{\text{e}^{\text{b}\_i(1+X\_i)}}{\text{e}^{\text{b}\_i X\_i}} = \text{e}^{\text{b}\_i(1+X\_i) - \text{b}\_i X\_i} = \text{e}^{\text{b}\_i + \text{b}\_i X\_i - \text{b}\_i X\_i} = \text{e}^{\text{b}\_i} \tag{11}$$

This factor ebi is the odds ratio (O.R.) for the independent variable Xi, and it gives the relative amount by which the odds of the outcome increase (O.R. greater than 1) or decrease (O.R. less than 1) when the value of the independent variable is increased by one unit.
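Numerically, the interpretation of Equation (11) can be checked with an illustrative coefficient value (both the coefficient and the starting odds below are hypothetical):

```python
from math import exp

b_i = 0.35                     # hypothetical regression coefficient
odds_ratio = exp(b_i)          # O.R. = e^bi, Equation (11); about 1.419

odds_before = 0.20             # hypothetical odds p/(1 - p) at some value of Xi
odds_after = odds_before * odds_ratio   # odds after increasing Xi by one unit
print(round(odds_ratio, 4), round(odds_after, 4))   # 1.4191 0.2838
```

Since the odds ratio here is greater than 1, a one-unit increase in Xi raises the odds of the outcome, from 0.20 to about 0.284.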

Implementations of several methods for performing logistic regression can be found in statistical programs, of which IBM SPSS [88] is the most famous. This tool realizes three basic methods of binary regression: the enter method, the stepwise method and the hierarchical method. The enter method includes all the independent variables in the regression model together; the stepwise methods include two categories of regression procedures, forward selection and backward elimination; and in the hierarchical method, the researchers themselves determine the order of inclusion of independent variables in the model. Otherwise, all three methods are used to remove independent variables that are weakly correlated with the dependent variable. The authors use the standard enter method for the model proposed in this paper.

#### 2.1.3. Feature Selection Techniques

Classification methods of ML are sensitive to data dimensionality, and it has been shown that applying dimensionality reduction enables them to give better results. Selecting a suitable subset before the application of these methods finds a set of attributes which together achieve the best result.

Algorithms for feature subset extraction perform a space search based on candidate evaluation [89]. The optimal subset is selected when the search is complete. Some of the existing evaluation measures that have been shown to be effective in removing irrelevant and redundant features include the consistency measure [90] and the correlation measure [91]. The consistency measure seeks to find the minimum number of features that consistently separate the class labels into a complete set. An inconsistency is defined for two instances that have different class labels for the same feature values.


Embedded methods combine the qualities of the filter and wrapper methods; to this group belong, among others, ridge regression (as one technique for analyzing multiple regression data that suffer from multicollinearity) and different types of decision-tree-based algorithms, such as BoostedTrees, RandomForest, NBTree, and so on.

One of the free-to-use software tools with an option to perform feature selection, reducing the number of included attributes by applying different types of algorithms, is the already-mentioned Weka [80]. Because of that, this software was used to evaluate the proposed model on the selected case study. Practically, this evaluation results in determining the importance of the factors that influence the risk of a traffic accident, as well as in determining a prediction model using techniques such as regression and/or classification for these tasks.

Because the first two groups of methods are used in the model which the authors propose in this paper, they are briefly described hereinafter.

#### Filter-Ranker Methods

Filter models rely on the general characteristics of the data to evaluate features independently of the learning algorithm. For some data set D, the filter algorithm starts the search by initializing a subset S1 (the empty set, the full set, or a randomly selected subset) and searches the feature value space using a specific search strategy. Each generated subset S is evaluated against an independent measure and compared to the previous best. If it is found to be better than the previous best, it is considered the current best subset. The search continues until a previously defined stopping criterion is met. The output of the algorithm is the last best subset, and that is the final result. By changing the search strategy and evaluation measure, different algorithms can be implemented within the filter model. The feature selection process often uses the entropy measure as a characterization of the purity of an arbitrary collection of examples, considered as a measure of the system's unpredictability.

The entropy of Y is:

$$\mathbf{H}(\mathbf{Y}) = -\sum\_{\mathbf{y} \in \mathbf{Y}} \mathbf{p}(\mathbf{y}) \cdot \log\_2(\mathbf{p}(\mathbf{y})) \tag{12}$$
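Equation (12) is straightforward to evaluate; the sketch below uses two illustrative distributions to show the two extremes of unpredictability:

```python
from math import log2

def entropy(probs):
    """Entropy of a distribution, Equation (12); terms with p = 0 contribute 0."""
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: maximally unpredictable
print(entropy([1.0, 0.0]))   # 0.0 bits: fully predictable
```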

At the same time, feature selection methods differ in how they treat the problems of irrelevant as well as redundant attributes [93].

For the proposed model, the authors used the five filter algorithms briefly described below. Bearing in mind that entropy can be a criterion of impurity in a training set S, it is possible to define a measure reflecting the additional information about each Attribute that is provided by Class, namely the amount by which the entropy of the Attribute decreases [94]. This measure is named the information gain, abbreviated InfoGain, and it favors variables with more values.

InfoGain evaluates the worth of an Attribute according to Class using the following formula:

$$\text{InfoGain}(\text{Class}, \text{Attribute}) = \text{H}(\text{Class}) - \text{H}(\text{Class}|\text{Attribute}) \tag{13}$$

where H is the entropy of information. The information gained about an Attribute after observing Class is equal to the information gained about Class after observing the Attribute, i.e., the measure is symmetrical.

The information gain ratio, noted as GainRatio, is one so-called non-symmetrical measure that was introduced in the theory of feature selection to compensate for the bias of the already-described measure InfoGain [95]. GainRatio is one modification of the InfoGain that reduces its bias on different attributes and it is given with the following formula:

$$\text{GainRatio} = \frac{\text{InfoGain}}{\text{H(Class)}} \tag{14}$$

As given in Formula (14), to predict a variable Attribute, the InfoGain is normalized by dividing it by the entropy of Class, and vice versa. This normalization ensures that the GainRatio values always lie in the range [0, 1]. GainRatio = 1 means that the knowledge of Class completely predicts the Attribute, while GainRatio = 0 indicates that there is no relation between the Attribute and Class. The GainRatio favors variables with fewer values. Thus, for example, the decision tree classification algorithm C4.5 [96] uses the GainRatio criterion to select the attribute that should be at every node of the tree, while ID3 [97] uses InfoGain.
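The two measures can be computed by hand on a tiny example. The sketch below follows Equations (13) and (14) as defined in this text, on an illustrative data set of (attribute value, class) pairs:

```python
from collections import Counter
from math import log2

# Illustrative (attribute value, class label) pairs
data = [("wet", 1), ("wet", 1), ("wet", 0), ("dry", 0), ("dry", 0), ("dry", 0)]

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

h_class = entropy([c for _, c in data])

# H(Class|Attribute): entropy within each attribute value, weighted by size
h_cond = sum(
    len(sub) / len(data) * entropy(sub)
    for sub in ([c for a, c in data if a == v] for v in {a for a, _ in data})
)

info_gain = h_class - h_cond      # Equation (13)
gain_ratio = info_gain / h_class  # Equation (14), as defined in the text
print(round(info_gain, 3), round(gain_ratio, 3))   # 0.459 0.5
```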

FilteredAttributeEval is a classifier class for running an arbitrary attribute evaluator on data that has been passed through an arbitrary filter, where the filter structure is based exclusively on the training data. This evaluator handles nominal and binary classes with nominal, string, relational, binary and unary attributes, as well as missing attributes.

SymmetricalUncertAttributeEval is a classifier which evaluates the worth of an attribute by measuring the symmetrical uncertainty with respect to the class.

$$\text{SymmU}(\text{Class}, \text{Attribute}) = \frac{2 \cdot (\text{H}(\text{Class}) - \text{H}(\text{Class}|\text{Attribute}))}{\text{H}(\text{Class}) + \text{H}(\text{Attribute})} \tag{15}$$

This classifier handles nominal and binary classes, as well as missing class values, with nominal, binary and unary attributes, as well as missing attributes.

ChiSquaredAttributeEval is a classifier based on the chi-square test, used to test the independence of two events: for the given data on two variables, we can obtain the observed count O and the expected count E, and the Chi-Square measure shows how much the expected count E and the observed count O deviate from each other, as shown in Equation (16):

$$\chi^2\_{\mathbf{c}} = \sum\_{\mathbf{i}} \frac{\left(\mathbf{O}\_{\mathbf{i}} - \mathbf{E}\_{\mathbf{i}}\right)^2}{\mathbf{E}\_{\mathbf{i}}} \tag{16}$$

In Equation (16), c is the number of degrees of freedom, *Oi* is the observed value and *Ei* is the expected value, whereby the degrees of freedom refer to the total number of observations reduced by the number of independent constraints imposed on the observations. It should also be kept in mind that a random variable follows the chi-square distribution only if it can be written in the form of a sum of squared standard normal variables, as given in Equation (17):

$$\chi^2 = \sum\_{\mathbf{i}} Z\_{\mathbf{i}}^2 \tag{17}$$

where Z*<sup>i</sup>* are standard normal variables.

Degrees of freedom refer to the maximum number of logically independent values, which have the freedom to vary. In simple words, it can be defined as the total number of observations minus the number of independent constraints imposed on the observations.
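The statistic in Equation (16) is simply accumulated over the observation cells; the counts below are illustrative, with the expected counts taken as those under independence:

```python
# Illustrative observed counts and the counts expected under independence
observed = [18, 22, 30, 30]
expected = [25, 25, 25, 25]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # Equation (16)
print(round(chi2, 2))   # 4.32
```

A large value of the statistic relative to the chi-square distribution with the appropriate degrees of freedom indicates that the attribute and the class are not independent.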

#### Wrapper Methods

In the case of these learning methods, certain modeling algorithms are used in order to evaluate subsets of attributes with respect to their classification or predictive power. This is a computationally very demanding procedure due to the frequent execution of the ML algorithm: it is practically necessary to evaluate the performance of the corresponding model for each subset of attributes, and the total number of subsets grows exponentially as the number of attributes increases. For these reasons, different search techniques are used from the group of greedy techniques, which solve the problem by taking the best option available at each moment [98].

According to some classification frameworks [99], wrapper methods can be broadly classified, by the method of searching the set of attributes, into deterministic and randomized wrapper methods. The first subgroup, deterministic wrapper methods, uses either a complete strategy of attribute space search, which certainly gives the best results but is very time demanding, or sequential strategies, i.e., heuristic search. The second subgroup of wrapper methods consists of randomized methods, which rely on stochastic search approaches. For the proposed model, the authors chose and used five methods from the wrapper group that are implemented in the Weka software: from the deterministic subgroup, ExhaustiveSearch (the complete search strategy) and three methods with sequential or heuristic search (BestFirst, LinearForwardSelection and GreedyStepwise), and from the randomized subgroup, GeneticSearch; all of them were used with the CfsSubsetEval evaluator.

I.1 The first subgroup comprises the algorithms with full search; these algorithms usually show good results.

ExhaustiveSearch is the best-known algorithm from this subgroup; it conducts an exhaustive search through the complete space of attribute subsets, starting from the empty set of attributes, and at the end it reports the best subset found.

I.2 The second subgroup of deterministic search wrapper methods is the group of algorithms with sequential search techniques, which are the most-used wrapper algorithms; because of that, the authors primarily use algorithms from this subgroup in the proposed algorithm.

The BestFirst algorithm as a basic algorithm from this subgroup searches the space of attribute subsets by greedy hill climbing augmented with a backtracking facility. This algorithm may start with the empty set of attributes and search forward, or start with the full set, i.e., all attributes and search backward, or start at any point between those, and search in both directions.

The LinearForwardSelection algorithm is an extension of the BestFirst algorithm that takes a restricted number k of attributes into account. The fixed-set mode selects a fixed number k of attributes, whereas in the fixed-width mode k is increased in each step. The search uses the initial ordering to select the top k attributes, but can also use the ranking. The search direction is forward selection, or floating forward selection with optional backward search steps.

The SubsetSizeForwardSelection algorithm is an extension of the LinearForwardSelection algorithm.

GreedyStepwise performs a greedy search in both directions, forward or backward, through the space of attribute subsets. It may start with no attributes, with all attributes, or from an arbitrary point in the space, and it stops at the moment when the addition, i.e., deletion, of any remaining attribute results in a decrease in the evaluation. It can also produce a ranked list of attributes by traversing the space from one side to the other and recording the order in which attributes are selected.

II. Algorithms from the group of stochastic search of wrapper methods

The best-known algorithm from this group is the genetic algorithm, which the authors use in the proposed algorithm as a representative of this subgroup of methods. This algorithm belongs to a wider class of so-called population methods, i.e., evolutionary algorithms that use stochastic optimization. Genetic algorithms select only the initial population at random; in later steps, the selection procedure is strictly defined. The steps of the genetic algorithm are iteratively repeated until the desired target value is reached, i.e., until the stopping criterion of the algorithm is met.

As already mentioned, all of the wrapper methods used are applied in the proposed model with the CfsSubsetEval evaluator (correlation-based feature selection). This method ranks and selects attribute sets with a bias towards subsets containing features that are highly correlated with the class and, at the same time, uncorrelated with each other. The significance of attributes in this method is measured on the basis of the predictive ability of the attributes and their degree of redundancy.
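The greedy, merit-driven search common to these wrapper methods can be sketched as a forward selection loop. The merit function below is a simple stand-in for a real subset evaluator such as CfsSubsetEval, and the attribute scores are purely illustrative:

```python
def greedy_forward(attributes, merit):
    """Greedy forward selection: repeatedly add the attribute that raises the
    merit of the current subset the most; stop when no addition helps."""
    selected, best = [], merit([])
    while True:
        gains = {a: merit(selected + [a]) for a in attributes if a not in selected}
        if not gains:
            break
        a, m = max(gains.items(), key=lambda kv: kv[1])
        if m <= best:          # no remaining attribute improves the merit
            break
        selected.append(a)
        best = m
    return selected

# Toy merit: a couple of useful attributes, with a small penalty per attribute
useful = {"V7": 0.4, "V13": 0.3, "V15": 0.0, "V20": 0.0}
merit = lambda subset: sum(useful[a] for a in subset) - 0.05 * len(subset)
print(greedy_forward(list(useful), merit))   # ['V7', 'V13']
```

Backward elimination and bidirectional variants such as GreedyStepwise follow the same pattern, differing only in the starting subset and the allowed moves.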

#### 2.1.4. Ensemble Method for Prediction of Meteorological Impact on Occurrence of TI

It is known that, in ML, methods that aggregate several individual algorithms to achieve better results than any of the aggregated algorithms would achieve individually are called ensemble ML methods. To solve the prediction problem considered in this paper, the authors propose an ensemble algorithm whose procedure is given in Algorithm 1 and shown as a block schema in Figure 1.

#### **Algorithm 1**: Obtaining significant predictors of TI caused by atmospheric factors

1. Perform a logistic binary regression (Enter method) for a model in which n atmospheric factors are predictors and the dependent variable is the number of TI, logically determined by a threshold, which could be a value greater than 150% of the average daily number of TI for the considered case study; the variable has the nominal value 1 in that case and 0 in all others. We start the algorithm in the first cycle i = 1 with a referent value representing the number of attributes, noted in the starting step as n1 (in the concrete case study, n1 = 27). In the Enter method of binary regression, all of the predictors are included in the prediction; only in the presence of impermissible collinearity of certain predictors are they excluded from the model. After that, using the Cox and Snell R Square and Nagelkerke R Square tests, the algorithm determines the percentage of the variance that is explained, i.e., the connection between the tested factors and the dependent variable, and using the Hosmer and Lemeshow test, it determines the goodness-of-fit, i.e., the adaptation of the model to the given data (calibration), which evaluates the goodness of the proposed ensemble model in this and the later steps, including the most important last step of the proposed algorithm, in order to use the AUC as the quality measure of the binary regression classification model.

2. Apply a set of at least five classification methods belonging to different types of classification (for example, as already mentioned for the Weka software, any five, each from a different type: decision trees, Bayes, meta, rules, functions, MI, etc.) and find the two classification algorithms from this set that have the highest AUC values among the algorithms used (with good values of the other parameters, such as precision, recall and F-measure, as well). These classification algorithms will be used in the step that follows, in which attribute selection is carried out to select the best of several attribute selection algorithms from two different groups.

The values of the Hosmer and Lemeshow test and, even more significantly, the AUC values, which determine the threshold of whether the desired level of goodness of the model has been reached, take the values determined in step 1 and step 2 of this algorithm, respectively.

3. Five algorithms from each of the two groups of feature selection methods are used, with the basic aim of keeping the good and eliminating the bad characteristics of the classification algorithms in this ensemble:

3.1. Using at least five of the mentioned attribute selection algorithms from both the wrapper and the filter groups, explained more broadly in Section 2.1.3 of this paper, perform attribute selection with the classification of instances into one of the two possible classes defined in step 1 of this algorithm, according to the criterion of whether the daily number of TI exceeds or does not exceed the threshold.

3.1.1. Classifiers for filter attribute selection could be any five different algorithms, for example: Information Gain Attribute evaluation, Gain Ratio Attribute evaluation, Symmetrical Uncertainty Attribute evaluation, Chi-Square Attribute evaluation, Filtered Attribute evaluation, Relief Attribute, Principal Components, etc. The authors used the first five of these in this paper. The chosen algorithms are used to determine a feature subset { ... , ai−1, ai} and the attribute ranks from the starting set A = {a1, a2, ... , an}, i ≤ n, where n is the starting number of attributes, in such a way that the decision to exclude a particular attribute is made by the majority of the exclusion decisions made individually by each of the algorithms.

3.1.2. Classifiers for wrapper attribute selection can be any five from this group of algorithms, for example: Best First, Linear Forward Selection, Genetic Search, Greedy Stepwise, Subset Size Forward Selection, etc. The authors used the first five of these in this paper. The chosen algorithms are used to compute a subset { ... , aj−1, aj} from the starting set A = {a1, a2, ... , an}, j ≤ n, where n is the starting number of attributes, in such a way that the decision to exclude a particular attribute is made by the majority of the exclusion decisions made individually by each of the algorithms.

3.2. Determine the subset { ... , am−1, am} as the intersection of the subsets determined in steps 3.1.1 and 3.1.2, from the starting set A = {a1, a2, ... , an}, m ≤ i, j, n, where n is the starting number of attributes and i and j are the values determined in the previous steps 3.1.1 and 3.1.2 of the algorithm.


At the end of this step, we could have not only a different number of selected attributes from the two groups of attribute selection algorithms considered in 3.1.1 and 3.1.2, but possibly differently notated attributes as well; that is why we use the intersection operation on the two obtained subsets, which determines only the common attributes as those that will be removed from the initial set, i.e., in later cycles from the observed set A.

3.3. If m < n, as determined in the previous step 3.2, and the Hosmer and Lemeshow test determines the goodness of the algorithm as positive, the algorithm continues with the next step 4 using the set { ... , am−1, am} of attributes; otherwise, it finishes with the prediction determined by the existing number of parameters in the observed set.

4. Choose, among the five filter classifiers, the one producing the smallest number of attributes li with the highest AUC value, using for that the two classification algorithms already determined in step 2 of this algorithm.

5. Perform the binary regression Enter method again, now with the smaller number of attributes li selected in step 4 of this algorithm. If the values of the Hosmer and Lemeshow test are worse than those obtained in the previous test executed in step 3 of this algorithm, or the obtained number of attributes satisfies a value preset in advance, the procedure is finished; otherwise, the procedure continues cyclically with step 3 of this algorithm with a new referent value. The preset value of the number of selected attributes depends on the specific needs of each case separately; for the case study in this paper, the authors chose it as less than 15% of the starting number of attributes.

**Figure 1.** Block schema for the procedure which is described with Algorithm 1.
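The cyclic skeleton of steps 3 to 5 of Algorithm 1 can be sketched as follows. The attribute selection steps are represented by callables passed in as parameters; the selector functions and attribute names below are illustrative stand-ins for the SPSS and Weka procedures described above, and the regression and classification gates are omitted:

```python
def ensemble_selection(attributes, filter_select, wrapper_select,
                       target_fraction=0.15):
    """Cyclic attribute reduction in the spirit of Algorithm 1 (steps 3-5).
    Each selector returns the attributes it would exclude; only attributes
    flagged by BOTH groups are removed (the intersection of step 3.2)."""
    target = max(1, int(len(attributes) * target_fraction))
    current = list(attributes)
    while True:
        drop = set(filter_select(current)) & set(wrapper_select(current))
        if not drop:                    # m == n: nothing removed, stop
            return current
        current = [a for a in current if a not in drop]
        if len(current) <= target:      # preset stopping criterion of step 5
            return current

# Illustrative selectors: each flags the attributes it considers removable
attrs = [f"V{i}" for i in range(1, 28)]          # 27 starting attributes
keep_f = ("V7", "V13", "V15", "V20")
keep_w = ("V1", "V7", "V13", "V15", "V20")
filt = lambda s: [a for a in s if a not in keep_f]
wrap = lambda s: [a for a in s if a not in keep_w]
print(ensemble_selection(attrs, filt, wrap))     # ['V1', 'V7', 'V13', 'V15', 'V20']
```

In the real algorithm, the Hosmer and Lemeshow test and the AUC of the two best classifiers additionally gate each cycle, which this sketch omits for brevity.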

#### *2.2. Materials*

The determination of the weight coefficients applied in this study used data on atmospheric factors and daily traffic accidents for the City of Niš, Republic of Serbia, covering the period from 1992 to 2009. The atmospheric data used in this case study comprise twenty-seven variables. The data used in this study were derived from several sources: the atmospheric data were obtained from the Republic Hydro-meteorological Institute for 1992–2009, and the database of the number of daily traffic accidents for the same period was supplied by the Ministry of Interior of the Republic of Serbia. All of these data are given as a Supplementary File, in which the dependent variable is given as the twenty-eighth column of the Excel table, as shown in Table 2. In order to conduct the case study more efficiently, the data were organized on a daily level for the period of eighteen years which the authors consider in the case study of this paper.


**Table 2.** Atmospheric parameters used in case study.

#### **3. Results and Findings**

The prediction of the impact of meteorological factors on the appearance of traffic accidents is realized in this paper using the meteorological and traffic data related to the City of Niš, Republic of Serbia. The data cover the period from 1992 to 2009, from which twenty-seven meteorological variables are used, together with one available variable which represents the number of daily traffic accidents. The meteorological data used in this study were derived from the Republic Hydro-meteorological Institute, and the database of daily traffic accidents was supplied by the Ministry of Interior of the Republic of Serbia. All variables are given in Table 2, and the case study is realized with the data attached in the Excel table Mathematics-NovoTrafficAccidentsNaj1.

The authors had in mind that the basic aim of each prediction process is to create a model that, using a suitable combination of independent variables, draws conclusions for the dependent variable. Bearing in mind the task set in the research which is the subject of this paper, we prepared the data on daily traffic accidents in binary form. As mentioned in this paper, the dependent variable takes the value logical-true, i.e., binary 1, in the case that the number of daily traffic accidents is greater than 10, which is about 150% of the mean value for the considered period.

#### *3.1. Application of Proposed Algorithm of Ensemble Learning*

In the first step, according to Algorithm 1, a binary regression procedure using the SPSS 17 tool [88] was carried out on the available data. All 27 meteorological parameters are used as predictors, and the dichotomous variable of daily traffic accidents is used as the dependent variable.

The results of applied binary regression obtained are shown in Table 3.



Sig > 0.05 indicates that the data fit the model.

The results show that the logistic regression model using all 27 monitored meteorological factors explains the considered problem with 1.3 percent of the variance by Cox and Snell and 2.5 percent by Nagelkerke, which indicates an insignificant connection with the data (greater than 0 and less than 0.3) [100]. The Hosmer and Lemeshow test value of 0.143 indicates that the data fit the model (because Sig. > 0.05), which means that the model is well calibrated [101]; the model also retains all 27 parameters, none being excluded because of correlation. Given that 468 instances that caused an increased number of TI and 3522 that did not were identified in the examined sample, the accuracy of classification by random selection is (468/3990)² + (3522/3990)² = 0.7929, i.e., 79.29%, so the binary logistic regression model, with 88.3%, has a higher classification accuracy than the random selection model [102]. As the quality of the model is significantly determined by the value of the AUC [103], the value of that measure is determined in a separate following step of the proposed model.
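The random-selection baseline quoted above follows directly from the class proportions: a classifier that guesses each class with its empirical probability is correct with probability p² + (1 − p)²:

```python
pos, neg = 468, 3522        # instances with and without an increased number of TI
n = pos + neg               # 3990 instances in total
baseline = (pos / n) ** 2 + (neg / n) ** 2
print(round(baseline, 4))   # 0.7929, well below the model's 88.3%
```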

In the second step of the proposed Algorithm 1, five classification algorithms, each from a different type, were applied; those chosen by the authors for this purpose are the Naive Bayes, J48 decision tree, SMO, LogitBoost and PART algorithms. The 10-fold cross-validation test method was applied in the model estimation. The performance indicators of the five classification algorithms are given in Table 4, which shows that the LogitBoost and Naive Bayes classifiers achieved the most accurate prediction results, especially bearing in mind that the most important measure is the AUC value.


**Table 4.** Performance indicators—classification using all 27 parameters.

As presented in Table 4, the LogitBoost and Naive Bayes classifiers achieved the two highest AUC values at 0.547 and 0.541, respectively, along with similar values for the other classification measures, i.e., accuracy of 77.9% and 80.9%, recall of 88.1% and 82.7%, and F1 measure of 82.7% and 81.7%, respectively. This implies that one of these two classification algorithms will rank the predictors with the highest AUC value for the smaller attribute subset.
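The comparison in Table 4 was obtained with Weka; a scikit-learn analogue of the 10-fold cross-validated comparison by AUC can be sketched as follows (an assumption-laden sketch: `GaussianNB`, `DecisionTreeClassifier`, and `GradientBoostingClassifier` stand in for Weka's NaiveBayes, J48, and LogitBoost, and synthetic data stands in for the meteorological table):

```python
# Compare classifiers by mean 10-fold cross-validated AUC, as in step 2.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the 27-parameter meteorological data set.
X, y = make_classification(n_samples=500, n_features=27, random_state=0)

models = {
    "NaiveBayes": GaussianNB(),                                # Weka NaiveBayes analogue
    "J48": DecisionTreeClassifier(random_state=0),             # C4.5 analogue
    "LogitBoost": GradientBoostingClassifier(random_state=0),  # boosting analogue
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: AUC = {auc:.3f}")
```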

In step 3 of the proposed algorithm, attribute selection is carried out by searching and evaluating attribute subsets with two types of method: the filter type and the wrapper type.

#### 3.1.1. Filter

Filter feature subset evaluation methods were conducted with rank searching to determine the best attribute subset; they are listed as follows:


The ranks of the considered parameters obtained by the above methods on the training data are given in Table 5, where the four selected attributes are presented: V7, V13, V15 and V20.
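A filter ranker of this kind can be sketched in scikit-learn (an assumption: the paper used Weka rankers such as InfoGainAttributeEval, GainRatioAttributeEval, and SymmetricalUncertAttributeEval; mutual information plays a similar role here, and the data is synthetic):

```python
# Rank all attributes by a filter criterion and keep the top four, as in step 3.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=400, n_features=27, n_informative=4,
                           random_state=0)
scores = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(scores)[::-1]          # attribute indices, best first
top4 = sorted(ranking[:4].tolist())
print("top-4 attribute indices:", top4)
```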

**Table 5.** Feature selection using five filter ranker classifiers (smaller serial number represents bigger rank of factor).




#### 3.1.2. Wrapper

Wrapper feature subset evaluation methods were conducted without rank searching to determine the best attribute subset; they are listed as follows:


The results presented in Table 6 show that the same four attributes, V7, V13, V15 and V20, were selected using the five wrapper algorithms, as was the case with the five filter classifiers.



In substep 3.2 of the proposed algorithm, we determine the selected attributes as a set operation, the intersection of the subsets of attributes selected using the filter and wrapper methodologies. Based on the obtained results, we notice that in our case study this yields the same four attributes, V7, V13, V15 and V20, i.e., m = 4.

In substep 3.3, we determine that four attributes are selected, which is fewer than the initial 27 attributes in the model, and at the end of this substep, we check the goodness of the model used, whose results are given in Table 7.


**Table 7.** Results of the binary regression Enter method using the 4 selected attributes.

Sig > 0.05 indicates that the data fit the model.

The results show that the logistic regression model, taking into consideration the four selected meteorological parameters, explains the considered problem with the same classification accuracy of 88.3% as when all 27 parameters are used, with 0.6 percent of variance by Cox and Snell, i.e., 1.2 percent by Nagelkerke, without excluding any of these parameters because of correlation, and with a Hosmer and Lemeshow test value of 0.189, which is evidently better than the results obtained in step 1 when all 27 attributes were used in the regression model.

Additionally, we can see in Table 8 that the classification measure values for the two best classification algorithms are better than the results obtained in step 2 of this algorithm, presented in Table 4.

**Table 8.** Performance indicators obtained by the classification algorithms using 4 parameters.


The results given in Tables 7 and 8 clearly show that the two applied groups of filter and wrapper methodologies, with five specific algorithms each, reduced the dimension from 27 parameters to only four attributes while preserving the correctness of the model; because of that, we can continue with step 4 of the proposed Algorithm 1. Otherwise, the procedure would end here, with the dimensionality reduction undone.

In step 4 of the proposed Algorithm 1, we generate a diagram of AUC values depending on the number of attributes used for the best classification algorithm, LogitBoost, which was determined on the basis of the results given in Tables 4 and 8 and the results for each of the five chosen filter classifiers given in Table 5. The x-axis shows the number of attributes, and the y-axis shows the AUC value of the feature subset generated for each of the five filter classifiers. In this way, we determine whether the best AUC value can be obtained with fewer attributes than the four determined in step 3, using the ranking of attribute subsets obtained with the SymmetricalUncertAttributeEval (SU) classifier, where it should be noted that the GainRatio (GR) classifier gives the same ranking. The rank of each of the 27 attributes obtained with the SU and GR classifiers determines the order of elimination, starting with the attribute with the lowest, 27th, rank, and a corresponding AUC value is determined at each step using the LogitBoost classification algorithm. The diagram shown in Figure 2 clearly presents that three is the minimal number of attributes with the maximal AUC value among subsets smaller than the four determined in the previous step, also taking into account the other classification measures, and in this way the definitively chosen feature subset is determined.
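The backward elimination behind the diagram can be sketched as follows (a sketch under assumptions: `GradientBoostingClassifier` stands in for Weka's LogitBoost, mutual information stands in for the SU/GR ranking, and the data is synthetic with 10 attributes instead of 27):

```python
# Step 4 sketch: drop attributes one at a time from the bottom of a filter
# ranking and record the mean 10-fold AUC at each subset size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
order = np.argsort(mutual_info_classif(X, y, random_state=0))[::-1]

auc_by_k = {}
for k in range(len(order), 0, -1):          # 10 attributes down to 1
    Xk = X[:, order[:k]]
    clf = GradientBoostingClassifier(n_estimators=25, random_state=0)
    auc_by_k[k] = cross_val_score(clf, Xk, y, cv=10, scoring="roc_auc").mean()

best_k = max(auc_by_k, key=auc_by_k.get)    # smallest-AUC-maximizing size found
print("best subset size:", best_k)
```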

**Figure 2.** Diagram for determining maximum AUC value of classification for minimum number of attributes.

As we can conclude from the diagram presented in Figure 2 and the results determined with the best filter classifiers, SU and GR, given in Table 5, in this step of the algorithm we obtain a further decrease of the selected attributes to the following three, which will be included in the prediction formula: V13-Relative humidity at 14 o'clock in percent, V7-Daily temperature amplitude in °C, and V20-Mean daily wind speed in m/s.

The LogitBoost classification algorithm evidently shows the best results in each of the measures, including the AUC value, for the reduced set of the three mentioned attributes, as given in Table 9.

**Table 9.** Result evaluation of LogitBoost classification using all 27, 4 and 3 parameters.


In the last step 5 of Algorithm 1, a logistic regression is carried out, as in steps 1 and 3, to check the goodness of the model with 3 parameters selected in the previous step 4, and the results are given in Table 10.


**Table 10.** Results of the applied logistic regression with the selected subset of 3 parameters.

Sig > 0.05 indicates that the data fit the model.

The results show that the logistic regression model, taking into consideration the three selected meteorological parameters, explains the considered problem with 0.6 percent of variance by Cox and Snell, i.e., 1.2 percent by Nagelkerke, without excluding any of these parameters because of correlation, and with a Hosmer and Lemeshow test value of 0.234, which is evidently better than the results obtained in step 3 when four attributes were used in the regression model. Because of that, we could continue with step 3 of the proposed Algorithm 1 to check for a possible further decrease of attributes; otherwise, the procedure would end with the dimensionality reduction achieved to this point. However, at the end of this last step, before continuing the algorithm with step 3, it is obligatory to check whether the obtained number of attributes has reached the value preset in advance. This is the case in our paper: the reduced number of three important attributes is smaller than the preset threshold value of four attributes, so this fact implies the exit from the procedure in our case study.
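The Cox and Snell and Nagelkerke pseudo-R² measures reported by SPSS follow standard formulas from the null and fitted log-likelihoods; a sketch (the numeric log-likelihoods below are illustrative, not taken from the paper):

```python
import math

# Cox & Snell: R² = 1 − exp(2·(LL0 − LL1)/n); Nagelkerke rescales it so the
# maximum attainable value is 1.
def cox_snell_r2(ll_null: float, ll_model: float, n: int) -> float:
    return 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)

def nagelkerke_r2(ll_null: float, ll_model: float, n: int) -> float:
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / max_r2

# Illustrative values: a small log-likelihood improvement over the null model
# on n = 3990 observations gives a pseudo-R² on the same small scale as in the
# tables above.
print(round(cox_snell_r2(-1450.0, -1438.0, 3990), 3))  # → 0.006
```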

For the concrete considered case study in this paper, the predictive formula is as follows:

$$-2.896 + 0.025\text{V}7 + 0.015\text{V}13 - 0.191\text{V}20 \ge 10\tag{18}$$
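With the stated meaning of the variables, the prediction formula can be written as a small predicate (a sketch; the coefficients and the threshold of 10 are transcribed from Equation (18) as printed):

```python
# Equation (18) as a function; returns True when the model predicts an
# increased risk of traffic incidents for the day.
def predicts_increased_ti(v7: float, v13: float, v20: float) -> bool:
    """v7: daily temperature amplitude (°C); v13: relative humidity at
    14 o'clock (%); v20: mean daily wind speed (m/s)."""
    return -2.896 + 0.025 * v7 + 0.015 * v13 - 0.191 * v20 >= 10

# Example day: amplitude 12 °C, humidity 85%, wind 2 m/s.
print(predicts_increased_ti(12, 85, 2))
```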

#### *3.2. The Model of Emergent Intelligence as One Implementation of the Proposed Ensemble Method*

As the authors already mentioned in the introduction of Section 2, Materials and Methods, this paper describes not only the newly proposed model but also its implementation as one of the agents in a multi-agent system based on the emergent intelligence technique (EIT) for the purpose of a citizen warning system (Figure 3). In this respect, let us denote with T the task of warning those in a region or big city who are interested in the meteorological parameters that conditions have arisen which increase the possibility of traffic accidents in that concrete place. The task is performed on the basis of measuring the values of all parameters included in the proposed model in real time and obtaining historical data of those values from specialized electronic data sources. In carrying out the set task, it is obligatory to use suitable prediction models, such as the model proposed in Section 2.1.4 of this paper, together with the data in real time.

**Figure 3.** EIT two-agent system for generating a warning of the possibility of traffic accidents.

That is why we divide the set task T into two subtasks for the model of the two-agent EIT system: subtask T1, which determines the warning of the possibility of increased traffic accidents based on a prediction from historical data, already described in Section 2.1.4, using the proposed ensemble model of ML and the prediction formula given in Equation (18); and subtask T2, which determines the existence of that possibility based on exceeding or undershooting pre-set values for some of the most important meteorological parameters in real time, such as temperature (≤4 or ≥30 °C), precipitation (≥40 mm), snowfall (>0 mm), and visibility (≤100 m). In the proposed two-agent EIT system in Figure 3, a decision matrix realizes the warning alarm in the node of the EIT where the main task T is solved, using the already-solved agent tasks T1 and T2. This matrix is given in Table 11 and generates the red alarm in the case that both agents T1 and T2 give a warning; the yellow alarm is generated if only one of them gives a warning, while there is no warning if neither of them gives a warning.
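The decision matrix of Table 11 reduces to a three-valued function of the two agent warnings; a minimal sketch:

```python
# Table 11 decision matrix: red when both agents warn, yellow when exactly
# one warns, green (no alarm) otherwise.
def eit_alarm(t1: bool, t2: bool) -> str:
    if t1 and t2:
        return "red"
    if t1 or t2:
        return "yellow"
    return "green"

print(eit_alarm(True, True), eit_alarm(True, False), eit_alarm(False, False))
# → red yellow green
```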


**Table 11.** Decision matrix of EIT for generating a warning of the possibility of traffic accidents.

#### **4. Technological Implementation of the Proposed Ensemble Model**

The technical implementation of the proposed solution comprises the considered two-agent EIT system, with an additional indication of the possibility of future implementations in a more complex, multidimensional agent system; some specific types of parameters that could and should be included in such a system are also listed. The proposed technical solution is considered through two subsections in this paper: architecture and implementation.

#### *4.1. The Architecture of the Proposed Technical Solution*

The proposed technical solution is a client-server architecture which uses Firebase as the cloud messaging service; Firebase can also serve as a real-time database in its Backend-as-a-Service application development platform. In this architecture, Firebase connects user applications on the client side with the server application on the server side, which consists of four modules denoted Agent 1, Agent 2, EIT, and the notification module.

The user application works on the client side of this solution. This client application works with different mobile operating platforms such as iOS and Android, Android Auto, and Google Assistant driving mode, and likewise with Apple devices and CarPlay systems; in the proposed solution, the authors realized it on iOS. During installation, the application requests permission from the user to track the location in the background. Then, the server application, specifically the notification module, provides a list of hydro-meteorological stations with their geolocations, as well as data on topic names for the defined alarms. Since the topic, among the other mentioned parameters, is made up of the name of the hydro-meteorological station, the nearest station is determined based on the current geolocation of the user and the geolocations of the hydro-meteorological stations. After selecting a hydro-meteorological station, users fill in information about the type of alarm they are interested in, more precisely, what type of vehicle they drive and whether they wear glasses. Based on this data, a topic is created, to which the user application subscribes on Firebase. Furthermore, the application monitors location changes, and with each change determines whether there is a station closer than the currently selected one; when this happens, the application unsubscribes from the previous topic and subscribes to the new one. Additionally, when the notification module sends a notification for a topic for which the client has registered, that notification is displayed to the user.
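The nearest-station step can be sketched as follows (a sketch under assumptions: the station names and coordinates are illustrative, and great-circle distance via the haversine formula stands in for whatever distance measure the application actually uses):

```python
import math

# Pick the hydro-meteorological station nearest to the user's geolocation.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_station(user, stations):
    """user: (lat, lon); stations: list of (name, lat, lon) tuples."""
    return min(stations, key=lambda s: haversine_km(user[0], user[1], s[1], s[2]))

stations = [("Nis", 43.32, 21.90), ("Belgrade", 44.79, 20.45)]
print(nearest_station((43.30, 21.90), stations)[0])  # → Nis
```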

The notification module provides the client application with data about hydro-meteorological stations and their locations, as well as the other options for determining the topic. Additionally, when the EIT module on the server decides so, the notification module addresses Firebase and forwards a notification to all users subscribed to the defined topic.

Agent 1 has a database of historical data that it uses to generate an alarm according to the prediction formula produced by the prediction model proposed in this paper, which takes meteorological conditions into account. In this way, it decides whether to raise the alarm denoted T1. The historical data is updated by the Hydro-meteorological Institute of the Republic of Serbia and, for the number and place of accidents (city, i.e., road number), by the Ministry of Interior of Serbia. The data is available to clients and official members of the Ministry of Interior, in which case the EIT generates a report covering the cases of wearing glasses or not and of driving a truck or a car, i.e., all four variants, and gives such a report to an official person.

Agent 2 decides whether to raise the alarm based on the defined rules and the current situation. It generates the alarm T2 as a logical function of meeting the given meteorological conditions shown in the figure for temperature, rain, snow, wind, and fog, as well as the type of vehicle (truck or car) and visibility.

The EIT module addresses the agents at a defined time and takes answers from them. Based on the answers received, it forwards an alarm to the notification module for groups that need it, which then forwards notifications to Firebase.

#### *4.2. The Implementation of Proposed Technical Solution*

The implementation of the proposed technical solution, which is based on the diagram given in Figure 4, is realized with the attached program codes for each part of the proposed EIT system separately: the server (TrafficIncidents-master) and the client application (TrafficAccidentPrevention). The software implementation of the proposed technical solution is realized using Python as a widespread software platform (see Algorithm 2).

**Figure 4.** EIT two-agent system implementation—generating a warning of the possibility of traffic accidents.

```
Implementation of the Agent 1 algorithm that generates alarm T1
    Data:
    V7 is the daily temperature amplitude in degrees Celsius
    V13 is the relative humidity at 14 o'clock in percent
    V20 is the mean daily wind speed in m/s
    if (−2.896 + 0.025·V7 + 0.015·V13 − 0.191·V20 ≥ 10)   // Equation (18)
        Alarm T1 = 1
    else
        Alarm T1 = 0

Implementation of the Agent 2 algorithm that generates alarm T2
    Data on the current hydro-meteorological situation:
    temp - current temperature, INT
    fog - presence of fog, BOOLEAN
    wind - wind speed in m/s, INT
    snowfall - is it snowing, BOOLEAN
    rain - is it raining, BOOLEAN
    cloudiness - is it cloudy, BOOLEAN
    User data:
    trucks - whether they drive a truck or a bus, BOOLEAN
    cars - whether they drive a car, BOOLEAN
    farsightedness - whether they wear glasses while driving, BOOLEAN
    if (
        (temp ≤ 4)
        or (temp ≥ 30)
        or (fog and farsightedness)
        or (rain)
        or (wind ≥ 50 and trucks)
        or (wind ≥ 65 and cars)
        or (snowfall)
        or (cloudiness and farsightedness)
    )
        Alarm T2 = 1
    else
        Alarm T2 = 0
```
**Algorithm 2:** Implementation of the EIT algorithm that generates the alarm EITalarm from the T1 and T2 agent alarms

**if (T1 = 1 and T2 = 1) Red alarm else if (T1 = 1 or T2 = 1) Yellow alarm else Green alarm (no alarm)**

#### **5. Conclusions**

The authors had two main aims in this paper, directly connected with proving the two set hypotheses. The results of the research with the proposed ensemble method, which aggregates five methods from different classification groups of algorithms and a binary regression algorithm, confirmed the first hypothesis: it is possible to aggregate several classification methods and several feature selection methods into one ensemble method with better characteristics than each individual method applied alone to the same task. Thereby, each used classification method of ML belongs to a different type of classification algorithm, and each attribute reduction algorithm belongs to a different type of feature selection algorithm. The authors also answered the second hypothesis and its question: is it possible for such an obtained ensemble method to be implemented in a multi-agent system? They did so by proposing a technological system supported by emergent intelligence as a good framework for the implementation of the proposed model defined by the described algorithm.

The authors confirmed these two hypotheses using the results obtained in the case study conducted on the data for the City of Niš in the Republic of Serbia, evaluated using 10-fold cross-validation for each of the applied algorithms in the Weka software.

The proposed model has not demonstrated significant limitations so far. In future work on this topic, the authors will examine the inclusion of a greater number of types of classification and feature selection algorithms and the inclusion of n-modular redundancy in the construction of the proposed ensemble algorithm. Moreover, the authors will also consider the implementation of the proposed model in multi-agent systems with more than two agents based on emergent intelligence technology, both for obtaining better prediction models for TI and for solving similar prediction problems in different fields of human life.

**Supplementary Materials:** The following supplementary materials which are mentioned and used in this paper are available online at http://www.diplomatija.com/nastavni-kadar/prof-dr-draganrandelovic/mathematics-work-in-progress/ (accessed on 15 December 2022).

**Author Contributions:** A.A.: resources, data curation, funding acquisition, software, validation; M.R.: investigation, writing—original draft, formal analysis, project administration; D.R.: conceptualization, methodology, writing—review and editing, visualization, supervision. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors thank the Faculty of Diplomacy and Security, University Union Nikola Tesla, Belgrade, Republic of Serbia for their support in the publishing of this paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

**Snezhana Gocheva-Ilieva \*, Atanas Ivanov, Hristina Kulina and Maya Stoimenova-Minova**

Faculty of Mathematics and Informatics, Paisii Hilendarski University of Plovdiv, 24 Tzar Asen St, 4000 Plovdiv, Bulgaria

**\*** Correspondence: snow@uni-plovdiv.bg

**Abstract:** In this study, a novel general multi-step ahead strategy is developed for forecasting time series of air pollutants. The values of the predictors at future moments are gathered from official weather forecast sites as independent ex-ante data. They are updated with new forecasted values every day. Each new sample is used to build a separate single model that simultaneously predicts future pollution levels. The sought forecasts were estimated by averaging the actual predictions of the single models. The strategy was applied to three pollutants—PM10, SO2, and NO2—in the city of Pernik, Bulgaria. Random forest (RF) and arcing (Arc-x4) machine learning algorithms were applied to the modeling. Although there are many highly changing day-to-day predictors, the proposed averaging strategy shows a promising alternative to single models. In most cases, the root mean squared errors (RMSE) of the averaging models (aRF and aAR) for the last 10 horizons are lower than those of the single models. In particular, for PM10, the aRF's RMSE is 13.1 vs. 13.8 micrograms per cubic meter for the single model; for the NO2 model, the aRF exhibits 21.5 vs. 23.8; for SO2, the aAR has 17.3 vs. 17.4; for NO2, the aAR's RMSE is 22.7 vs. 27.5, respectively. Fractional bias is within the same limits of (−0.65, 0.7) for all constructed models.

**Keywords:** air pollution; machine learning; random forest; arcing; ARIMA errors; MIMO averaging strategy; multi-step ahead prediction; unmeasured forecast

**MSC:** 62-07; 62P12

#### **1. Introduction**

Air pollution is a major and worsening environmental problem in many countries worldwide. The systematic accumulation of harmful aerosols in the air of populated areas is the cause of many diseases among their inhabitants. It leads to undesirable changes in the climate, forests, land, and all vital ecological systems [1,2]. Pollutants dangerous for human health include particulate matter, in particular PM10 (with a diameter of less than 10 microns) and PM2.5 (with a diameter of less than 2.5 microns), as well as nitrogen dioxide (NO2), sulfur dioxide (SO2), ground-level ozone (O3), and others. Numerous studies have established the harmful influence of elevated concentrations of pollutants in ambient air, leading to heart disease, acute respiratory infection, chronic obstructive pulmonary disease, allergic dysfunction, lung cancer, and more [3,4]. Even at low concentrations, the presence of a constant background of polluted air is dangerous, primarily for small children and the elderly, as well as for the chronically ill [5]. The leading causes of poor air quality in populated areas can be conditionally divided into two large groups. On the one hand, low-quality air is a product of anthropogenic sources of increased concentrations of pollutants due to human activity, such as production facilities, power plants, car traffic, household combustion, and others [1,6]. This type of air pollution source has a relatively constant character. Weather and atmospheric conditions are the other major factor affecting air quality. With the adverse trend of climate change, they are becoming increasingly chaotic and unsustainable.

**Citation:** Gocheva-Ilieva, S.; Ivanov, A.; Kulina, H.; Stoimenova-Minova, M. Multi-Step Ahead Ex-Ante Forecasting of Air Pollutants Using Machine Learning. *Mathematics* **2023**, *11*, 1566. https://doi.org/10.3390/ math11071566

Academic Editor: Ioannis G. Tsoulos

Received: 14 February 2023 Revised: 6 March 2023 Accepted: 21 March 2023 Published: 23 March 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Air pollution forecasting is a non-trivial task in which the atmospheric pollution concentrations for a given location and time are expected to be predicted based on existing data measurements of various factors. For research purposes, they are divided into global, regional, and local. From the point of view of the individual in the society, what is of practical value is the impact of local factors, more specifically, a given settlement, and taking into account the local climatic, geographical, industrial and other types of characteristics and factors affecting the degree of pollution and the consequences related to human health [7]. Standard solutions for this task produce computer numerical models for simulating atmospheric chemical composition and atmospheric dispersion modeling systems based on mathematical and chemical equations describing pollutant transport and diffusion chemical processes [8–11]. The practical implementation of this type of numerical model presupposes the presence of detailed input information on current air quality, monitored by local stations, remote sensing, forecasted weather conditions, data on the geographical terrain, and others. These models are complex, and their creation requires significant computing resources.

A promising alternative to numerical modeling is a large group of methods for environmental pollution modeling inspired by machine learning (ML). ML allows the construction of effective predictive models that can be easily implemented into mobile applications. Popular classical statistical methods for linear type time series include multiple linear regression (MLR), nonlinear regression, parametric type stochastic methods, Auto-Regressive Integrated Moving Average (ARIMA), GARCH, and many of their variants [12–20]. A recent large-scale, extensive study using spatial interpolation and MLR found close air pollutant-meteorological interactions in China [12]. Another study also focused on the importance of meteorological conditions on air pollution [13]. In one study [14], univariate SARIMA models were built with intervention variables to reflect the outliers for PM2.5 and PM10. The MLR is implemented in [15] for PM10, depending on temperature changes. In another study [16], the MLR, Loess seasonal and trend decomposition with ARIMA, and SARIMA models are built and compared in the forecast of PM10. Other studies have incorporated the influence of atmospheric factors and pollutant modeling with MLR, ARIMA, and integrated ARFIMA [17,18]. The hybrid ARIMA-GARCH models of PM10 concentrations were used in another study [19]. It should be noted that in the presence of the linear nature of the studied time series, linear methods can provide adequate and sufficiently good prediction without yielding to advanced ML algorithms.

In recent years, more and more studies have used high-performance ML techniques capable of extracting the hidden relationship between the input data and the regression prediction target as well as predicting with great accuracy empirical data of any kind. The main methods of this class are Neural Network (NN) regression, Recurrent Neural Network (RNN), Multilayer Perceptron (MLP), Deep Learning, Random Forest (RF), Classification and Regression Trees (CART), Multivariate Adaptive Regression Splines (MARS), Support Vector Machine (SVM), Genetic Algorithm (GA), and more, including hybrid ones. The complex relationships among air pollutants and meteorology are quantitatively revealed in [21] using RF analysis. Additionally, RNN-based accurate forecasts of pollutants, including SO2, NO2, CO, PM2.5, PM10, and O3, were obtained. An ensemble approach is applied in [22] for predicting fine particulate matter (PM2.5) in the greater London area. Many models, including ensemble RF, bagging, and additive regression, were built and compared to single models with SVM, MLP, linear regression, and regression trees [23] to predict NO2 concentration levels. It has been shown that ensemble models statistically outperform other models. Four ANN ensemble models and an innovative Fuzzy Inference Ensemble (FIE) model were proposed in [24] and are capable of estimating the concentration levels of O3, CO, NO, NO2, SO2, PM10, and PM2.5. In [25], a stacked ensemble model is developed for forecasting the daily average concentrations of PM2.5 in Beijing, China, based on levels of other pollutants and meteorological data. The base models in the stacking strategy are built with LASSO, AdaBoost, XgBoost, and GA-MLP. Their predictions are stacked using SVM. It has been shown that the resulting stacked model outperforms all base models.

In [26], a hybrid forecasting model was developed by incorporating the Taylor expansion to correct the residuals of traditional ANN and SVM models based only on the local meteorological data used as input variables. The experimental results of forecasting the average daily concentrations of PM10 and SO2 have shown that the forecasting accuracy of the proposed model is very satisfactory. Other studies in the class of ML methods considered are [27–30]. There are other approaches to improving regression models' accuracy, particularly residual correction using ARIMA, as described in [31]. This approach is also suitable for ML predictive models, as demonstrated in [32,33]. Additional information on ML approaches and algorithms in the field can be found in review papers [34,35].

The primary purpose and application of regression models is to forecast over a period of time called the forecasting horizon; that is, the forecasting horizon is the length of time into the future for which forecasts are to be determined. The preponderance of research evaluates forecasts against known historical data held outside of the working samples. The criteria for the accuracy and other qualities of the forecast for the selected horizon differ; there are no generally accepted standards or theoretical results here. In short-term prediction, the tested and selected model is usually used once to predict the concentration levels for a fixed short horizon. Multi-step ahead (long-term) forecasting strategies are applied much less often. In this case, forecasting is done in successive steps, with the forecasting horizon shifted forward in time, either step by step or all at once with several steps.

However, in an actual situation, the researcher usually does not have the necessary information and the exact values of the independent factors (predictors) for the regression models since they will be measured in the future. When forecast data is used for the predictors, we talk about ex-ante ahead forecasting. It is natural to expect that the model's predictions will be affected by the corresponding uncertainty of these data. Evaluating the capabilities of ex-ante forecasting models is an open research problem, particularly for advanced ML approaches, to which this paper is devoted.

This study aims to develop a new multi-step ex-ante forecasting strategy for multivariate time series based on ML methods. The goal is to build and analyze models for predicting future pollution based on historical data and standard weather forecasts for *h-*days horizons. Moreover, for each subsequent day of a given horizon, the values of the pollutants and meteorological time series are replaced with the actually measured ones. Also, the weather forecasts are updated for the entire next horizon. Another main objective is to statistically investigate and compare the predictive abilities of two powerful ensemble tree machine learning methods—RF and Arcing (Arc-x4). The proposed real-type prediction approach can be classified as a generalization of the multi-input multi-output (MIMO) strategy, extending it in several aspects. This includes a new formula for calculating the final forecasts by averaging the forecasted values from the current single MIMO model and actual previous single models, the use of independent external forecasts for the predictors and lagged variables, and the implementation of five different statistical measures for evaluating and comparing the obtained results and the accuracy of models.
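The core of the extended MIMO strategy, averaging the forecasts for a given day across the current and previous single models whose horizons cover it, can be sketched as follows (a minimal sketch of the averaging idea; the paper's exact averaging formula may differ):

```python
# Each day a new single MIMO model forecasts the next h days; the final
# forecast for a target day averages all available predictions for that day.
def average_overlapping_forecasts(forecasts):
    """forecasts[i] is the h-step forecast issued on day i, covering days
    i+1 .. i+h. Returns {target_day: averaged forecast}."""
    buckets = {}
    for i, f in enumerate(forecasts):
        for step, value in enumerate(f):
            buckets.setdefault(i + 1 + step, []).append(value)
    return {day: sum(v) / len(v) for day, v in buckets.items()}

# Two single models with horizon h = 3; their forecasts overlap on days 2-3.
f = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
avg = average_overlapping_forecasts(f)
print(avg[2])  # day 2 is covered by both models: (2.0 + 2.0) / 2 → 2.0
```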

The main advantage of the developed strategy is the averaging of already obtained forecasts, which, to some extent, directly models the existing relationships between the members of the forecasted time series. This refers to lagged-variable-type relationships and internal dependencies that characterize every real-world time series. Another advantage is the minimal computational cost once the ML models are built. A drawback of the proposed approach is the possible accumulation of errors when averaging the predictions of single MIMO models from the current and previous horizons. In our case, however, the considerable randomness of the predictors, which is typical of real-world settings, compensates for such errors, so they do not significantly affect the good final results, and the bias remains stable. Another difference from the standard MIMO strategy is that it uses all historical time series data, not just a fixed sliding data window.

The rest of the paper is organized as follows: Section 2 briefly introduces the concepts and reviews the literature on multi-step ahead forecasting strategies. Section 3 describes the framework of the proposed multi-step ahead forecasting approach, the model assumptions, and brief information on the methods and statistical measures used. The next section presents the study area, experiment data, the results of the application of the approach for three real-world air pollutants, data preprocessing, construction, investigation, and comparison of models. The last section discusses the study's main findings and draws conclusions.

This research is a part of the cloud Internet of Things (IoT) platform EMULSION [36].

#### **2. Concepts of Multi-Step Ahead Strategies and Literature Review**

The purpose of multi-step-ahead prediction is to forecast *h* values {*y*ˆ*N*+1, ..., *y*ˆ*N*+*h*}, where *h* is the forecast horizon (*h* > 1), based on known historical data {*y*1, *y*2, ..., *yN*} of the target time series *Y*. According to [37], multi-step ahead strategies can be classified into three types: multi-stage or recursive prediction, direct or independent value prediction, and parameter prediction. However, more recent publications have updated this classification to five types [38–40]. All these strategies use a fixed number *D* of historical data points, where *D* is called the embedding size. The five strategies are Recursive, Direct, direct Recursive (DirRec), multi-input multi-output (MIMO), and direct MIMO (DIRMO) [38–40]. Some generalizations of these strategies are also discussed in [38], including lazy learning and some averaging algorithms for models built with these five strategies for the same horizon; the latter can be considered stacking models.

• Recursive Strategy

One of the most common approaches is recursive prediction (Rec). In the Rec strategy, the constructed time series model is applied *h* times sequentially as a one-step-ahead forecasting procedure. Initially, the time series data {*yN*+1−*D*, ..., *yN*} are used to predict *y*ˆ*N*+1. To predict the next value *y*ˆ*N*+2, the data {*yN*+2−*D*, ..., *yN*, *y*ˆ*N*+1} are used, etc. It is known that this strategy can accumulate errors and is therefore appropriate for relatively short forecasts (3–7 steps ahead). This is because the bias and variance from previous time steps are propagated into future predictions, as established in [37] for ARIMA-type models. However, the recursive strategy has been successfully applied to real-world time series with different ML algorithms (see [38]).
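As a minimal sketch (not the authors' code), the Rec strategy can be illustrated in Python with a toy one-step model; the names `recursive_forecast` and `mean_model` are hypothetical:

```python
def recursive_forecast(history, one_step_model, D, h):
    """Rec strategy: apply a one-step-ahead model h times,
    feeding each forecast back in as if it were an observation."""
    window = list(history[-D:])          # the last D known values
    forecasts = []
    for _ in range(h):
        y_hat = one_step_model(window)   # one-step-ahead prediction
        forecasts.append(y_hat)
        window = window[1:] + [y_hat]    # slide the window forward
    return forecasts

# Toy stand-in for a trained model: the mean of the last D values.
mean_model = lambda w: sum(w) / len(w)
print(recursive_forecast([1.0, 2.0, 3.0, 4.0], mean_model, D=2, h=3))
# → [3.5, 3.75, 3.625]
```

The feedback of `y_hat` into the window is exactly the mechanism through which bias and variance from earlier steps propagate into later forecasts.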

• Direct Strategy

In the direct (Dir) prediction strategy, a separate model is built for each subsequent prediction *y*ˆ*N*+*i*, *i* = 1, 2, ..., *h*, using the same observations {*yN*+1−*D*, ..., *yN*}. The number of models therefore equals the number of prediction steps in the horizon. The Rec and Dir strategies are applied and compared for MLR, RNN, and hybrid HMM/MLR models on many different datasets in [37]. The authors concluded that the most accurate results were obtained using the direct prediction strategy.
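A hedged sketch of the Dir strategy, with hypothetical names (`direct_forecast`, `mean_fit`) and a toy "fit" routine standing in for a real learner:

```python
def direct_forecast(history, fit, D, h):
    """Dir strategy: train one separate model per step i = 1..h, each
    mapping the last D observations directly to the value i steps ahead."""
    forecasts = []
    for i in range(1, h + 1):
        X, y = [], []
        for t in range(D, len(history) - i + 1):
            X.append(history[t - D:t])      # embedding of size D
            y.append(history[t + i - 1])    # target i steps ahead
        model_i = fit(X, y)                 # i-th single model
        forecasts.append(model_i(history[-D:]))
    return forecasts

# Toy "fit": ignore X and predict the mean of the training targets.
mean_fit = lambda X, y: (lambda w: sum(y) / len(y))
print(direct_forecast([1.0, 2.0, 3.0, 4.0, 5.0], mean_fit, D=2, h=2))
# → [4.0, 4.5]
```

Note that no forecast is fed back as an input, which is why Dir avoids the error accumulation of Rec at the price of training *h* models.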

• DirRec Strategy

DirRec is a combination of the Dir and Rec approaches. A separate model based on the data {*yN*+1−*D*, ..., *yN*, *y*ˆ*N*+1, ..., *y*ˆ*N*+*i*−1} is generated to predict each new value *y*ˆ*N*+*i*, *i* = 1, 2, ..., *h*, of the horizon *h*. Note that the embedding size *D* differs from that of the other strategies.

• MIMO Strategy

The multi-input multi-output (MIMO) strategy involves building a single model with the data {*yN*+1−*D*, ..., *yN*} to predict {*y*ˆ*N*+1, ..., *y*ˆ*N*+*h*} at once. Thus, the forecasts for the entire horizon are obtained in a single step.

• DIRMO Strategy

Direct MIMO (DIRMO) is a combination of the Dir and MIMO approaches. For this purpose, the horizon *h* is decomposed into several parts (blocks), and a MIMO strategy is applied to each block, using the same data {*yN*+1−*D*, ..., *yN*}.

The five strategies described above have diverse applications with many ML methods. Various US economic time series were modeled in [41] using the Rec and Dir strategies. A hybrid system to generate multi-step deterministic and probabilistic forecasts is proposed in [42]. A complex of five different ML algorithms is utilized: wavelet packet decomposition (WPD), gradient-boosted regression trees (GBRT), linear programming boosting (LPBoost), MLP, and the Dirichlet process mixture model. The models were used to predict PM2.5 concentrations over 1- to *h*-interval data. The Dir strategy for 1, 2, and 3 steps ahead is applied based on historical test values. Similar results were obtained in [30], where, in addition, the 1- to *h*-interval results were aggregated to a lower resolution of 1 day, which naturally improved the predictive ability of the models. In [43], the Rec and Dir prediction strategies with RF models are compared for forecasting wind speed 1 to 6 hours ahead. The Dir approach is employed in [44] for predicting Spanish electricity consumption data covering 10 years, measured at a 10-minute frequency. The forecasts were obtained using decision trees, GBRT, and RF algorithms with subsequent stacking.

The five commonly used strategies described above, along with ARIMA and MLP for preliminary forecasting of the independent time series, were applied and compared for daily PM2.5 forecasts for the next 10 days in [39]. A recent study [40] developed a complex ensemble multi-step ahead forecasting system based on the same five strategies. Least Squares Support Vector Regression (LSSVR) and Long Short-Term Memory neural networks (LSTM) are employed as the prediction tools. These are combined separately and compared with the Ensemble Empirical Mode Decomposition (EEMD) technique, boosting, and stacking to obtain forecasts from 1 to 10 days ahead. More results and a literature review on multi-step ahead forecasting are presented in [38,45].

#### **3. Materials and Methods**

*3.1. Proposed Approach*

3.1.1. Single Models

The objective of time series analysis and forecasting is to identify dependencies in the series' values and build a model able to predict its next values. A time series is an ordered, finite sequence of time-dependent data of the type

$$Z = \{z_1, z_2, \dots, z_t, \dots, z_n\}, \ z_t \in \mathbb{R}, \tag{1}$$

where *t* is the temporal index and *n* is the number of observations. Usually, the data are equidistant, with a different resolution scale (high-level: hourly and daily; or low-level: monthly, annual, or other types). A time series can be univariate, or multivariate when it depends on other series determined over the same time period. In general, many time series are characterized by a complex structure and contain trends, seasonality, jumps, outliers, and other nonlinearities that complicate the task of building an adequate model.

This paper uses the following time series representation of the dependent variable to be predicted:

$$Y = \{y_1, y_2, \dots, y_t, \dots, y_{N_0}, y_{N_0+1}, \dots, y_{N_0+s}\}, \ s = 1, 2, \dots \tag{2}$$

where *yt* ∈ R, *N*0 is the number of observations at some starting moment, and *s* stands for the step ahead in the multi-step procedure; the series' values are updated with the measured values at each increase of *s* by 1. That is, a successively updating horizon is applied. In a real situation, for forecasting with a regression model over a horizon *h*, future values of the *r* independent variables **X***t* = (*X*1,*t*, *X*2,*t*, ..., *Xr*,*t*) should be available. To reflect this, in the multi-step modeling, we assume that each of these is given as a dynamically changing time series:

$$X_j^{(s)} = \left\{ x_{j,1}, x_{j,2}, \dots, x_{j,N}, \widetilde{x}_{j,N+1}, \widetilde{x}_{j,N+2}, \dots, \widetilde{x}_{j,N+h} \right\}, \quad N = N_0 + s - 1, \quad j = 1, 2, \dots, r; \quad s = 1, 2, \dots, S \tag{3}$$

where *N* marks the end of the calibration data with known target values, *xj*,*k* are measured values, and *x*˜*j*,*k* are as-yet-unmeasured forecasted future values, which are updated with the newly measured values at each increase of *s* by 1.

We will consider the simultaneous prediction of *h* future values of *Y* at prediction step *s* by assuming the following general type of dependence:

$$\mathbf{Y}_t^{(s)} = (Y_{N+1}, Y_{N+2}, \dots, Y_{N+h}) = F_t\left(Y_{t-p}, Y_{t+1-p}, \dots, Y_{t-1}, Y_t;\ \mathbf{X}_{t-q}, \mathbf{X}_{t+1-q}, \dots, \mathbf{X}_{t-1}, \mathbf{X}_t;\ \widetilde{\mathbf{X}}_{N+1}, \dots, \widetilde{\mathbf{X}}_{N+h}\right) + \varepsilon_t, \quad t = 1, 2, \dots, N+h \tag{4}$$

where *Ft* is a non-linear real-valued function dependent on the values of the dependent variable *Y* at some previous moments *t* − *p*, ..., *t* − 1, *t*; **X***t* = (*X*1,*t*, *X*2,*t*, ..., *Xr*,*t*) are the predictors at the previous and/or current times *t* − *q*, ..., *t*; the terms **X**˜*N*+1, ..., **X**˜*N*+*h* denote the *h* forecasted-ahead predictor values; and *εt* ~ *N*(0, *σ*2) is assumed to be a white noise process. The forecasted values are denoted by

$$\hat{\mathbf{Y}}_t^{(s)} = \left( \hat{Y}_{N+1}, \hat{Y}_{N+2}, \dots, \hat{Y}_{N+h} \right) \tag{5}$$

To determine them, for every step *s* = 1, 2, ..., a single predictive model *G*ˆ*t*(*s*) of type (5) is first built, with forecasts

$$\left(\hat{g}_{N+1}^{(s)}, \hat{g}_{N+2}^{(s)}, \dots, \hat{g}_{N+h}^{(s)}\right), \ N = N_0 + s - 1 \tag{6}$$

Our study will consider single models constructed with the same method in a successive rolling procedure. However, they could be generated using different methods and algorithms, since they are independent of one another.

#### 3.1.2. Averaging Models

To extend the multi-step ahead forecasting strategies known in the literature, we propose the following approach. We define the sought predictions (5) for each horizon step *s* by averaging the already calculated and still-actual predictions of the single models *g*ˆ*t*(*i*) up to step *s* by the expressions

$$\left(\hat{y}_{t+s}, \hat{y}_{t+s+1}, \dots, \hat{y}_{t+s+h-1}\right)^{(s)} =
\begin{cases}
\left(\hat{g}_{t+1}^{(1)}, \hat{g}_{t+2}^{(1)}, \dots, \hat{g}_{t+h}^{(1)}\right) & s = 1 \\[6pt]
\left(\dfrac{1}{2}\sum\limits_{i=1}^{2}\hat{g}_{t+2}^{(i)},\ \dfrac{1}{2}\sum\limits_{i=1}^{2}\hat{g}_{t+3}^{(i)},\ \dots,\ \dfrac{1}{2}\sum\limits_{i=1}^{2}\hat{g}_{t+h}^{(i)},\ \hat{g}_{t+h+1}^{(2)}\right) & s = 2 \\[6pt]
\qquad \vdots \\[2pt]
\left(\dfrac{1}{s}\sum\limits_{i=1}^{s}\hat{g}_{t+s}^{(i)},\ \dots,\ \dfrac{1}{s}\sum\limits_{i=1}^{s}\hat{g}_{t+h}^{(i)},\ \dfrac{1}{s-1}\sum\limits_{i=2}^{s}\hat{g}_{t+h+1}^{(i)},\ \dfrac{1}{s-2}\sum\limits_{i=3}^{s}\hat{g}_{t+h+2}^{(i)},\ \dots,\ \hat{g}_{t+s+h-1}^{(s)}\right) & 2 < s < h \\[6pt]
\left(\dfrac{1}{h}\sum\limits_{i=s-h+1}^{s}\hat{g}_{t+s}^{(i)},\ \dfrac{1}{h-1}\sum\limits_{i=s-h+2}^{s}\hat{g}_{t+s+1}^{(i)},\ \dots,\ \dfrac{1}{2}\sum\limits_{i=s-1}^{s}\hat{g}_{t+s+h-2}^{(i)},\ \hat{g}_{t+s+h-1}^{(s)}\right) & s \ge h
\end{cases} \tag{7}$$

where *t* = *N*.

For example, Table 1 shows the sequential symbolic forecasts for the horizon *h* = 5. Every single model with starting day *s*, according to (6), is a vector of dimension *h* in column *s*. To calculate the averaging model according to (7), we use the currently available forecasts from the current and previous single models for each day, starting from *t* + *s*. For instance, at *s* = 1, the prediction (7) is equal to the first single model (the vector in the first column (*s* = 1) of Table 1). For the next day, *s* = 2, we have the first two single models in the first two columns from *t* + 2, so we can average over these predictions in Table 1 to predict the horizon from *t* + 2 to *t* + 6, etc. After *s* = *h*, we will have the complete vectors of predictions according to the last formula from (7). For example, in Table 1, for the case *s* = 5, the regions covering the terms that are averaged by (7) to obtain the predictions in a new h-dimensional vector from *t* + 5 to *t* + 9 are marked with dashed lines.
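The counting logic behind this averaging can be sketched in a few lines of Python; this is an illustrative implementation under the assumption that `singles[i-1]` holds the *h*-vector of single model *i* (covering day offsets *i* to *i* + *h* − 1 from *t*), and the function name is hypothetical:

```python
def averaged_horizon(singles, s, h):
    """Averaging rule of type (7): at step s, predict the days at offsets
    s .. s+h-1 from t by averaging every available forecast of each day
    produced by the single models 1..s (model i covers days i .. i+h-1)."""
    preds = []
    for day in range(s, s + h):
        vals = [singles[i - 1][day - i]        # model i's forecast for `day`
                for i in range(1, s + 1)
                if i <= day <= i + h - 1]      # does model i cover this day?
        preds.append(sum(vals) / len(vals))
    return preds

# h = 3: model 1 covers days 1-3, model 2 covers days 2-4.
singles = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
print(averaged_horizon(singles, s=2, h=3))
# → [1.5, 1.5, 2.0]
```

The last day of the horizon is always covered by the current single model alone, matching the final unaveraged term in (7).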


**Table 1.** Forecasts for the case *h* = 5.

3.1.3. Framework of the Proposed Strategy

Our proposed strategy involves two main stages:

Stage 1: Generating Initial Models

This is a procedure for building, calibrating, and selecting an initial model based on historical data of type (1), using a dataset of size *N* = *N*0. For this purpose, we use known measured values of the dependent variable *Y* and the independent variables **X**. The first part of the data, for *t* = 1, 2, ..., *N*0 − *v*, serves to train and validate the models, and the last *v* values are kept for independent out-of-sample testing. This study uses *v* = 31, i.e., a test sample of one month. The main result of this stage is the determination of optimal hyperparameters after evaluation on the test data and error correction using ARIMA; in this case, the models become hybrids. A general scheme of the generation of the initial models in stage 1 is shown in Figure 1.

**Figure 1.** Flowchart of the algorithm of stage 1: Generating initial hybrid models.

When this approach is applied over a long period of time, stage 1 may be periodically initialized to update the hyperparameters.

Stage 2: Multi-Step Ahead Forecasting

This is the core of the proposed approach, which includes the following:


The single independent models are built using the hyperparameters of the initial ML models obtained in stage 1. The predictors are the measured data for the air pollutants and independent time series, their values at previous moments (lagged variables), and forecasted (unmeasured) data for the independent variables. For each given horizon period *s* = 1, 2, ..., *S*, a separate single model is built and evaluated that predicts *h* values, as described in (4) and (6). This is followed by a statistical evaluation of the models and residual diagnostics, including possible error correction with an appropriate ARIMA model.

The corresponding averaging models (5) are obtained by using the known forecasts of single models (6) in (7) up to a given horizon *s* (see also Table 1). The details of stage 2 are shown in Figure 2.

**Figure 2.** Flowchart of the algorithm of stage 2: Multi-step ahead forecasting.

#### *3.2. Model Assumptions*

Each model is built on clearly defined assumptions that determine its limitations for practical application. Our approach complies with the following assumptions.


In our implementation, only a limited number of factors affecting the air pollutants are used, restricted to those measured by state-certified automatic measuring stations in the Republic of Bulgaria, synchronized with European criteria [46]. In this study, time series of three pollutants and eight meteorological variables were used. To account for the influence of the remaining unmeasured factors, lagged variables containing deterministic and stochastic information on these factors were used. The forecasted (unmeasured) weather values among the predictor variables were recorded by us day by day over the selected time period. Such forecasts can be freely retrieved from multiple sources on the Internet for any major populated location over 3-, 5-, and 10-day weather forecast intervals.

As can be seen from (4), the approach enables the use of arbitrary predictors and is not limited to meteorological ones as in this study.

#### *3.3. Methods*

We will use two ensemble tree methods: RF and Arcing (variant arc-x4) (ARC). ARIMA will also be applied for residual correction to improve model accuracy and adequacy. Ensemble methods are presented and discussed, for example, in [47,48].

• Ensemble Model

An ensemble model is defined as the linear combination:

$$\overline{f}(\mathbf{x}) = \sum\_{j=1}^{M} w\_j f\_j(\mathbf{x}). \tag{8}$$

with weights *wj* satisfying the conditions

$$\sum\_{j=1}^{M} w\_j = 1, \quad 0 < w\_j \le 1, \ j = 1, 2, \dots, M. \tag{9}$$

where *fj*(**x**), *j* = 1, 2, ..., *M* are single models created with the same algorithm for different perturbed samples **x**. In this paper, we consider methods for which:

$$w_j = \frac{1}{M}. \tag{10}$$

In the case of regression, the final ensemble model is the arithmetic mean of the predictions of its constituent component models.

• Random Forest

The RF algorithm was developed by Leo Breiman in his well-known paper from 2001 [49], combining his bagging idea with the random subspace method created by Tin Kam Ho in 1995 [50]. It can be briefly characterized as a bagged tree classifier using a majority vote. RF is a high-performance ensemble method comprising tens or hundreds of unpruned decision trees. Generally, RF applies to regression and classification for cross-sectional and time-series datasets with any type of variable. The same procedure is applied for the construction and training of each individual model (tree) *fj*(**x**) from the ensemble (8). Given an initial sample of size *n*, the RF algorithm sets aside a random sub-sample, called out-of-bag (OOB), comprising about one-third of all data, to test the model. From the remaining instances, perturbed (randomized) samples of size up to *n* are formed by sampling with replacement, using bagging [49]. A component tree is built from the formed sample using a recursive binary-splitting procedure. The resulting trees *fj*(**x**) are different and independent of each other. An important aspect of the RF algorithm is the random selection of a subset of all available predictors, of size *mtry* (typically *mtry* = 3 to 5), when splitting the cases at each current node of the tree. The final RF model of *mtree* = *M* trees is found by (8) and (10). RF's ability to calculate variable importance for each component tree and for the composed ensemble model is useful in regression practice. It should be noted that RF is not particularly sensitive to the phenomenon of multicollinearity and is applicable even with highly correlated variables [51].

The main control hyperparameters set before the start of the RF algorithm are: *mtree*—the number of trees in an ensemble; *nodesize*—the size of the smallest allowable parent node; and *mtry*—the number of predictors randomly selected for splitting at each node. The last of these hyperparameters does not significantly affect the model's results.
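For readers who work in Python, these hyperparameters map approximately onto scikit-learn's `RandomForestRegressor` (the study itself used SPM, not scikit-learn, so the mapping below is illustrative, and `nodesize` has no exact equivalent; `min_samples_leaf` is the closest analogue):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Approximate mapping of the text's hyperparameters to scikit-learn:
#   mtree    -> n_estimators      (number of trees in the ensemble)
#   nodesize -> min_samples_leaf  (smallest allowable node, roughly)
#   mtry     -> max_features      (predictors tried at each split)
rf = RandomForestRegressor(n_estimators=200, min_samples_leaf=5,
                           max_features=3, oob_score=True, random_state=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 19))            # 19 predictors, as in Section 4.2
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
rf.fit(X, y)
print(rf.oob_score_ > 0)                  # OOB R^2 computed on left-out data
```

Setting `oob_score=True` reproduces the OOB testing described above: each tree is evaluated on the roughly one-third of cases it never saw during training.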

• Arcing

In this paper, we apply the Arcing method (adaptively resample and combine), also known in the literature as Arc-x4, proposed and studied by Leo Breiman [52]. The algorithm belongs to the group of boosting methods, but it is relatively rarely used, and its predictive properties have not been studied sufficiently; this applies in particular to its ability to forecast time series. Notably, in [53], the authors show empirically that Arc-x4 outperforms all other algorithms of the boosting class for classification on real binary databases.

The algorithm induces an ensemble of sequentially dependent classifiers (models) *C*1, *C*2, ..., *Ck*, ..., *CT* over a number of trials *T*. At the *k*-th step, the classifier *Ck* is trained on the current resampled set *Tk*; the original training set *T* is then run down *Ck*, and the probabilities *P*(*k*+1) for the next classifier *Ck*+1 are updated by the expression

$$P^{(k+1)}(i) = \frac{1 + m(i)^4}{\sum_i \left(1 + m(i)^4\right)}, \tag{11}$$

where *m*(*i*) is the total number of misclassifications of case *i* by the previous classifiers *C*1, *C*2, ..., *Ck*. Unlike AdaBoost [54], classification is performed with unweighted voting, and in the case of regression, the prediction is averaged with equal weights according to (10). It has been established that Arc-x4 reduces both the bias and the variance of unstable models [52,53,55].
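The probability update (11) is simple enough to sketch directly; the function name `arc_x4_update` is ours, not Breiman's:

```python
def arc_x4_update(misclass_counts):
    """Arc-x4 resampling update (11): case i's selection probability for
    the next classifier grows with the fourth power of its error count m(i)."""
    weights = [1 + m ** 4 for m in misclass_counts]
    total = sum(weights)
    return [w / total for w in weights]

# Three training cases: never missed, missed once, missed twice so far.
print(arc_x4_update([0, 1, 2]))
# → [0.05, 0.1, 0.85]
```

The fourth power makes the resampling concentrate quickly on hard cases: a case missed twice already receives 85% of the selection probability in this tiny example.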

• Autoregressive Integrated Moving Average with Transfer Functions

The autoregressive integrated moving average (ARIMA) method is a linear method, also known as the Box–Jenkins methodology [56], widely used for time series analysis and forecasting in statistics and econometrics. The main requirements for its application are normality of the data and stationarity, i.e., a constant mean and variance of the involved time series. However, for large sample sizes (e.g., when the number of observations per variable is greater than 10), violations of the normality assumption often do not noticeably impact the results [57]. In the more general case, the time series (1) may be non-stationary and show a deterministic trend of some order (linear, quadratic, etc., up to some order *d*). Let *B* denote the back-shift operator, *BZt* = *Zt*−1, so the first difference is (1 − *B*)*Zt*. The transition to a stationary time series can be performed by a preliminary calculation of *d* finite differences of the series with the operator (1 − *B*)*d*. The one-dimensional (univariate) ARIMA(*p*, *d*, *q*) model has the following form:

$$\left(1 - \sum_{j=1}^{p} \phi_j B^j\right)(1 - B)^d Z_t = \left(1 - \sum_{j=1}^{q} \theta_j B^j\right) a_t + c, \tag{12}$$

where *p*, *d*, and *q* are non-negative integer model parameters. Here, *p* is the number of autoregressive (AR) terms, *d* is the order of differencing, *q* is the number of moving average (MA) terms for the errors *at*, and *c* is an additive constant [56]. In Equation (12), *φ*1, *φ*2, ..., *φp* are the estimated coefficients of the autoregressive (AR) part and *θ*1, *θ*2, ..., *θq* are the estimated coefficients of the moving average (MA) part.
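The differencing operator (1 − *B*)*d* that produces stationarity can be demonstrated in a few lines; `difference` is an illustrative helper, not part of any library:

```python
def difference(series, d=1):
    """Apply the differencing operator (1 - B)^d: each pass replaces
    z_t with z_t - z_{t-1}, removing one order of deterministic trend."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

# A quadratic trend becomes constant (stationary) after d = 2 differences.
print(difference([1, 4, 9, 16, 25], d=2))
# → [2, 2, 2]
```

This mirrors the rule of thumb behind choosing *d*: a polynomial trend of order *d* is removed by exactly *d* passes of differencing.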

When predictor time series (called transfer functions (TF)) are also used in the modeling, the method is called ARIMA/TF. Predictors are set with parameters (*p*, *d*, and *q*) of the same type. Model (12) takes the form:

$$
\Delta^d Z\_t = \frac{MA}{AR} a\_t + \sum\_{i=1}^k \left( \frac{Num\_i}{Den\_i} \Delta\_i^{d\_i} B^{b\_i} X\_{it} \right) + \mu. \tag{13}
$$

where *Xi*, *i* = 1, 2, ..., *k* are the predictor time series; Δ*d* = (1 − *B*)*d*, Δ*di* = (1 − *B*)*di*; *Bbi* is a delay term of positive integer order *bi*; *MA*, *AR*, *Numi*, and *Deni* are difference polynomials dependent on the predictors' parameters; and *μ* is a constant [56].

• Hybrid method

Let us denote the generated and validated RF or ARC model at each prediction step *s* by *Bs* and its residuals by *res\_Bs*, with values

$$res\_Bs_t = Y_t - Bs_t. \tag{14}$$

If *Ar\_res* is the corresponding ARIMA/TF model of *res\_Bs*, built with the actual observations to forecast the entire considered period, then the hybrid model *hBs* and its residuals are calculated by

$$hBs = Bs + Ar\_res, \quad res\_hBs = \mathbf{Y} - hBs. \tag{15}$$
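The hybrid scheme of (14) and (15) can be sketched as follows; the names and the toy mean-based residual model are ours, standing in for the actual ARIMA/TF correction:

```python
def hybrid_forecast(base_pred, observed, residual_model):
    """Hybrid scheme (14)-(15): fit an auxiliary model to the residuals of
    the base ML forecast and add its correction back, hBs = Bs + Ar_res."""
    residuals = [y - b for y, b in zip(observed, base_pred)]   # (14)
    correction = residual_model(residuals)                      # e.g., ARIMA/TF
    return [b + c for b, c in zip(base_pred, correction)]       # (15)

# Toy residual model: predict every residual as the residual mean.
mean_corr = lambda res: [sum(res) / len(res)] * len(res)
print(hybrid_forecast([10.0, 12.0], observed=[11.0, 14.0],
                      residual_model=mean_corr))
# → [11.5, 13.5]
```

The design intent is that the ML model captures the non-linear signal while the linear residual model mops up any remaining autocorrelated structure.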

#### *3.4. Evaluation Measures*

Let *Y* be the observed true time series (target) and *P* the model prediction; *Yi* and *Pi* (*i* = 1, 2, ..., *n*) are their values, respectively; *P*, *Y* are the mean values; and *n* is the sample size. The following well-known statistical measures of accuracy are used to evaluate the prediction performance of the constructed ML models: root mean squared error (RMSE), normalized mean squared error (NMSE), fractional bias (FB), Theil's forecast accuracy coefficient *UII* [58], coefficient of determination (*R*2), and index of agreement (IA) [59], given by the expressions:

$$RMSE = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} (Y\_i - P\_i)^2}, \quad NMSE = \frac{\sum\_{i=1}^{n} \left(Y\_i - P\_i\right)^2}{\sum\_{i=1}^{n} \left(Y\_i - \overline{Y}\right)^2} \tag{16}$$

$$FB = 2\frac{\overline{Y} - \overline{P}}{\overline{Y} + \overline{P}}, \quad \mathcal{U}\_{II} = \frac{\sqrt{\sum\_{i=1}^{n} (Y\_i - P\_i)^2}}{\sqrt{\sum\_{i=1}^{n} Y\_i^2}}\tag{17}$$

$$R^2 = \frac{\left\{\sum\_{i=1}^n \left(P\_i - \overline{P}\right) \left(Y\_i - \overline{Y}\right)\right\}^2}{\sum\_{i=1}^n \left(P\_i - \overline{P}\right)^2 \cdot \sum\_{i=1}^n \left(Y\_i - \overline{Y}\right)^2}, \quad IA = 1 - \frac{\sum\_{i=1}^n \left(P\_i - Y\_i\right)^2}{\sum\_{i=1}^n \left(\left|P\_i - \overline{Y}\right| + \left|Y\_i - \overline{Y}\right|\right)^2} \tag{18}$$

RMSE and NMSE are used to assess the model's accuracy. The FB index measures a model's tendency to under-predict (values close to 2) or over-predict (values close to −2). IA is a dimensionless measure bounded in [0, 1], with values closer to 1 indicating better agreement between the model and the target variable. The coefficient *UII* is dimensionless and is used to compare models obtained by different methods and to detect large errors. A model is considered to be of good quality when *UII* is less than 1. A good predictive model should have values close to 0 for RMSE, NMSE, and FB, and values close to 1 for *R*2 and IA. It should be noted that using RMSE and the coefficient of determination *R*2 to compare models and forecasts requires care, as they may lead to misleading conclusions ([60], Ch. 14). Also, statistical significance testing may be useful for small validation samples to judge whether accuracy differs among reasonable forecasting methods. For construct validity, the accuracy measures should agree ([60], Ch. 14).
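The measures (16)–(18) translate directly into code; `accuracy_measures` is a hypothetical helper name:

```python
import math

def accuracy_measures(Y, P):
    """The measures (16)-(18) for observed values Y and predictions P."""
    n = len(Y)
    mY, mP = sum(Y) / n, sum(P) / n
    sse = sum((y - p) ** 2 for y, p in zip(Y, P))
    rmse = math.sqrt(sse / n)
    nmse = sse / sum((y - mY) ** 2 for y in Y)
    fb = 2 * (mY - mP) / (mY + mP)
    u_ii = math.sqrt(sse) / math.sqrt(sum(y ** 2 for y in Y))
    cov = sum((p - mP) * (y - mY) for y, p in zip(Y, P))
    r2 = cov ** 2 / (sum((p - mP) ** 2 for p in P)
                     * sum((y - mY) ** 2 for y in Y))
    ia = 1 - sse / sum((abs(p - mY) + abs(y - mY)) ** 2
                       for y, p in zip(Y, P))
    return {"RMSE": rmse, "NMSE": nmse, "FB": fb,
            "U_II": u_ii, "R2": r2, "IA": ia}

# Sanity check: a perfect forecast gives 0 for the error measures
# and 1 for the agreement measures.
print(accuracy_measures([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

Note that the perfect-forecast check illustrates the construct-validity point above: all six measures agree at their ideal values simultaneously.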

Models and statistical analyses were performed using Salford Predictive Modeler 8.2 (SPM) [61] and IBM SPSS statistics software, version 28.0 [62,63] on a laptop (Acer, Intel Core i7, CPU 1.8 GHz).

#### *3.5. Study Area and Data*

The proposed approach from Section 3 was applied to predict air pollutants in Pernik, a typical medium-sized city in Bulgaria. Pernik is located in western Bulgaria (42◦36′ N, 23◦02′ E), about 20 km (12 miles) southwest of the capital, Sofia, with a population of 70,000 as of 2021. The city lies at an altitude between 700 and 850 m (2297 and 2789 feet), has a length of 22 km (14 miles), and is surrounded by three low mountains; the river Struma flows through it. The city's territory is crossed by major roads, including Pan-European Corridors VIII and IV, which connect Central Europe and Greece. The climate of Pernik is moderately continental. Economically, the city is an industrial zone with steel production, heavy machinery (mining and industrial equipment), brown coal, building materials, and textiles.

A dataset was collected for the concentrations of three air pollutants (PM10, SO2, and NO2) in the city of Pernik; Figure 3 shows the sequence plots of the pollutants. Daily data are modeled from 1 January 2015 to 9 February 2019. In the first stage, the training set runs from 1 January 2015 to 21 December 2018 (1450 days), and the test period covers the next 31 days, until 21 January 2019. There are eight independent meteorological variables: maximum air temperature (*maxT*, ◦C), minimum air temperature (*minT*, ◦C), wind speed (*speed*, m/s), wind direction (*direction*, ◦), atmospheric pressure (*press*, mbar), cloud cover (*cloud*, %), relative humidity (*humidity*, %), and precipitation (*precipi*, mm). All measured data were gathered from the official site of the automatic measuring station in Pernik [64,65], and the forecast weather data from the official site Sinoptik.bg.

**Figure 3.** Sequence plots of the examined pollutant data: (**a**) PM10, (**b**) SO2, (**c**) NO2. The horizontal red line in (**a**) indicates the European and national standard for the upper daily PM10 limit of 50 μg/m3. The blue vertical lines separate the training and test samples.

#### **4. Results**

#### *4.1. Preliminary Statistical Processing*

The preliminary statistical processing of the data includes descriptive statistics, treatment of outliers and missing data, research on the multicollinearity of variables, and examination for sequence autocorrelation.

Descriptive statistics for the initial sample of *n* = 1481 cases are given in Table 2. Of these, the pollutant data for the last 31 days were used as an independent test sample in building the initial models. This part of the data can be seen in Figure 3, to the right of the vertical blue lines at 22 December 2018. Missing data are below 5% for all samples; in the analyses, they are replaced using linear interpolation. Table 2 also shows large values of skewness and kurtosis for the three pollutants, *speed*, and *precipi*, a sign that the distributions of these variables are not normal. This could affect the direct application of classical regression methods, but not the ML techniques. In Table 2, large extreme values are observed, particularly for PM10 and SO2. To reduce the influence of single spikes, the few outliers (fewer than six cases per variable) are replaced by the next-largest value. We denote the resulting working variables for the pollutants as *YPM10* (PM10), *YSO2* (SO2), and *YNO2* (NO2). Their statistics are presented in the first three columns of Table 2. Figure 4 shows boxplots of the distributions of these variables, used hereafter as the dependents.
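The linear interpolation used for the missing values can be sketched as follows; `fill_gaps` is an illustrative helper that assumes gaps are interior (measured values exist on both sides), as is the case for short runs of missing daily data:

```python
def fill_gaps(series):
    """Fill interior runs of missing values (None) by linear interpolation
    between the nearest measured neighbours."""
    s = list(series)
    i = 0
    while i < len(s):
        if s[i] is None:
            j = i
            while s[j] is None:          # find the end of the gap
                j += 1
            left, right = s[i - 1], s[j]
            for k in range(i, j):        # evenly spaced fill-in values
                s[k] = left + (k - i + 1) * (right - left) / (j - i + 1)
            i = j
        else:
            i += 1
    return s

print(fill_gaps([10.0, None, None, 16.0]))
# → [10.0, 12.0, 14.0, 16.0]
```

For daily meteorological series with under 5% missing values, this simple scheme preserves the local trend without introducing artificial variance.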


**Table 2.** Summary statistics of the initial data for pollutants and meteorological variables.

**Figure 4.** Box-plots of the used variables *YPM*10, *YSO*2, and *YNO*2 for PM10, SO2, and NO2, respectively.

It is known that the accuracy of regression models can be affected by multicollinearity between variables. A statistical analysis was performed to check for pairwise correlation in the data using the non-parametric Spearman's Rho test. The Rho coefficient was largest in absolute value only for (*minT*, *maxT*), equal to 0.969; all other Rho coefficients are less than 0.7. Since only 3 to 4 randomly selected predictors are used in the RF and ARC algorithms for each tree-node split, we can assume that our data have no problematic multicollinearity.

Moreover, the autocorrelation function (ACF) and partial ACF (PACF) plots of all considered time series showed that they do not exhibit trends. The ACF and PACF of *YPM10* indicated large PACF coefficients for the first 2 to 3 lags; for *YSO2* and *YNO2*, up to the second lag; and for the meteorological variables, an influence was found only at lag 1. Thus, in the general model (4), we obtain *p* ≤ 2, *q* ≤ 1 in our case. That is, we use lagged variables of the dependent variables up to the second order and of all predictors up to the first order.
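The resulting lag structure (*p* ≤ 2 for the dependents, *q* ≤ 1 for the predictors) translates into a design matrix like the following sketch; `lagged_design` is a hypothetical helper name:

```python
def lagged_design(y, predictors, p=2, q=1):
    """Build one model-input row per day t: p = 2 lags of the dependent
    variable plus each predictor's current value and q = 1 lag, matching
    the orders chosen from the ACF/PACF analysis."""
    rows = []
    for t in range(max(p, q), len(y)):
        row = [y[t - k] for k in range(1, p + 1)]       # y_{t-1}, y_{t-2}
        for x in predictors:
            row += [x[t - k] for k in range(0, q + 1)]  # x_t, x_{t-1}
        rows.append(row)
    return rows

y = [1.0, 2.0, 3.0, 4.0]
wind = [10.0, 20.0, 30.0, 40.0]
print(lagged_design(y, [wind]))
# → [[2.0, 1.0, 30.0, 20.0], [3.0, 2.0, 40.0, 30.0]]
```

With the three dependent variables and eight meteorological predictors of this study, such a construction yields the pool of 19 predictors mentioned in Section 4.2.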

In addition, in Figure 5 and Table 3, we give an example of a comparison between the measured values and the forecasted weather conditions for one horizon of *h* = 10 days used in this study. There are considerable inaccuracies in these weather forecasts, except for those concerning relative humidity.


**Figure 5.** Measured meteorological values and their ex-ante weather forecasts (\_f) for a 10-day horizon used in the multi-step procedure (example dataset): (**a**) MaxT, (**b**) MinT, (**c**) speed, (**d**) cloud, (**e**) precipi, (**f**) pressure, and (**g**) humidity.


**Table 3.** Example data for measured values of wind direction and corresponding weather forecasts.

#### *4.2. Construction and Evaluation of the Initial Hybrid Models*

A basic principle of forecasting is the construction of a model that explains well the large historical variations in the dataset [60]. This is our first task. At this stage, the dependent variables *YPM10*, *YSO2*, and *YNO2* of the air pollutants PM10, SO2, and NO2, respectively, and the eight meteorological variables are used. They cover a period of *n* = 1481 days, from 1 January 2015 to 21 January 2019. Of these, the data for the first *N*1 = *N* − *v* = 1450 days were used for training and validation, and the last *v* = 31 days were used as a hold-out (out-of-sample) test sample. Two lagged variables each were used for *YPM10*, *YSO2*, and *YNO2*, and one lagged variable for each predictor, in all initial ML models.

Multiple RF models were built and tuned over different selections of hyperparameters: the number of trees in the model (*mtree*), from 100, 200, and 300; the minimum number of cases per node (*nodesize*), 5 and 10; and *mtry* = 3 and 4 for the random selection of predictors for splitting from the pool of 19 predictors. The models were trained with the OOB procedure. Arcing models (denoted AR or ARC) were selected among models with 20, 30, 40, and 50 trees; the minimum numbers of cases in parent and child nodes were *m*1:*m*2 = 10:5, with *mtry* = 3. ARC models were cross-validated (CV) with standard 5-fold and 10-fold CV. Here, we follow the recommendation of [66] to prefer k-fold cross-validation over hold-out validation. In addition, for greater precision, the initial models were also tested with a separate hold-out test sample of *v* = 31 days.
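In scikit-learn terms (an assumed substitute for whatever implementation the authors used), the grid over *mtree*, *nodesize*, and *mtry* with OOB scoring might look like the sketch below; the synthetic data and the hyperparameter mapping (*mtree* → `n_estimators`, *nodesize* → `min_samples_leaf`, *mtry* → `max_features`) are our assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 19))              # 19 predictors, as in the text
y = X[:, 0] * 3 + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=400)

# Grid over the hyperparameters named in the text, scored by out-of-bag R^2.
best = None
for mtree in (100, 200, 300):
    for nodesize in (5, 10):
        for mtry in (3, 4):
            rf = RandomForestRegressor(n_estimators=mtree,
                                       min_samples_leaf=nodesize,
                                       max_features=mtry,
                                       oob_score=True, random_state=0,
                                       n_jobs=-1).fit(X, y)
            if best is None or rf.oob_score_ > best[0]:
                best = (rf.oob_score_, mtree, nodesize, mtry)
print(best)   # (oob R^2, mtree, nodesize, mtry) of the winning configuration
```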

The hyperparameters for the RF models for the three pollutants showed close values: *mtree* = 300, *nodesize* = 5, and *mtry* = 3, OOB validation. The ARC models with the best statistics are ensembles with 50 trees, *m*1:*m*2 = 10:5, *mtry* = 3, 10-fold CV scheme.

From the built RF models, three models were selected, labeled *TRF\_P*, *TRF\_S*, and *TRF\_N*, for PM10, SO2, and NO2, respectively. Similarly, three ARC models were selected: *TAR\_P*, *TAR\_S*, and *TAR\_N*. A detailed examination of their residuals revealed weak autocorrelations. To correct this lack of fit, ARIMA/TF models of the residuals were built, with all predictors used as transfer functions. The corrections are added to the initial models to construct the hybrid test models using (14)–(15); they are denoted by *hTRF\_P*, *hTRF\_S*, etc. The basic descriptive statistics of the dependent variables *YPM10*, *YSO2*, and *YNO2* are compared with those of these hybrid models in Table 4. Reasonably good agreement of the relevant descriptive statistics is observed for the RF and ARC models with *YPM10*, *YSO2*, and *YNO2*, respectively.

**Table 4.** Descriptive statistics of the initial hybrid models for the test sample vs. measured values a.


a. the standard error of skewness for all variables is 0.064; the standard error of kurtosis for all variables is 0.127.

The following Table 5 presents the main performance results of the initial hybrid models. Row 4 gives the parameters of the ARIMA/TF models of the residuals; in their estimation, insignificant variables and lags were removed at the significance level *α* = 0.05. Row 5 gives the estimated Ljung-Box lack-of-fit statistics applied to the residuals of the ARIMA/TF models [67]. All Ljung-Box statistics are insignificant at level *α* = 0.05, so the null hypothesis of no autocorrelation cannot be rejected, indicating that the model residuals exhibit no significant autocorrelations. For the six hybrid models in Table 5, the Ljung-Box test was applied to the first 24 lags [68]. The last six rows of Table 5 present the statistics from (16)–(18). These show that all hybrid test models perform very well, with the ARC models outperforming the RF models, with the exception of fractional bias.


**Table 5.** Performance statistics of the hybrid RF-ARIMA/TF and ARC\_ARIMA/TF initial models.

Figure 6 illustrates the significance values of the Ljung-Box coefficients under the assumption that the underlying process is independent (white noise).

**Figure 6.** Significance values of Ljung-Box coefficients for residuals of the initial hybrid test models.

Based on the performed diagnostics, we can conclude that the initial hybrid models are adequate and can be used to predict future pollutant values [33,68].

In particular, separately for all three dependent variables (*YPM*10, *YSO*2, and *YNO*2), the corresponding variable importance was established. In all three cases, the results indicated that the lagged variables of the targets, minimal air temperature, and wind speed are among the most important predictor variables for the training process.
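A variable-importance check of this kind can be sketched with a random forest's impurity-based importances; the synthetic data below (a lagged target and a temperature-like predictor driving the response, plus pure-noise columns) are an illustrative assumption, not the paper's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 500
lag1 = rng.normal(size=n)        # lagged dependent variable
minT = rng.normal(size=n)        # minimal air temperature
noise = rng.normal(size=(n, 3))  # three irrelevant predictors
y = 2.0 * lag1 + 1.0 * minT + rng.normal(scale=0.3, size=n)

X = np.column_stack([lag1, minT, noise])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:2])   # the informative predictors (columns 0 and 1) rank first
```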

#### *4.3. Results from Stage 2—Multi-Step Forecasting*

Following the algorithm in Figure 2, the built and calibrated initial models are extended step by step to calculate the *h*-day forecasts of the unknown concentrations of the three pollutants. All models use the already established hyperparameters from stage 1. A separate model is built according to (6) to predict each future horizon *h*.

For completeness, in Figure 7, we present the measured values of air pollutants for 17 days, which we further seek to predict. Some outliers are observed in the first seven days.

**Figure 7.** Measured values of the three air pollutants to be predicted.

#### 4.3.1. Construction and Evaluation of the Single Models

For each of the three pollutants, two single hybrid models were built, with RF and ARC. The models are labeled *RF\_P* and *AR\_P* (for PM10), *RF\_S* and *AR\_S* (for SO2), and *RF\_N* and *AR\_N* (for NO2), respectively, at each horizon step *s*, *s* = 1, 2, ... , 17. The first prediction horizon (*s* = 1) starts from 15 January 2019 and uses the known data from 1 January 2015 to 14 January 2019 plus 10 days ahead of forecasted meteorological data. This sets the length of the initial calibration data at *N*<sup>0</sup> = 1474. The total number of single models needed to forecast 10 days ahead over *s* = 17 period steps is 102.
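The per-period construction of single models can be summarized schematically: at each step, the calibration window is extended by one period and a fresh model forecasts the next *h* days. The helper below is a simplified sketch with a naive stand-in model (a trailing mean), not the actual RF/ARC pipeline, and its names are hypothetical.

```python
import numpy as np

def multistep_forecasts(series, h, steps, fit, predict):
    """At each origin s, fit on all data up to that day and forecast h ahead,
    mirroring the per-period single models (hypothetical helper)."""
    out = []
    n0 = len(series) - steps          # initial calibration length
    for s in range(steps):
        train = series[: n0 + s]      # expanding window, updated each period
        model = fit(train)
        out.append(predict(model, h))
    return np.array(out)              # shape: (steps, h)

# Naive stand-in model: forecast the mean of the last 30 observations.
series = np.sin(np.arange(120) / 7.0) * 10 + 50
f = multistep_forecasts(series, h=10, steps=17,
                        fit=lambda tr: tr[-30:].mean(),
                        predict=lambda m, h: np.full(h, m))
print(f.shape)   # (17, 10): 17 period steps, each a 10-day horizon
```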

From the obtained results, Figure 8 (left column: Figure 8a,c,e,g,i) illustrates the evaluation statistics (16)–(18) of the horizon forecasts from all created single models. It is seen from Figure 8a that the RMSEs of model *RF\_P* are smaller than those of model *AR\_P*; the same relation is observed for *RF\_N* and *AR\_N*, and for *RF\_S* and *AR\_S* at *s* > 6. The larger error values for *s* < 7 are probably due to the more difficult prediction of the large outliers in the original data illustrated in Figure 7. For NMSE, in Figure 8c, we have similar results, and here the differences are even larger in favor of the RF models. In Figure 8e, the values of all FBs lie in the interval (−0.65, 0.70), without large deviations and with a decreasing trend; the largest range of FB is observed for model *AR\_S*. Figure 8g shows the same relations for *Uii* as for NMSE. All of Theil's *Uii* coefficients are less than 1, which indicates very good predictive performance of the models (see also Section 3.4). Finally, Figure 8i shows the IAs of the forecasts, which vary strongly in the interval (0.1, 0.8). Here, the IA values of the RF models outperform the corresponding values of the AR models, although by a smaller margin, with an exception for *RF\_N* at *s* > 10. The overall conclusion is that, despite the better statistical performance of the initial AR models, the RF models do slightly better in predicting ex-ante pollutant concentrations, and this is achieved with highly changeable predictors.
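Common textbook forms of these accuracy measures are sketched below; the exact expressions (16)–(18) are defined in Section 3.4 and may differ in detail (Theil's *Uii* in particular has several variants), so treat these as illustrative definitions, and the sample values as hypothetical.

```python
import numpy as np

def accuracy_stats(o, p):
    """Common air-quality accuracy measures for observed o and predicted p
    (sketch; the paper's exact forms are Equations (16)-(18), Section 3.4)."""
    o, p = np.asarray(o, float), np.asarray(p, float)
    rmse = np.sqrt(np.mean((o - p) ** 2))
    nmse = np.mean((o - p) ** 2) / (o.mean() * p.mean())
    fb = 2 * (o.mean() - p.mean()) / (o.mean() + p.mean())
    # One common variant of Theil's inequality coefficient:
    uii = np.sqrt(np.sum((o - p) ** 2)) / np.sqrt(np.sum(o ** 2))
    # Willmott's index of agreement:
    ia = 1 - np.sum((o - p) ** 2) / np.sum(
        (np.abs(p - o.mean()) + np.abs(o - o.mean())) ** 2)
    return {"RMSE": rmse, "NMSE": nmse, "FB": fb, "Uii": uii, "IA": ia}

stats = accuracy_stats([40, 55, 60, 48], [42, 50, 63, 47])
print(stats["RMSE"])
```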


**Figure 8.** Comparison of the prediction accuracy statistics of all single models RF\_ and AR\_ (on the left) and the corresponding averaging models aRF\_ and aAR\_ (on the right): (**a**,**b**) RMSE; (**c**,**d**) NMSE; (**e**,**f**) FB; (**g**,**h**) Uii; (**i**,**j**) IA.

#### 4.3.2. Construction and Evaluation of the Averaging Models

After computing the 102 single models, with one forecast of horizon *h* obtained at each period step *s*, the forecasts for each day are averaged by (7). The predictive averaging models are labeled *aRF\_P* and *aAR\_P* (for PM10), *aRF\_S* and *aAR\_S* (for SO2), and *aRF\_N* and *aAR\_N* (for NO2). In our case, for horizon *h* = 10, we have calculated the forecast values for *S* = 17 periods.
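Assuming (7) is a simple arithmetic mean of all available forecasts for a given calendar day (our reading; any weighting in the paper's formula would change this), the averaging step can be sketched as follows with a hypothetical forecast matrix.

```python
import numpy as np

# f[s, j]: forecast for calendar day (s + j) made at period step s,
# a hypothetical (S, h) matrix of single-model horizon forecasts.
rng = np.random.default_rng(5)
S, h = 17, 10
f = 50 + rng.normal(size=(S, h))

# Averaging model: for each calendar day, average every available forecast
# of that day across the overlapping horizons (cf. (7)).
days = S + h - 1
sums = np.zeros(days)
counts = np.zeros(days)
for s in range(S):
    for j in range(h):
        sums[s + j] += f[s, j]
        counts[s + j] += 1
avg = sums / counts
print(avg.shape)   # each interior day averages up to h = 10 forecasts
```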

From the obtained results, Figure 8b,d,f,h,j (right column) illustrate the corresponding evaluation accuracy statistics (16)–(18) for all created averaging models. Figure 8b shows RMSE behavior almost identical to that of the single models. Especially for the last 10 horizons, in most cases, the RMSEs of the averaging models are smaller than the corresponding RMSEs of the single models. In particular, for PM10, the *aRF* model's RMSE is equal to 13.1 μg/m<sup>3</sup> vs. 13.8 μg/m<sup>3</sup>; for NO2, the *aRF* model shows 21.5 μg/m<sup>3</sup> vs. 23.8 μg/m<sup>3</sup>. For SO2, the *aAR* model has an RMSE of 17.3 vs. 17.4, and for NO2, the *aAR* model has an RMSE of 22.7 vs. 27.5 for the single model, respectively. In Figure 8d, at *s* > 5, the NMSEs of the averaging models appear more smoothed compared to the corresponding values of the single models in Figure 8c, with an exception at *s* = 17 in model *AR\_P*. The fractional bias lies within the same limits of [−0.65, 0.70] for all constructed models, as shown in Figure 8e,f. Also, all of Theil's *Uii* coefficients are less than 1, which indicates a very good predictive ability of the averaging models.

Although to a lesser extent, the other comparisons lead to the same general conclusion as for single models: a slightly pronounced superiority of RF models over AR. In the following two subsections, we will conduct statistical tests to check if there are statistically significant differences.

#### *4.4. Comparison of the Accuracy Measures of the Forecasts*

In this section, we compare the estimates of the statistical indicators (16)–(18) between the obtained final forecasts of the three pollutant targets (*YPM*10, *YSO*2, and *YNO*2) for the two methods and for the two multi-step-ahead prediction approaches. This comparison does not include *R*<sup>2</sup>, as noted in Section 3.4. For this purpose, we use the Kendall tau-b rank correlation for paired samples, a nonparametric measure of the association between two sets of rankings. This statistic is useful for comparing methods when the number of forecasts is small, the distribution of the errors is unknown, or outliers (extreme errors) exist [60]. A higher positive correlation indicates better agreement between methods or models. The statistical significance of the coefficients is assessed at the 0.05 level.
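Computing Kendall's tau-b for a pair of accuracy series is a one-liner with scipy; the per-horizon RMSE values below are hypothetical and chosen to be perfectly concordant for illustration.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical per-horizon RMSE series from two models being compared.
rmse_rf = np.array([13.1, 14.0, 12.5, 15.2, 13.8, 14.6, 12.9])
rmse_ar = np.array([13.8, 14.5, 12.9, 15.9, 14.1, 15.0, 13.4])

# Kendall's tau-b rank correlation for the paired accuracy estimates;
# a high positive tau means the two methods rank the horizons alike.
tau, p_value = kendalltau(rmse_rf, rmse_ar)
print(round(tau, 3))   # 1.0: the two error series are perfectly concordant
```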

#### 4.4.1. Comparison among the Two ML Methods

Table 6 presents Kendall's tau-b correlations of the accuracy indicators for the pairs of models obtained with the RF and Arcing methods. Good agreement among the methods is observed for RMSE, with coefficients from 0.6 to 0.8. For NMSE, the correlations are high for the PM10 predictions (0.838 and 0.926); for the remaining pairs, they are lower, around 0.4. The highest correlations are those for FB, for all model pairs (from 0.85 to 1.000). The correlations for IA are also high, within the interval 0.5–0.7, except for the NO2 models. For *Uii*, medium and high correlation values are obtained for the PM10 and NO2 models; the lower, partly insignificant *Uii* correlations for some SO2 models can be explained by the few large outliers (see Figure 7). In general, following [60], it can be concluded that the accuracy measures agree well, so the two methods exhibit almost equal predictive quality.

**Table 6.** Kendall's correlations for comparison of the forecast accuracy, calculated using the two methods, RF and Arc-x4, for 17 period steps, each one for a prediction horizon of *h* = 10 steps ahead.


a. Insignificant coefficients.

#### 4.4.2. Comparison among the Two Multi-Step Ahead Strategies

In Table 7, the agreement between each pair of single and averaging models for the five statistical measures (16)–(18) of the forecasts (without *R*<sup>2</sup>) is presented. The correlations for RMSE are between 0.65 and 0.8, except for the NO2 models. The NMSE coefficients are similar (from 0.6 to 0.83), with a lower coefficient observed for the NO2 models (0.309). The FB correlations show high values (0.8–0.99) for all model pairs. The correlations for *Uii* are medium to high, in the range of 0.49–0.75. The correlations for IA are weak, with insignificant coefficients, except for 0.544 for (*AR\_P*, *aAR\_P*) and 0.632 for (*RF\_N*, *aRF\_N*). The results show reasonably good agreement among the forecasts of the single and averaging models.


**Table 7.** Kendall's correlations for comparison of the forecast accuracy of single and averaging models for 17 period steps, each one for a prediction horizon of *h* = 10 steps ahead.

a. Insignificant coefficients.

#### **5. Discussion with Conclusions**

In this study, we developed a multi-step ahead ex-ante forecasting strategy for time series with stochastic and high-frequency behavior. As shown in the preliminary study of the data (Table 2 and Figures 4 and 5), the examined time series of air pollutants do not follow a normal distribution and are characterized by many outliers that cannot be ignored. For the prediction of this type of data, we selected the ML methods RF and Arc-x4. We had previously explored many other modeling methods, including CART, MARS, LASSO, CART ensembles and bagging, stochastic gradient boosting, and more. We chose RF and Arc-x4 not only for their high statistical performance but also for their ability to predict new data well. The goal was to determine which ML methods are most suitable for achieving good results in a real-world situation. For the same reason, the restrictions and assumptions described in Section 3.2 are imposed on the predictors. Here, however, we must note that lagged dependent variables were used as predictors, which indirectly reflect, in a stochastic manner, many other measurable and non-measurable factors influencing the pollution level. We determined the approximate number of lags from the behavior of the ACF and PACF of the dependent and meteorological variables, and on this basis, a general form of the dependence is proposed in (4). In this paper, we chose a short time horizon of *h* = 10 days and repeated the experiments for 17 consecutive horizons (periods). We have yet to investigate the most appropriate horizon length for the proposed strategy; this question remains open.

The developed forecasting strategy consists of two stages. The first stage, the selection and detailed examination of the candidate predictive models, is very important. First, RF and ARC models were created and analyzed, showing very good predictive properties, as can be seen from Table 5. The basic requirement of building models without autocorrelated residuals was carefully checked by examining the relevant ACF and using different statistical tests, including the Ljung-Box portmanteau test. Some residuals were found to have values outside the confidence intervals. For this reason, all models had to be hybridized by correcting their errors. This was done using the ARIMA method: each hybrid model was calculated as the sum of the ML model and the ARIMA model of its residuals. The residuals of the hybrid models were re-examined, and their statistical validity was confirmed by tests. Overall, the results on the right side of Table 5 suggest that the hybrid initial ARC models outperform the RF models for all three pollutants.

The implementation of the second stage of the multi-step prediction required building a large number of single models for each horizon and each pollutant, which turned out to be laborious; updating the database must also be taken into account. For the first *s* = 1, 2, ... , *h* periods, the forecasts are the average value of the available forecasts from the previous and current models. The final averaging models were found using (7). The application of the proposed approach was demonstrated for three different pollutants: PM10, SO2, and NO2. The resulting final predictions were evaluated using five accuracy measures. The comparison of errors is illustrated in Figure 8 for the predictions of both the single and the averaging models. It is seen that the RF models achieve slightly more accurate predictions of the tested future data in all cases. In addition, Kendall's correlation analysis was performed to compare the agreement between the accuracy of the two methods (RF and ARC) and the two strategies (single MIMO and averaging). In general, all indicators agree. Therefore, we can conclude that construct validity was obtained [60] and that both multi-step ahead approaches are viable alternatives.

Many studies have compared the performance of multi-step ahead strategies. However, due to the variety of modeling methods, accuracy criteria, and data, none of the existing strategies is known to be the best in all contexts. Nevertheless, the proposed approach can be formally compared with other results. For example, while in [37], the independent value prediction (Dir) was studied, in the present work, time series with predictors were used, ARIMA error correction was used, the data sets were updated dynamically, and more new elements were involved. The authors of the recent study [39] have employed all five strategies from Section 2 to forecast PM2.5 for 10 days ahead. The best results among all constructed models were achieved using the recursive strategy with LASSO feature selection and forecasting the predictors in future time with the ARIMAX model. In one study, the direct strategy for hourly 1- to 24-step-ahead predictions of the air pollution index in Malaysia is preferred over other approaches [45]. A stacked ensemble with diverse ML modeling techniques using different strategies was adopted for PM2.5 prediction in [39]. Existing multi-step ahead forecasting approaches have been thoroughly reviewed and compared empirically using 111 experimental datasets [38]. The authors concluded that multi-output strategies based on MIMO and DIRMO significantly outperform the single-output methods Rec, Dir, and DirRec. Our findings are primarily consistent with the results in [38].

Compared to the five standard multi-step ahead strategies, the proposed approach can be classified as an extension of the MIMO strategy and called MIMO averaging. It can also be noted that a large number of diverse ML methods are used in the subject area. To the best of our knowledge, our study is the first to examine the ability of Arc-x4 for multi-step ahead prediction. Although the limitations and assumptions of the models are laid out in this paper, the proposed MIMO averaging strategy is general. It can be applied to many different predictors, including dummy, land-use, or other types of variables. Some of the following questions remain open for further study: the choice of horizon length; the optimization of the weights of the individual single models in the summation formula (7); the possibility of stacking the forecasts of single models built by diverse ML algorithms; and more.

For our data and the chosen horizon, *h* = 10, the proposed strategy is seen as an alternative to other multi-step ahead prediction methods. It can be used in comparison with or in conjunction with other approaches. Finally, we can conclude that the proposed MIMO averaging ex-ante forecasting strategy has the potential for real-world application and solving tasks of public interest, such as informing the population about the levels of the main air pollutants in a given local area.

**Author Contributions:** Conceptualization, S.G.-I., A.I. and H.K.; Data curation, A.I. and M.S.-M.; Investigation, S.G.-I., A.I., H.K. and M.S.-M.; Methodology, S.G.-I., H.K. and M.S.-M.; Resources, A.I.; Software, S.G.-I., A.I., H.K. and M.S.-M.; Validation, S.G.-I., A.I., H.K. and M.S.-M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study has emanated from research conducted with the financial support of the Bulgarian National Science Fund (BNSF), Grant number KP-06-IP-CHINA/1 (KΠ-06-ИΠ-KИTAЙ/1). The authors also acknowledge the support of the Bulgarian National Science Fund, Grant KP-06-N52/9.

**Data Availability Statement:** The data used in this study are freely available on the official websites provided in references [64,65].

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
