*Article* **Analysis of Atmospheric Pollutant Data Using Self-Organizing Maps**

**Emanoel L. R. Costa 1,†, Taiane Braga 2,†, Leonardo A. Dias 3,† , Édler L. de Albuquerque 4,\* ,† and Marcelo A. C. Fernandes 1,5,\* ,†**


**Abstract:** Atmospheric pollution is a critical issue in our society due to the continuous development of countries. Therefore, studies concerning atmospheric pollutants using multivariate statistical methods are widely available in the literature. Furthermore, machine learning has proved a good alternative, providing techniques capable of dealing with problems of great complexity, such as pollution. Therefore, this work used the Self-Organizing Map (SOM) algorithm to explore and analyze atmospheric pollutants data from four air quality monitoring stations in Salvador-Bahia. The maps generated by the SOM allow identifying patterns between the air quality pollutants (CO, NO, NO<sup>2</sup> , SO<sup>2</sup> , PM<sup>10</sup> and O<sup>3</sup> ) and meteorological parameters (environment temperature, relative humidity, wind velocity and standard deviation of wind direction) and also observing the correlations among them. For example, the clusters obtained with the SOM pointed to characteristics of the monitoring stations' data samples, such as the quantity and distribution of pollution concentration. Therefore, by analyzing the correlations presented by the SOM, it was possible to estimate the effect of the pollutants and their possible emission sources.

**Keywords:** machine learning; atmospheric pollution; Self-Organizing Maps; Salvador-BA

#### **1. Introduction**

Air pollution is one of the crucial challenges of modern society. In recent years, pollution caused by industrial, vehicular, and toxic-chemical emission sources has increased significantly. This increase can be seen mainly in low- and middle-income countries, also called developing countries [1]. Despite the continuous pollution growth, awareness and pollution control programs are limited and receive little attention and financial resources from governments, international agencies, and philanthropic donors [1].

In addition, effectively managing regulations for controlling air pollution requires considerable knowledge about the costs and benefits. Currently, the primary efforts for measuring pollutants aim to avoid possible harm to people's health, such as respiratory or cardiovascular diseases that can result in hospitalizations and even death, usually affecting vulnerable groups of the population [2].

Complex mixtures of solid particles and gaseous pollutants contribute to air pollution. Among these are priority pollutants, commonly regulated by law and categorized as primary and secondary. The primary pollutants are substances that can be released directly

**Citation:** Costa, E.L.R.; Braga, T.; Dias, L.A.; Albuquerque, É.L.d.; Fernandes, M.A.C. Analysis of Atmospheric Pollutant Data Using Self-Organizing Maps. *Sustainability* **2022**, *14*, 10369. https://doi.org/ 10.3390/su141610369

Academic Editors: José Carlos Magalhães Pires and Álvaro Gómez-Losada

Received: 12 July 2022 Accepted: 17 August 2022 Published: 20 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

into the atmosphere, while the secondary pollutants are substances derivated from the primary ones through photochemical reactions in the troposphere [3]. Regarding the gaseous pollutants, to be particulary mentioned are sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), volatile organic compounds (VOCs), solid materials or liquids suspended in the atmosphere due to their small size (called particulate matter (PM)), and the ozone (O3). The ozone is one of the major photochemical pollutants formed in the atmosphere by the reaction of nitrogen oxides (NOx) and hydrocarbons such as VOCs in the presence of sunlight, similarly to particulate sulfate and nitrate aerosols created from SO<sup>2</sup> and NO<sup>x</sup> [3].

The dispersion of atmospheric pollutants results from different elements such as temperature, relative humidity, atmospheric pressure, wind direction and speed, as well as topography [4]. Consequently, the complexity of analyzing and identifying pollutants and their primary sources in large-scale areas increases, which leads to the problem of positioning monitoring stations for data collection.

There are several emission sources of air pollutants, and a single source can emit several pollutants. For instance, the composition of fossil fuels used in motor vehicles can emit different pollutants during combustion and evaporation, or by the wear of tires and roads where vehicles run. Due to the increasing number of private vehicles, their emissions have become a dominant source of CO, CO2, VOCs, NO<sup>x</sup> and PM. Meanwhile, industrial processes normally include pollutants such as CO, PM, NOx, and SO<sup>2</sup> [4–6].

Thus, monitoring the concentration of pollutants in the environment at specific points is essential. Identifying the main components enables understanding of the current condition of air pollution, variations, correlations, and possible emission sources, which leads to the development of public policies to raise awareness and reduce pollutants. Therefore, many researchers have proposed the analysis of environmental data mainly using multivariate statistical techniques [4,7,8].

Multivariate statistical methods such as correlation or cluster analysis [9–13], and principal component analysis [7,14,15] are commonly applied in various studies to identify the correlation among parameters that can influence air quality. Large databases that carry various information about air pollution require techniques to extract and identify characteristics inherent to the analyzed data.

In this context, machine learning has proved to be a great alternative to the traditional methods used [16,17]. A well-known algorithm that belongs to the group of unsupervised learning algorithms is self-organizing maps (SOM) [18]. The SOM supports data dimensionality reduction and clustering. In addition, the SOM does not need to make assumptions about the parameters' distribution, as it is capable of dealing with non-linear problems of great complexity and dimension and is effective in using noisy data [19].

The SOM algorithm is adopted in many applications to analyze data from atmospheric pollutants. For example, in Ref. [20], the SOM is used to analyze data regarding air quality. In Ref. [21], the SOM is used to identify the level of pollution during foundry and land mining. The study carried out in [22] used the SOM to highlight the impact on air quality caused by the circulation of different air types, which alters the concentration of pollutants in the atmosphere. For this purpose, it is essential to identify suitable placements for positioning monitoring stations, as shown in [23]. Finally, the SOM has also been used to obtain particulate-matter characteristics in the atmosphere by evaluating its concentration in both internal and external exposure and connecting them to human activities. According to [24], the SOM can also function as a pollution identifier by defining limits to classify regions with low or high concentrations of a specific pollutant, such as ozone, enabling the evaluation of pollution zones.

Therefore, this work proposes an SOM implementation to study and analyze atmospheric pollutants to identify their patterns and characteristics. The main contributions are:

• A machine-learning-based approach for analyzing the air quality of Salvador monitoring stations, using the Government of Bahia State database—to the best of our knowledge, this work is the first to analyze this data using machine-learning algorithms.

• We discuss the common factors among meteorological parameters and pollutants and their clusters' impact on each monitoring station.

#### **2. Methodology**

#### *2.1. Case Study*

Salvador city (State of Bahia) has a territorial area of 693.453 km<sup>2</sup> and a population of 2,675,656 people. Located in the northeastern region of Brazil, it has an urban core and rugged topography formed by several columns and valleys, with a rainy tropical climate with no dry season and an average annual temperature of 25 ◦C.

The Government of Bahia State, through CETREL S. A., the company that operated the air monitoring stations from 2011 to 2016, provided the air quality database for this work. It contains the air quality data of a monitoring network constituted of eight stations. Nonetheless, we used data from four stations: Barros Reis (BR), Campo Grande (CG), Dique do Tororó (DT), and Itaigara (IT), due to their inherent characteristics. Figure 1 illustrates the stations' distribution in Salvador and highlights the four chosen. It is important to mention that this is the first air quality monitoring network ever installed in the city of Salvador. Therefore, this work portrays the first analysis of the pollutant and meteorological parameters in the database provided.

**Figure 1.** Location of the eigth air monitoring stations deployed in Salvador-BA.

#### *2.2. Dataset*

The dataset contains the hourly average of twelve features related to meteorological parameters and pollutants concentration. The meteorological parameters are wind speed (WS), ambient temperature (TEMP), relative air humidity (RH), the standard deviation of wind direction (STWD), rainfall, and wind direction. Meanwhile, the pollutants are SO2, CO, O3, particulate matter whose aerodynamic diameter is less than 10 µm (PM10) and the oxides of nitrogen NO<sup>2</sup> and NO. We removed the rainfall and wind direction variables due to the small amount of data available; thus, only ten features were used in our analysis.

We performed a data preprocessing step by removing the null lines, the measurement errors (identified by a specific terminology), and the outliers to improve the quality of the analysis. The outliers were removed by investigating the data dispersion and symmetry and, subsequently, using the quartile separatrix measure [25] to divide the dataset into three quartiles: *Q*1, *Q*<sup>2</sup> and *Q*3. Finally, based on the interquartile range (*AIQ*) [25], outliers with value greater than *Q*<sup>3</sup> + 3 × *AIQ* and less than *Q*<sup>1</sup> − 3 × *AIQ*, were removed from the database. We kept outliers with values greater than *Q*<sup>3</sup> + 1.5 × *AIQ* and less than *Q*<sup>1</sup> − 1.5 × *AIQ* to avoid a large reduction in the dataset. Table 1 presents the number of data samples for each monitoring station considered in our analysis and their period of operation.


**Table 1.** The operation period for each monitoring station provided by CETREL S. A., and the number of data samples available in the dataset before and after the preprocessing step.

In the meantime, Tables 2–5 present the dataset for the Barros Reis (BR), Campo Grande (CG), Dique do Tororó (DT), and Itaigara (IT) stations, respectively. As can be observed, all pollutants and atmospheric data are shown after the preprocessing step for each station in a concentration of pollutants in parts per billion (ppb).

**Table 2.** Descriptive statistics of pollutants and atmospheric data from the Barros Reis station (*P* = 21,559 samples).


**Table 3.** Descriptive statistics of pollutants and atmospheric data from the Campo Grande station (*P* = 24,559 samples).



**Table 4.** Descriptive statistics of pollutants and atmospheric data from the Dique do Tororó station (*P* = 42,037 samples).

**Table 5.** Descriptive statistics of pollutants and atmospheric data from the Itaigara station (*P* = 15,535 samples).


As can be observed, the BR station presents a higher concentration of SO2, CO, and PM10. The SO<sup>2</sup> has a maximum of 3.20 ppb and an average of 0.45 ppb due to the burning of fuels with sulfur. Meanwhile, the CO has a maximum of 2180 ppb and an average of 601.6 ppb, produced by burning organic fuels. The PM<sup>10</sup> has an average of 40.10 ppb, almost double the value of other stations; it is a solid or liquid material that remains suspended in the atmosphere that can cause a significant impact on human health.

The CG station also has a high level of CO, with a maximum of 1830 ppb and an average of 396.6 ppb. Regarding the presence of nitrogen oxides (NO and NO2), the CG and DT stations present higher average and maximum concentrations due to the combustion processes and atmospheric chemical reactions. Concerning the O3, a secondary pollutant formed in the atmosphere indicating the presence of photochemical oxidants, it has its higher concentrations recorded at the DT and IT stations.

Therefore, the SO2, CO, and NO pollutants present the most significant variations in concentration. These pollutants are mainly generated from the burning of fossil fuels. Hence, the station location and the intensity of the vehicle's traffic around its region can lead to different concentration records at certain times of the day. The datasets comprise 24 h of daily data collection.

All stations show similar measured values regarding the meteorological parameters, except for wind speed which has a high average at BR and IT stations, and the standard deviation of wind direction at CG. Note that the values were rescaled from 0 to 1 to improve the SOM results. In addition, this work performed the *z*-score normalization and logarithmic transformation, obtaining data with null mean and unit variance and reducing the data scale, respectively.

#### *2.3. Self-Organizing Maps (SOM)*

The Self-Organizing Map (SOM) is a neural network model widely applied to data dimensionality reduction and clustering [18,26]. The map consists of *M* neurons commonly arranged in a two-dimensional array representing the incoming data by shifting the neurons' position towards it. The maps' topology can be rectangular, hexagonal, or square, among others [18].

The *N*-dimensional input data sample can be characterized as

$$\mathbf{x} = [\mathbf{x}\_1, \mathbf{x}\_2, \dots, \mathbf{x}\_N]. \tag{1}$$

Accordingly, each *i*-th neuron in the map is represented by a *N*-dimensional vector of weights expressed as

$$\mathbf{w}\_{i} = [w\_{i1}, w\_{i2}, \dots, w\_{iN}].\tag{2}$$

Therefore, the topology of a two-dimensional map with *M* neurons can be expressed as (*M<sup>h</sup>* × *Mv*), where *M<sup>h</sup>* is the number of neurons in the horizontal and *M<sup>v</sup>* is the number of neurons vertically; thus, *M* = *M<sup>h</sup>* × *Mv*.

The SOM algorithm iteratively molds the neurons' map to the input data topological form, based on a similarity metric, according to the following steps [18]:


$$j = \arg\min\_{i} \left||\mathbf{x}(p) - \mathbf{w}\_{i}||\_{\prime} \; i = 1, 2, \dots, M. \tag{3}$$

4. Update the BMU neuron and its neighboring neurons' weights according to following

$$\mathbf{w}\_{l}(t+1) = \mathbf{w}\_{l}(t) + \eta(t)h\_{l,j}(t)(\mathbf{x}(p) - \mathbf{w}\_{l})\tag{4}$$

where *η*(*t*) is the learning rate (ranging from 0 to 1) and *hi*,*j*(*t*) represents the BMU neighborhood function at the *t*-th iteration. The neighborhood function is described as

$$h\_{i,j}(t) = \exp\left(-\frac{d\_{i,j}^2}{2\sigma^2(t)}\right) \tag{5}$$

where *d* 2 *i*,*j* is the distance from the *i*-th neuron to the BMU (*j*-th neuron) and *σ* 2 (*t*) is the neighboring function size at the *t*-th iteration.

5. Repeat steps 2, 3 and 4 until the maximum number of iterations is reached, represented here by *T*.

The number of iterations must be enough to process the dataset samples several times; thus, *T* = *b* × *P*, where *b* is the repetition number that every set of *P* samples is presented to the SOM. Moreover, increasing the iteration number (*t*) decreases the radius of the neighborhood function, *σ* 2 (*t*). Consequently, the number of neurons nearby the BMU to be updated is reduced, strengthening their connection and similarities. After training the network, each *p*-th entry **x**(*p*) is associated with a specific BMU in the output layer, and entries that share similar patterns will be associated with the same BMU or its neighbors, which can be understood as a grouping in the SOM.

We applied the SOM to each monitoring station shown in Table 1. Each *p*-th sample in the dataset has *N* = 10 dimensions, 6 regarding atmospheric pollutants (SO2, CO, O3, PM10, NO, and NO2) and 4 concerning meteorological parameters (WS, TEMP, RH, and STWD). Therefore, the SOM enables analyzing the influence and characteristics of these variables.

#### *2.4. SOM Parameters*

The map size is the first parameter to be defined. For this purpose, it is necessary to determine the number of neurons to be used during training; in addition, avoiding a large or small number of neurons is vital to prevent problems such as non-identification of characteristics and overfitting [27]. Commonly, the number of neurons can be determined using the following heuristic equation

$$M \approx 5\sqrt{P} \tag{6}$$

where *P* is the number of input data samples [27].

Subsequently, the map topology (*M<sup>h</sup>* × *Mv*) was defined according to quality measures commonly used for the SOM network, the quantization error (QE) and topographic error (TE) [28,29]. For each station, different values of *M<sup>h</sup>* and *M<sup>v</sup>* were tested, in which *M<sup>h</sup>* × *M<sup>v</sup>* = *M* (Equation (6)). Finally, to analyze the results, three different types of normalization were applied to the data: *z*-score, min–max, and logarithmic.

Hence, all tests were performed with *b* = 500, a hexagonal topology, and the training algorithm was applied in two steps. Firstly, the learning rate and neighborhood function were initialized as *η*(0) = 0.5 and *σ* 2 (0) = *<sup>M</sup><sup>h</sup>* 2 , respectively, and decreased over iterations. Secondly, these values were fixed as *η* = 0.05 and *σ* <sup>2</sup> = 1. Tables 6–9 present the quality measures obtained for each test.

**Table 6.** SOM quality measures for Barros Reis Station data (best values in bold).


**Table 7.** SOM quality measures for Campo Grande Station data (best values in bold).



**Table 8.** SOM quality measures for Dique do Tororó Station data (best values in bold).

**Table 9.** SOM quality measures for Itaigara Station data (best values in bold).


Considering both QE and TE measures, the lowest values were obtained using min– max normalization. Thus, *M<sup>h</sup>* and *M<sup>v</sup>* were chosen according to the best result, being highlighted in each table.

#### **3. Results**

#### *3.1. U-Matrix, Components Plane and Parameter Similarity*

The SOM output can be represented by a unified distance matrix (U-matrix) and a component plane, both illustrated in Figure 2. The U-matrix provides a visualization of the relative distance between neurons in the map, which is evidenced through a color scale, and highlights the calculated distance between the adjacent neurons [18]. The closer the color approaches a dark blue in the U-matrix, the closer these neurons are, i.e., they have a more significant similarity. On the other hand, the closer the color approaches a dark red, the greater the distance between the neurons and their dissimilarity. In general, this form of representation allows us to consider that neurons with smaller distances form a cluster. In contrast, neurons with high distances can be considered as boundaries of a cluster.

The component plane shows the values of the weight vectors of each neuron through a color code, where the blue and red colors correspond to low and high values, respectively. This representation allows the recognition of parameter dependencies by comparing the patterns of each plane. The color gradient of a plane represents the parameters' value (component) for the analyzed samples. Each neuron is assigned a color according to the parameter value in that neuron; thus, it can be said that two or more parameters are related based on a comparison of their color gradients. A coherent gradient indicates a positive correlation, while an inverse gradient a negative correlation.

**Figure 2.** Unified distance matrix (U-matrix) and component planes of all analyzed variables (SO<sup>2</sup> , CO, O<sup>3</sup> , PM10, NO, NO<sup>2</sup> , WS, TEMP, RH, and STWD) from the Itaigara station.

#### *3.2. Itaigara Station*

Analyzing the component planes in Figure 2, it is possible to note that the relative humidity (RH) and temperature (TEMP) planes display inverse gradients, indicating a negative correlation between these parameters—something already expected given their characteristics. For CO, NO, and NO<sup>2</sup> pollutants, their weight vectors present a dark red color on the left side of the components' plane, with a higher concentration of high values at the top left side; hence evidencing a certain similarity between them. These pollutants are generated by combustion, and incomplete burning of organic fuels, which are very common in cities with a large circulation of vehicles (the leading emitter) [5].

The O<sup>3</sup> pollutant can be formed by the reaction of nitrogen oxides with VOCs. However, it presents a different pattern than NO2, which contributes to the formation of photochemical oxidants such as O3. As can be seen in the O<sup>3</sup> component plane, its high-value region is concentrated on the right side, similar to the wind speed component plane. Therefore, it can be said that the O<sup>3</sup> presence at the Itaigara Station probably came from another region carried by the wind, as it has a low concentration near traffic routes and is generated by photochemical reactions.

The PM<sup>10</sup> showed a different pattern than the other pollutants. Its main concentration region, with high weight vector values, is in the upper part of the plane. Since its emission sources are diverse, such as vehicles, biomass burning, industries, and dust resuspension, it is difficult to identify the major contributor pollutant. However, its formation can also be carried out in the atmosphere through VOCs, SO2, and nitrogen oxides.

The most distinct pattern presented was by SO2, with high values and concentration in the lower left part, it does not resemble any other component plane. This pollutant is released mainly by heavy vehicles burning diesel oil in urban areas.

An SOM arranges similar patterns in the same neighborhood region, clustering the network's output. Hence, an investigation into the clustering of samples provides important information about the data.

The U-matrix in Figure 2 illustrates how close or far the neurons are, showing their clusters. However, the cluster boundaries are not clearly represented, making it challenging to identify them. One of the methods for choosing the appropriate number of clusters is the so-called Davies–Bouldin index [30], an evaluation measure commonly used in SOM networks for validating clusters [31,32].

#### 3.2.1. Sample Grouping with the SOM Algorithm

For the Davies–Bouldin index, the lowest value found indicates the best number of clusters for the analyzed problem. Thus, an experiment was conducted by varying the number of clusters from two to eight and observing the obtained values. The best result was achieved for a total of four clusters.

Aftwards, a hierarchical analysis was performed to define the neurons belonging to the four clusters. For this purpose, the Euclidean distance was used as the similarity metric and the Ward neuron linking criterion, illustrated by the dendrogram shown in Figure 3. A dendrogram threshold value is defined for that to which cluster each neuron belongs (horizontal line in Figure 3).

**Figure 3.** Hierarchical analysis of the neurons clusters using the Ward linkage method and Euclidean distance for the Itaigara station.

In addition, based on the hierarchical analysis, the SOM neurons were classified in four clusters, as shown in Figure 4. Therefore, the samples assigned to each cluster and its neurons present the characteristics of the distribution of pollutants and meteorological parameters. Table 10 shows the mean value of samples for each parameter and cluster.

**Figure 4.** SOM neurons grouped into four clusters obtained by the hierarchical analysis of the Itaigara station.


**Table 10.** Parameters' average values for every cluster formed by the SOM network for the Itaigara station.

According to Table 10, cluster 1 samples exhibit, in general, a low concentration of air pollutants, except for O<sup>3</sup> and PM10, which have the highest average concentration. In addition, cluster 1 presents a wind speed and temperature considerably higher, and lower relative humidity. In total, about 34% of the data was assigned to cluster 1, thus sharing those characteristics.

Cluster 2, presented in Table 10, shows the lowest concentrations of SO2, CO, PM10, and NO pollutants, with intermediate values of O3, and NO2. It also presents the lowest average wind speed, intermediate temperature, and high relative humidity. In addition, cluster 2 is composed of 29% of the data, characterized by a low concentration of pollutants.

The highest concentrations of CO, PM10, NO, and NO<sup>2</sup> are found in cluster 3, as can be observed in Table 10. In contrast, SO<sup>2</sup> and O<sup>3</sup> show low values (with O<sup>3</sup> having the lowest total average among all clusters). The wind speed, temperature, and relative humidity have intermediate values. A total of 24% of the data was assigned to cluster 3, characterized by high pollutant concentration values.

Finally, cluster 4 is mainly characterized by the high concentration of the SO<sup>2</sup> pollutant compared to the others. The other pollutants present intermediate concentration values, as well as wind speed, temperature, and relative humidity. In addition, cluster 4 has the lower amount of samples; a total of 2027 (13%) were assigned here.

#### 3.2.2. Parameter Correlation

The component planes allow an initial and preliminary analysis of parameters through their visual gradients which, in a certain way, can turn out to be subjective and discretionary. Thus, to carry out a more objective and effective analysis of the results, a correlation analysis was applied between the component planes seen in Figure 2. Figure 5 shows the similarity between the planes (parameters) using the Ward criterion and the Pearson correlation coefficient, *r*.

As can be observed in Figure 5, two main branches are seen in the correlation analysis. The first branch, on the left of the figure, includes all the pollutants studied but O3, whose origin is exclusively photochemical. Hence, O<sup>3</sup> is clustered with the wind speed and temperature.

**Figure 5.** Parameter correlation using Ward criterion and distance 1 − *r*, where *r* is Pearson coefficient, for the Itaigara station.

The NO, NO2, and CO pollutants have a substantial similarity, probably due to a similar emission source such as vehicular, given the station allocation and the monitoring region. Those pollutants are correlated to PM10, which also has a vehicular origin. In addition, the PM<sup>10</sup> is connected to STWD, showing that intensive vertical turbulence (atmospheric instability), which is characterized by high STWD values, increases the PM<sup>10</sup> concentration. Thus, it can be said that the wind movement is dragging out PM<sup>10</sup> from other areas or causing the resuspension of particulate material at Itaigara station. In addition to vehicle influence, the particulate matter may also be dispersed by the existing vehicle movement, the wear of traffic lanes, and the vehicles' brake pads.

The similarity between RH and SO<sup>2</sup> shows the influence of RH on the formation or decomposition process of molecules during the heterogeneous procedure (liquid phase). In particular, the SO<sup>2</sup> can react with the air humidity and other oxidants in the atmosphere to form sulfuric acid H2SO<sup>4</sup> and ammonium sulfate [33].

Meteorological parameters, such as wind speed, considerably influence the O<sup>3</sup> pollutant [24]. Given the similarity between O3, the wind speed, and temperature (Figure 5), we consider that O<sup>3</sup> is not generated at the monitoring station site but instead transported by winds along with other pollutants such as VOCs. The temperature may also be responsible, since high temperatures result from the increase in the speed of chemical processes, generating ozone in the region.

#### *3.3. Barros Reis Station*

In the BR station component planes (Figure 6), the weight vectors for the PM10, CO, NO, and NO<sup>2</sup> are displayed similarly across the map. The concentration of high values is on the upper left side, with average values in the nearby regions. The low values are located mainly in the lower right region of the map. All these pollutants can be formed from combustion processes, which shows the similarity obtained and, in particular, if they have a common source.

Unlike the pollutants discussed above, the O<sup>3</sup> component plane has its highest concentration at the bottom right of the map. O<sup>3</sup> is a secondary pollutant, i.e., its formation depends on atmosphere reactions from other pollutants, such as NO2. Still, its plane does not resemble the planes of primary pollutants. Similarly, PM<sup>10</sup> is also a secondary pollutant but is formed by SO2, and no similarity is seen in their plane. However, PM<sup>10</sup> can also be obtained from VOCs and nitrogen oxides, showing a relationship between their planes.

The SO<sup>2</sup> plane displays a unique pattern, with its highest values concentrated in the upper right region of the map, showing no similarity with the other pollutants. The component planes referring to meteorological parameters showed different distributions, with a negative correlation between TEMP and RH. At the same time, the high WS values are concentrated in the upper central region, and STWD with values dispersed throughout the map.

**Figure 6.** Unified distance matrix (U-matrix) and component planes of all analyzed variables (SO<sup>2</sup> , CO, O<sup>3</sup> , PM10, NO, NO<sup>2</sup> , WS, TEMP, RH and STWD) from the Barros Reis station.

#### 3.3.1. Sample Grouping with the SOM Algorithm

Figure 6 presents the clusters through the U-matrix, representing the neurons with their distance to adjacent neurons. The cluster number was defined with the Davies–Bouldin index by varying it from two to eight, reaching the best result for three clusters.

Subsequently, a hierarchical analysis was performed to define the neurons belonging to the three clusters. Thereupon, the Ward criterion and the Euclidean distance were used as similarity metrics. Figure 7 displays the dendrogram obtained with the threshold value used for segregation. Meanwhile, Figure 8 shows how the clusters were arranged on the map.

**Figure 7.** Hierarchical analysis of the neurons clusters using the Ward linkage method and Euclidean distance for the Barros Reis station.

**Figure 8.** SOM neurons grouped into three clusters obtained by the hierarchical analysis of the Barros Reis station.

The samples are linked to a particular neuron belonging to one of the three clusters, allowing the analysis of the sample's distribution regarding the clusters.

Table 11 shows the average values of every parameter according to the cluster. As can be seen, cluster 1 represents the samples with the lowest pollutant concentration, except for O<sup>3</sup> which has a median value among the others. Meteorological parameters such as wind speed, temperature, and relative humidity also have low values. In total, the cluster has 10,599 samples with these characteristics, corresponding to 49.16% of the station data.


**Table 11.** Parameters average values for every cluster formed by the SOM network for the Barros Reis station.

In the meantime, cluster 2 exhibits the highest concentration of pollutants, displaying a considerable difference from the values of other clusters except for O3, which has the lowest average value obtained. Similar to cluster 1, the wind speed, temperature, and relative humidity also have low values. Cluster 2 has 4183 samples, equivalent to 19.40% of the data.

Finally, the samples assigned to cluster 3 have an intermediate value of pollutants concentration, with average values between the clusters 1 and 2 range, except for O<sup>3</sup> which has the highest average concentration recorded. In addition, cluster 3 has 31.44% of the station data with the highest wind speed and the lowest relative humidity.

#### 3.3.2. Parameter Correlation

The component planes, shown in Figure 6, present the correlation between parameters. Meanwhile, Figure 9 presents the parameters similarity obtained using the Ward linking method and the Pearson correlation coefficient.

**Figure 9.** Parameter correlation using Ward criterion and distance 1 − *r*, where *r* is Pearson coefficient, for the Barros Reis station.

As shown in Figure 9, there is a substantial similarity between CO and NO pollutants. Given the BR station characteristics (located in between two avenues), it can be said that motor vehicles are the primary emission source of those pollutants. Likewise, the NO<sup>2</sup> and PM<sup>10</sup> pollutants are also emitted by combustion in vehicles; in addition, they can be formed secondarily by photochemical processes. Regarding SO2, it can be said that the primary emission source is the burning process of fuels, such as diesel and gasoline, from heavy vehicles such as trucks, buses, microbuses, and light vehicles.

Unlike other pollutants, the O<sup>3</sup> showed a clear relationship with meteorological parameters such as wind speed and temperature, similar to the Itaigara station. Nonetheless, this relationship with meteorological parameters is not strong as in other stations.

The STWD indicates the local atmospheric stability. Its inverse relationship with RH can be related to the regions' water molecules' dissipation. Hence, the data regarding pressure and heat could improve the analysis precision by demonstrating the influence of the wind direction. The RH and STWD present a negative relationship with the other pollutants, consequently leading to the non-contribution or reduction in the present concentrations.

#### *3.4. Campo Grande Station*

Figure 10 illustrates the component planes for the CG station. Concerning the planes of nitrogen oxide, a significant similarity between NO<sup>2</sup> and CO can be observed, with high values concentrated in the central part of the map. The NO plane is also similar to the CO and NO2, but the high values are concentrated in the region to the right, while median values are concentrated in the map center. The emission source of these pollutants is fuel combustion, especially from vehicles.

The SO<sup>2</sup> has high values concentrated in the lower right region of the map. The PM10, on the other hand, did not show significant pattern similarities with other planes, having a higher concentration in the upper part of the map and moderate concentration in the lower part, equivalent to small regions of the SO<sup>2</sup> and NO<sup>2</sup> planes. Likewise, the O<sup>3</sup> pollutant also shows no similarity with other component planes. Despite its formation, resulting from the reaction between NO<sup>2</sup> and VOCs, its concentration of high values is located at the map edges, having similarities with the concentration regions of high values of meteorological parameters, such as WS, TEMP, RH, and STWD.

**Figure 10.** Unified distance matrix (U-matrix) and component planes of all analyzed variables (SO<sup>2</sup> , CO, O<sup>3</sup> , PM10, NO, NO<sup>2</sup> , WS, TEMP, RH and STWD) from the Campo Grande station.

#### 3.4.1. Sample Grouping with the SOM Algorithm

To identify the CG station clusters through the U-matrix, illustrated in Figure 10, the Davies–Bouldin was used and the cluster number varied from two to eight. The best result was obtained for five clusters. Aftward, the neurons belonging to each cluster were obtained according to a hierarchical analysis defined based on the *Ward* method and Euclidean distance. Figure 11 shows the resulting dendrogram and the segregation threshold. Meanwhile, Figure 12 displays the neurons distribution regarding the clusters.

**Figure 11.** Hierarchical analysis of the neurons clusters using the Ward linkage method and Euclidean distance for the Campo Grande station.

**Figure 12.** SOM neurons grouped into five clusters obtained by the hierarchical analysis of the Campo Grande station.

Each CG station dataset sample was integrated into the cluster with the neuron it most resembles. Thus, an analysis was performed regarding the samples' distribution by cluster based on the average values of parameters, as shown in Table 12.


**Table 12.** Parameters average values for every cluster formed by the SOM network for the Campo Grande station.

According to Table 12, cluster 1 has the lowest average values of concentration for the SO2, CO, NO, and NO<sup>2</sup> pollutants, while the PM<sup>10</sup> and O<sup>3</sup> show intermediate values. Moreover, the wind speed and temperature are the lowest of all. Cluster 1 consists of 6640 data samples, equivalent to 27.04% of the dataset.

Cluster 2 also presents low average values of the concentrations of the pollutants, with values slightly higher than those obtained in cluster 1, except for the O<sup>3</sup> pollutant, which has a higher concentration average. Similar behavior can be seen for the meteorological parameters except for the wind speed, which shows the highest average among all clusters. In total, 16.96% of the data was assigned to cluster 2.

The samples assigned to cluster 3 present intermediate values for all pollutants concentration and meteorological parameters, where the temperature has the highest average and the relative humidity the lowest. This cluster has 25.34% of the data.

The highest concentrations of CO, PM10, NO, and NO<sup>2</sup> are found in cluster 4, with an intermediate concentration of SO<sup>2</sup> and the lowest concentration of O3. Meanwhile, all meteorological parameters showed intermediate values compared to other clusters. Cluster 4 has a total of 17.22% of the data.

Meantime, cluster 5 stands out with the highest average concentration of the SO<sup>2</sup> pollutant. The other pollutants, as well as the meteorological parameters, present intermediate average values. In total, 7.44% of the data was assigned to cluster 5.

#### 3.4.2. Parameter Correlation

The hierarchical representation for the CG station was obtained with the *Ward* method and the *Pearson* correlation coefficient. Figure 13 presents the parameters' correlation obtained.

**Figure 13.** Parameter correlation using *Ward* criterion and distance 1 − *r*, where *r* is *Pearson* coefficient, for the Campo Grande station.

First of all, the similarity between CO, NO, and NO<sup>2</sup> pollutants can be seen. These pollutants are emitted in urban areas mainly by motor vehicles, and their similarity validates the idea of a potential common emission source. The temperature is also similar to those three pollutants as it contributes to chemical processes that form them—for example, the NO<sup>2</sup> results from the sunlight action on NO. Thus, the temperature can impact the amount of those pollutants present in every season.

The PM<sup>10</sup> is a primary and secondary pollutant, and it is correlated to SO2. Thus, its atmospheric formation can be linked to gases turning into particles due to chemical reactions in the air, such as sulfur dioxide. The SO<sup>2</sup> is generated from the burning of fuels with sulfur in its composition, such as diesel oil or industrial fuel oil, and it appears to be related to the PM<sup>10</sup> due to motor vehicle emissions, among other processes.

The photochemical oxidant, O3, has a certain correlation with the wind speed, but with a much lower similarity than that presented by the Itaigara station. In addition, there is no apparent relationship with the temperature. The RH has a negative relationship with O<sup>3</sup> and wind speed, which may be a consequence of solar radiation; low RH concentrations are related to a high solar incidence and, therefore, a greater disposition to the O<sup>3</sup> formation.

#### *3.5. Dique do Tororó Station*

The SOM network component planes for the DT station are shown in Figure 14. The pollutants that are mainly emitted by combustion processes, such as CO, NO, NO2, and PM<sup>10</sup> showed similar distribution patterns of values, with the highest concentration from the left side to the upper left side of the map. In contrast, the PM<sup>10</sup> has higher values at the bottom of the map, similar to the temperature and wind speed.

**Figure 14.** Unified distance matrix (U-matrix) and component planes of all analyzed variables (SO<sup>2</sup> , CO, O<sup>3</sup> , PM10, NO, NO<sup>2</sup> , WS, TEMP, RH and STWD) from the Dique do Tororó station.

As in the other stations, the SO<sup>2</sup> showed a different pattern from the other pollutants, with regions of high values concentration at the edges of the map. However, one of the high-concentration edges slightly coincides with those of the CO, NO, and NO2. Lastly, the O<sup>3</sup> displays high values at the lower right region of the map, with a similar distribution to the wind speed plane. The other planes, such as relative humidity and STWD (which can influence the concentration of pollutants), showed patterns with well-defined regions at the top of the map.

#### 3.5.1. Sample Grouping with the SOM Algorithm

The map neurons, represented by their respective distances to adjacent neurons in the U-matrix (Figure 14), were used to visualize and determine the clusters. For this purpose, the Davies–Bouldin index was used, and the number of clusters varied from two to eight, resulting in the best amount with three clusters. Again, hierarchical analysis was carried out using the Ward criterion and Euclidean distance. Figure 15 illustrates the dendrogram, and Figure 16 the segregation borders of the map.

**Figure 15.** Hierarchical analysis of the neurons clusters using the Ward linkage method and Euclidean distance for the Dique do Tororó station.

**Figure 16.** SOM neurons grouped into three clusters obtained by the hierarchical analysis of the Dique do Tororó station.

Each cluster was assigned a certain number of samples according to their characteristics. Table 13 presents the concentration averages of each pollutant according to the cluster.

**Table 13.** Parameters average values for every cluster formed by the SOM network for the Dique do Tororó station.


As can be seen in Table 13, cluster 1 represents the samples with the highest mean value of O<sup>3</sup> and intermediate values of the other pollutants (SO2, CO, NO, NO2, and PM10). The highest concentration value is the wind speed, while relative humidity and STWD are the lowest. Cluster 1 has 26,101 samples sharing its characteristics, equivalent to 62.09% of the station data.

The pollutants in cluster 2 had the highest average concentration, except for O<sup>3</sup> which showed the lowest concentration among all clusters. The wind speed presents low values, and the temperature parameter is the highest. In total, 17.87% of data constitutes this cluster.

Finally, the pollutants in cluster 3, that is, CO, NO, NO2, and PM10, had the lowest average concentrations, with SO<sup>2</sup> and O <sup>3</sup> showing intermediate values. The temperature and wind speed parameters have the lowest values found and the relative humidity the highest. Cluster 3 represents 20.04% of the station data with 8425 samples.

#### 3.5.2. Parameter Correlation

The DT station component planes, shown in Figure 14, presents the parameters correlation. Meanwhile, Figure 17 illustrates the parameter similarity obtained through the *Ward* criterion and the *Pearson* correlation coefficient.

**Figure 17.** Parameter correlation using *Ward* criterion and distance 1 − *r*, where *r* is *Pearson* coefficient, for the Dique do Tororó station.

As can be seen in Figure 17, the CO, NO, and NO<sup>2</sup> pollutants have the most significant similarity, a characteristic also observed for other stations. All stations are located in urban centers with a large flow of vehicles, leading to the possibility of a common emission source of these pollutants, mainly coming from the local vehicular fleet. The PM<sup>10</sup> also showed a certain similarity with those pollutants, indicating a possible emission from fuel burning. The temperature parameter at the DT is also related to the mentioned pollutants, different from other stations where it is associated with O3. In addition, the temperature can contribute to NO<sup>2</sup> formation and PM<sup>10</sup> in secondary processes.

Like Barros Reis station, the RH and the STWD at the DT station are somewhat similar but with a positive coefficient. The RH and STWD can be influenced by atmospheric parameters such as pressure and heat and, consequently, the wind conditions and water particles.

The SO2, different from the Itaigara station, is not correlated to either the PM<sup>10</sup> or RH, as it is probably being generated by an independent source and not reacting to other pollutants.

Given that O<sup>3</sup> is a secondary pollutant, it was only correlated with wind speed, with no apparent similarity with temperature or nitrogen oxides. Therefore, its concentration at the DT station may be transported by the wind accompanied by other pollutants.

#### **4. Discussion**

The SOM implementation presented in the previous sections identifies the correlation among different air quality parameters for many monitoring stations. The SOM component planes provide a visual representation of the similarities between pollutants and meteorological parameters, simplifying their analysis and highlighting peculiarities.

Usually, the CO, NO, and NO<sup>2</sup> pollutants were related, showing higher similarities. On the other hand, the meteorological parameters differed from PM<sup>10</sup> and SO2. The RH and STWD parameters at Barros Reis station showed a negative correlation, unlike at Dique de Tororo station, where a positive correlation was presented. At Itaigara station, the influence of atmospheric stability was identified through the relationship between STWD and PM10. Meanwhile, Campo grande station shows some degree of similarity between PM<sup>10</sup> and SO2. These relations are essential to identify the influence of meteorology on air-pollutants concentrations and information employed to create strategies for mitigating air-pollution critical episodes.

Unlike other pollutants, the O<sup>3</sup> presents a more significant link with meteorological parameters such as WS, as seen at Itaigara, Dique do Tororó and Campo grande stations. Thus, we can infer that the wind is mainly responsible for the transport of O3. In addition, the correlation of the TEMP, WS, and O<sup>3</sup> parameters at Barros Reis and Itaigara stations indicates an increase in O<sup>3</sup> resulting from chemical processes, probably due to the influence of solar radiation.

The data of Dique do Tororo and Barros Reis stations were grouped only into three clusters, with their cluster 1 emphasizing a large number of samples with higher concentrations of O3. In contrast, the other clusters present a sample distribution with intermediate to high concentrations for the CO, NO, NO2, PM10, and SO<sup>2</sup> pollutants. Meanwhile, the Itaigara station has four clusters, with one mainly characterized by the SO<sup>2</sup> pollutant; the remaining clusters are defined by higher concentrations of CO, NO, NO2, PM10, and O3. Similar to Itaigara, the Campo Grande station has one cluster (out of five) where SO<sup>2</sup> is predominant, while the other clusters display low and high concentrations.

Commonly, studies about atmospheric pollutants rely on methods such as principal component analysis (PCA) and hierarchical analysis to define clusters based on similarity. For example, the studies carried out by [8,34] describe the clusters' characteristics according to the percentage of their main components' variance, thus, indicating which variables have more significance for their definition. Meanwhile, by applying a hierarchical classification on the SOM neurons, we can obtain the variables' concentration value and influence on defining each cluster.

In the meantime, in [13,35], the number of clusters is fixed for all monitoring stations, and the k-nearest neighbors provide a relationship between the defined clusters of each station. However, the SOM also allows an individual characteristic analysis of each pollutant, like in [35].

Thereby, the SOM enables finding similarities and estimating the link between parameters more deeply. As described in this work, the SOM can obtain data patterns and cluster characteristics and demonstrate the parameters' influence, which is not trivial in other techniques. Additionally, it can also deal with the non-linearity complexity of air pollution data [36], simplifying the analysis process and increasing its precision; this shows the advantage of using a machine-learning-based approach compared to traditional methods.

#### **5. Conclusions**

We implemented an SOM to analyze the air-quality data of four stations in the monitoring network of Salvador, Brazil. A detailed discussion regarding pollutants and their correlation with meteorological parameters is provided, assisting in estimating possible common emission sources and the influence of meteorological parameters. The latter permits the establishment of relations between meteorology and pollutants concentration, which is vital for developing, for example, alert systems to identify critical episodes of air pollution or for assisting in developing strategies to improve air quality.

The SOM outputs enabled the identification of data particularities concerning the parameters analyzed. For example, the data samples' concentration of Dique do Tororo and Barros Reis stations showed a cluster with a high concentration of O3. In contrast, the other clusters presented well-defined contributions of remaining pollutants. The Itaigara and Campo Grande stations presented a more detailed definition regarding the clusters of (1) CO, NO, NO2; (2) MP10; (3) O3; and (4) SO2. Thus, the SOM also allows an analysis of the particularities of each cluster.

The results showed that the SOM could identify characteristics, describe similarities, recognize patterns, and define clusters of air-pollution problems. Unlike traditional methods, the SOM proved to be a good tool for studying atmospheric pollutants, providing several aspects that can contribute to and improve discussions in this area. To the best of our knowledge, this is the first study to analyze Salvador's air-quality monitoring database. Therefore, the tool developed and the results presented and discussed here can assist further studies and aid in the development of public policies for pollution management.

**Author Contributions:** All the authors have contributed in various degrees to ensure the quality of this work (e.g., E.L.R.C., T.B., L.A.D., É.L.d.A. and M.A.C.F. conceived the idea and experiments; E.L.R.C., T.B., L.A.D., É.L.d.A. and M.A.C.F. designed and performed the experiments; E.L.R.C., T.B., L.A.D., É.L.d.A. and M.A.C.F. analyzed the data; E.L.R.C., T.B., L.A.D., É.L.d.A. and M.A.C.F. wrote the paper. É.L.d.A. and M.A.C.F. coordinated the project). All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)—Finance Code 001.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors wish to acknowledge the financial support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for their financial support. The authors want to thank the "CETREL S. A. Company and Bahia State Government" for the availability of the monitoring data in Salvador.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

