**3. Methodology**

## *3.1. Framework*

Figure 1 shows the proposed framework for an AQDSS. There are three key stakeholders identified, namely the individual, the healthcare industry, and the government. The three pillars at the center represent the elements that are directly linked to the government. They include pollution laws, sectorial regulations, and incentives, all of which make up the air quality monitoring policies of the government.

**Figure 1.** Proposed framework for the Air Quality Decision Support System (AQDSS).

As pointed out previously, these government regulations play an important role, as they largely support and drive policies that enable the measurement of and access to air quality data.

The other two stakeholders are the individuals and the healthcare sector, which form the apex of the framework. They are supported by five layers of activities, as illustrated by the pillars on both sides. The first three on the left correspond to the analysis of past data to estimate PAPE and their related health impacts. The two pillars on the right represent future possibilities of forecasting PAPE and the associated health risks.

Although the estimation of PAPE is fundamental to the entire AQDSS, its role in forecasting and in predictive health risk assessment is also noteworthy. Air quality forecasting techniques are already being explored in the environmental modeling literature [59,60], which has highlighted their important contributions to the development of control measures that prevent damage to human health.

This proposed framework can be adopted to aid in the development of an AQDSS and various IoT applications. For instance, consider a mobile application that allows an individual to select the route from home to work that minimizes the risk of pollution exposure. This could be achieved by employing different modeling techniques to continuously analyze real-time air quality data and forecast PAPE values for each of the possible routes to the destination. Actual PAPE data that are stored in the database can also be used for epidemiological studies and air quality policy development. This could be one of the applications of the proposed framework when all identified pillars are fully employed. In this paper, however, as a proof of concept, we focus mainly on PAPE measurement (pillar 2) in accordance with the system architecture that is illustrated at the base of the framework. In order to manage sensor data in an interoperable way, this implementation considers the Web Service Description Language provided by the Sensor Observation Service v2.1 (SOS) from the SOS-OGC consortium. This standard defines a Web service interface that allows querying observations, sensor metadata, and representations of observed features. Furthermore, it defines a means to register new sensors and to remove existing ones, as well as operations to insert new sensor observations. The feasibility was assessed through the developed case study.

#### 3.1.1. System Architecture

As indicated in the framework, the base shows the set of activities that are related to the gathering and management of all air quality and personal data. This groundwork is required for the entire system to function. There are five different sources of data. They are the (S1) outdoor pollution monitors, (S2) location tracking application, (S3) indoor pollution monitors, (S4) e-beacons, and (S5) meteorological monitors. S1 and S2 are intended for outdoor pollution modeling, and S3 and S4 are for indoor pollution modeling. We also consider meteorological data, as they are relevant for air quality prediction studies [60,61].

The data are extracted from the mentioned data sources and stored in a database management system. Data mining, numerical modeling, and geostatistics, as shown in the center of the framework, are the key activities that support the entire system, as it is a continuous process to discover and analyze spatio-temporal data.

#### 3.1.2. PAPE Measurement

There are essentially three elements to consider when measuring PAPE. They are (1) outdoor pollution, (2) indoor pollution, and (3) the individual's location pattern. With respect to the outdoor pollution, the mobile-phone-based tracking app provides the time and location data of the individual in the outdoor environment. An outdoor pollution map is created by using potentially different strategies, such as


Each of these techniques has specific advantages and limitations, and the requirements of a given application will guide the choice. In particular, the last family of methods is well suited to the fixed network of pollution stations that the city has implemented. It also addresses the limitations of sparse data: data fusion can increase the reliability of the data and help to account for local effects such as street canyons by using the street-granularity IoT air quality stations that some cities are deploying, such as Airbox in Taipei [67] and Array of Things (AoT) sensor boxes in Chicago [68].

For the duration of time that an individual is outdoors, the corresponding pollution data are estimated by superimposing the developed outdoor pollution map over the collected location pattern data.

For the indoor pollution, the e-beacons indicate the period when the individual is indoors, and the indoor air quality monitors provide the corresponding air quality data, when available. Failing this, outdoor information will be used by default. The integration of personal mobiles and fixed e-beacons located in different indoor micro-environments enables the individual's time-location information to be understood. The corresponding time-location knowledge combined with location-specific indoor air quality information collected from air monitoring devices can provide a detailed picture of personal exposure in the indoor environment.

Both outdoor and indoor data are then integrated, and statistical modeling techniques are employed to either estimate or forecast the individual's PAPE.

#### *3.2. Madrid Case Study*

To assess the feasibility of the proposed IoT application, which measures PAPE and empowers users by providing relevant figures at the personal level, we conducted a case study to analyze its most significant functionalities.

#### 3.2.1. Study Area

The study area was the City of Madrid, which is the capital of Spain, as well as its largest municipality. It was the first city in Spain to have air quality monitoring stations and has always been at the forefront of the fight against air pollution. In response to the most recent EU directive (Directive 2008/50/EC) regarding the establishment of limits to major air pollutants, the Madrid government has committed to maintaining acceptable pollution levels by continuous air quality monitoring.

The Madrid air pollution monitoring network consists of 24 fixed-site outdoor monitors (Figure 2). The hourly averaged measurements of SO2, CO, NO, NO2, PM10, PM2.5, C6H6, toluene (C6H5–CH3), hexane (C6H14), propene (C3H6), m-xylene, o-xylene, and methane (CH4) hydrocarbons can be downloaded free of charge from the official open data website of the *Ayuntamiento de Madrid* [69]. Meteorological data, such as temperature, humidity, ultraviolet radiation, pressure, solar radiation, rainfall, precipitation, diffuse solar radiation, global radiation, wind speed, and wind direction can also be accessed through the website of the *Agencia Estatal de Meteorologia* [70].

**Figure 2.** Air Quality Monitoring Network of Madrid.

#### 3.2.2. Data Collection

In this case study, S1, S2, S3, and S4 data sources were used and meteorological data were excluded (see Figure 1). We had one individual volunteer whose activities were monitored during the study period.

For this particular case, Madrid has not yet implemented the street-based pollution monitoring strategy, but based on similar studies [43,44], the research team adopted the geostatistics-based approach, as it scales linearly with time and is suitable for integrating additional data sources. Therefore, outdoor pollution figures were downloaded from the mentioned open data website of Madrid City Hall. For location tracking, we used the mobile app Moves [71], in which time, location, and activity were accessed through an open Application Programming Interface (API). Other similar open source mobile apps are widely available, such as OwnTracks [72], Miataru [73], and Geo2Tag [74].

For the indoor pollution, Foobot indoor monitors were used. One of them was placed in the individual's workplace, as this is where she spends most of her indoor time. The indoor pollution data were retrieved from Foobot's API [75]. The e-beacon devices were placed in proximity to the indoor monitors. They helped us to determine whether the individual was within the indoor vicinity. The e-beacon data were broadcast through Eddystone, an open-source beacon format, and were retrieved through an app that we developed in Cordova [76], a free and open-source platform for building mobile applications.

All of these data sources promote the scalability of the proposed IoT application, as most are publicly available without charge. The only costs incurred were for the indoor monitor and e-beacons. E-beacons, however, are low-cost, small enough to attach to any surface, and are finding an increasing number of location-based applications in various industries such as retail and transportation as well as in households [77]. Hence, beacon technology offers a promising solution for indoor location tracking.

All data that were collected from the mentioned data sources were processed as indicated in the Data in Brief Collection documents that were submitted to the journal for this paper. The developed code can also be found in a public repository [78].

The selection of the pollutants used for the PAPE estimation was based primarily on the data provided by the devices, which also agreed with the data on the most common air pollutants that have been widely studied previously [23,24]. Table 1 shows the available pollutants for each of the data sources used.


**Table 1.** Available Pollutants for each Data Source.

#### 3.2.3. Outdoor Pollution Modeling

Existing studies of PAPE essentially rely on modeling techniques in which data collected from fixed-site outdoor monitors are used to estimate pollution at specific geographic locations. To create an outdoor pollution map, there are several alternative methods. These include using micro meteorological numerical-based models (WRF, CMAQ, etc.) [79], or machine-learning-based models [64]. However, for the sake of simplicity and considering the computational costs and the number of potential users, we adopted some classical but still cost-effective approaches like the Inverse Distance Weighting (IDW), Simple Kriging, Ordinary Kriging, and Co-Kriging algorithms.

Table 2 shows the formulae and main characteristics of these techniques in which *z*-0 is the measured value at the prediction location, *λi* is the weight of the measured value at the ith location, and *Xi* is the measured value at the ith location. The parameters that were tuned are also indicated in the table.

All three methods estimate the value at a particular location by assigning weights to the surrounding known values and calculating the weighted sum of the data. These techniques differ mainly in the calculation of the assigned weight *λi*. Kriging, which is a geostatistical method, offers advantages over other interpolation techniques, as it provides an interpolation error estimate and is an exact interpolator. The interpolations are based on weights that do not depend on data values [80]. The advantages of the deterministic interpolation technique IDW, on the other hand, are that it is simple, intuitive, and quick to compute [81]. We created the outdoor pollution map by employing these three interpolation techniques in R, an open-source statistical modeling software.
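The weighted-sum idea can be illustrated with a minimal Python sketch of IDW. It is not the authors' implementation, which was written in R. The function name `idw_estimate`, the planar coordinates, the station values, and the default `power` of 2 are all assumptions made for this example.

```python
import math

def idw_estimate(stations, target, power=2.0):
    """Estimate the value at `target` as a weighted sum of station values.

    stations: list of ((x, y), value) pairs; target: (x, y).
    Each weight is proportional to 1 / distance**power, and the weights
    are normalised so that they sum to 1.
    """
    weighted = []
    for (x, y), value in stations:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # the target coincides with a station
        weighted.append((1.0 / d ** power, value))
    total = sum(w for w, _ in weighted)
    return sum(w * v for w, v in weighted) / total

# Hypothetical station coordinates (km) and PM2.5 readings (ug/m3)
stations = [((0.0, 0.0), 8.0), ((2.0, 0.0), 12.0), ((0.0, 2.0), 10.0)]
estimate = idw_estimate(stations, (1.0, 0.0))
```

In this sketch the weights depend only on distance, whereas Kriging derives them from a fitted variogram.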



#### 1. Optimal Parameters and Model Selection

In order to select the optimal parameters and the best modeling technique for each of the hourly outdoor pollution datasets, a 5-fold cross validation was performed to avoid overfitting. For each of the 24 hourly datasets, each of the three modeling techniques, and all combinations of their respective parameters, the selection of optimal values was based on the root-mean-squared-error (RMSE) metric. In each fold, the dataset was separated into two parts, training and testing, which were used to fit the model and calculate errors, respectively. The parameters and the model that provided the lowest RMSE were selected.
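The selection procedure can be sketched in Python as follows, using IDW with a tunable power parameter as the only candidate model. The candidate values, the synthetic station data, and the pooling of squared errors across folds are assumptions made for illustration. The authors' R code may organize the folds and the RMSE differently.

```python
import math
import random

def idw(train, target, power):
    """Inverse distance weighted estimate at `target` from `train` points."""
    num = den = 0.0
    for (x, y), v in train:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return v
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den

def kfold_rmse(data, power, k=5, seed=0):
    """RMSE of IDW predictions on held-out points, pooled over k folds."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    sq_errors = []
    for i, test_fold in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        sq_errors += [(idw(train, xy, power) - v) ** 2 for xy, v in test_fold]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

def select_power(data, candidates=(1.0, 2.0, 3.0), k=5):
    """Return the candidate power with the lowest cross-validated RMSE."""
    scores = {p: kfold_rmse(data, p, k) for p in candidates}
    return min(scores, key=scores.get), scores

# Synthetic 24-station dataset (values are not from the Madrid network)
rng = random.Random(42)
data = []
for _ in range(24):
    x, y = rng.uniform(0, 10), rng.uniform(0, 10)
    data.append(((x, y), 8.0 + 0.4 * x + rng.gauss(0, 0.2)))

best_power, scores = select_power(data)
```

The same loop extends to several techniques by scoring each technique and parameter combination and keeping the one with the lowest RMSE for each hourly dataset.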

For the Simple and Ordinary Kriging techniques, the weights *λi* were derived by fitting a covariance function or variogram. First, a graph of the empirical variogram was plotted and a model was fitted to the points based on this plot. Table 3 shows the different models and functions from which to choose when fitting a model to the empirical variogram. Based on the 5-fold cross validation, the Gaussian Model was selected as the optimal configuration.
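For reference, one common parameterization of the Gaussian variogram model can be written as a short Python function. The parameter names `nugget`, `partial_sill`, and `range_` are assumptions introduced here, and the exact form in Table 3 may differ (for example, some texts scale the exponent by a factor of 3).

```python
import math

def gaussian_variogram(d, nugget, partial_sill, range_):
    """Semivariance at lag distance d under a Gaussian model.

    gamma(d) = nugget + partial_sill * (1 - exp(-(d / range_)**2)) for d > 0,
    and gamma(0) = 0 by convention.
    """
    if d == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-(d / range_) ** 2))

# Illustrative parameters only, not fitted to the Madrid data
lags = [0.0, 0.5, 1.0, 2.0, 5.0]
semivariances = [gaussian_variogram(d, 0.1, 2.0, 1.5) for d in lags]
```

Fitting then amounts to choosing the parameters so that this curve follows the empirical variogram points.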

#### 2. Outdoor Pollution Map

Similar to [43], an hourly outdoor pollution map was created that was based on the identified optimal parameters and modeling technique for each respective hour. Figure 3 shows an example of the pollution maps based on the PM2.5 pollution data on 2017-03-24. It shows that, from midnight to the morning at around 6:00, the highest pollution levels consistently occurred in the southwestern part of the city and moved towards the north with maximum levels that ranged from 8 to 12 μg/m3. Concurrently, high pollution levels were also experienced in the northwestern part of the city at midnight and in the northeastern part at 01:00 in the morning.

The selection of the time frequency (hourly in this case) also affects the accuracy, depending on how sharply the pollution levels fluctuate. In Madrid, the pollution sources are strongly related to traffic, so variations are smooth [82]. Therefore, an hourly frequency is a convenient basis for calculations.


*d*= distance between two locations, *c*0 = y-intercept, *α* = range.

#### 3.2.4. Indoor Pollution Modeling

The main data sources used to model the indoor pollution were the e-beacons and indoor monitor. The timestamps recorded from the e-beacons provide the time when the individual was detected indoors.

In this study, we refer to "indoor" as the work location, since the indoor monitor was only present at the individual's workplace. The "outdoor" environment, on the other hand, refers to any other location outside the workplace. To obtain the corresponding pollution values during these periods, each of these timestamps was matched to the closest timestamp logged from the Foobot device. As illustrated in Figure 4, the pollution values were then aggregated in time periods based on Equation (1) on the assumption that, if the difference between two sequential timestamps recorded on the e-beacons was more than 10 min, the individual was outdoors and a new indoor period would start. For instance, in Figure 4, from 2017-03-24 at 14:37:46 to 2017-03-24 at 14:44:59, the pollution levels were aggregated, since the timestamp immediately following 2017-03-24 at 14:44:59 is 2017-03-24 at 14:59:00 and the difference is longer than 10 min. Therefore, during the period between 2017-03-24 at 14:44:59 and 2017-03-24 at 14:59:00, the individual was outdoors and the new indoor period resumed at 2017-03-24 14:59:00.
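The 10 min gap rule can be sketched in Python as follows. The function name `split_indoor_periods` is hypothetical, and the 14:41:10 timestamp is invented to fill out the example. The other three timestamps are the ones quoted in the text.

```python
from datetime import datetime, timedelta

def split_indoor_periods(timestamps, max_gap=timedelta(minutes=10)):
    """Group beacon detection times into indoor periods.

    A gap longer than `max_gap` between consecutive detections closes the
    current period (the individual is assumed to have been outdoors in
    between) and starts a new one.
    """
    periods = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > max_gap:
            periods.append((current[0], current[-1]))
            current = []
        current.append(ts)
    if current:
        periods.append((current[0], current[-1]))
    return periods

fmt = "%Y-%m-%d %H:%M:%S"
detections = [datetime.strptime(s, fmt) for s in (
    "2017-03-24 14:37:46",
    "2017-03-24 14:41:10",  # illustrative intermediate detection
    "2017-03-24 14:44:59",
    "2017-03-24 14:59:00",
)]
periods = split_indoor_periods(detections)
```

With these inputs the first period runs from 14:37:46 to 14:44:59 and a second period starts at 14:59:00, matching the example described above.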

Similarly, micro-environments (office, printing room, meeting room) in the workplace could be replicated by deploying e-beacons and air monitoring devices in all available micro-environments.

**Figure 3.** Co-kriging interpolation of PM2.5 on 24 March 2017.

The PAPE Exposure(p) in period p inhaled by the individual was calculated by multiplying the pollution value SZ(p) by the respective minute ventilation (VE) value using Equation (2).

$$SZ(p) = \sum_{t=i}^{n} \frac{1}{2} \left(Z_{t_{i+1}} + Z_{t_i}\right)\left(t_{i+1} - t_i\right) \tag{1}$$

where *t*<sub>*i*+1</sub> − *t*<sub>*i*</sub> < 10 min, and *SZ*(*p*) is the fully aggregated pollution value during the period from time *t*<sub>*i*</sub> to *t*<sub>*n*</sub>. This period is named *p*.

$$Exposure(p) = SZ(p) \times VE \tag{2}$$

where *VE* ∈ (*VER*, *VEW*, *VERT*, *VEC*)

VER = VE for activity type "run";

VEW = VE for activity type "walk";

VERT = VE for activity types "rest" and "transport";

VEC = VE for activity type "cycle".
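Equations (1) and (2) can be illustrated with a short Python sketch. The sample readings and the ventilation value `VE_REST` are placeholders chosen for the example and are not the values from [85].

```python
def aggregate_pollution(samples):
    """Equation (1): trapezoidal aggregation of (minute, concentration) samples."""
    total = 0.0
    for (t0, z0), (t1, z1) in zip(samples, samples[1:]):
        total += 0.5 * (z1 + z0) * (t1 - t0)
    return total

def exposure(samples, ve):
    """Equation (2): aggregated pollution multiplied by minute ventilation."""
    return aggregate_pollution(samples) * ve

# Hypothetical readings: (minutes since period start, PM2.5 in ug/m3)
samples = [(0.0, 10.0), (2.0, 12.0), (5.0, 9.0)]
VE_REST = 0.005  # m3/min, placeholder value for illustration only

sz = aggregate_pollution(samples)  # 22.0 + 31.5 = 53.5 (ug/m3 x min)
dose = exposure(samples, VE_REST)  # 53.5 x 0.005 = 0.2675 ug
```

The units work out as (μg/m3 × min) × (m3/min) = μg, which is why the exposure values are reported in micrograms.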


**Figure 4.** Aggregation of Indoor Pollution Values.

#### 3.2.5. Indoor and Outdoor Pollution Integration

The individual's location was tracked through the Moves mobile application. The recorded data from this tracking app include the starting and ending times, latitude, longitude, and activity type, as shown in Table 4. To obtain the corresponding pollution values for these periods, the time and location records were matched against the interpolated values from the created outdoor pollution map. The resulting outdoor pollution data were then matched against the aggregated indoor pollution values in Figure 4, in which the outdoor data were replaced by the corresponding indoor data.

Table 5 shows the resulting individual's indoor and outdoor PAPE values for PM2.5 with the respective period (i.e., starting and ending times), location (i.e., longitude and latitude), environment type (i.e., indoor or outdoor), activity type (i.e., transport, rest, walk, run, cycle), and minute ventilation (VE). The PAPE values are indicated in its last column "Exposure".

VE (m3/min) measures the volume of gas inhaled by an individual per minute. It varies with the type of activity. The type of activity or travel mode may have a significant effect on the exposure values [83,84] and, hence, it is important to account for VE. We obtained the VE values from a study on human inhalation rates [85]. The types of activities in the tracking app include "transport", "walk", "run", and "cycle" and are based primarily on the speed of movement of the individual. In this study, for the time periods that lack one of these types of activity data, we assumed that the individual was at "rest" (i.e., sleeping, sitting, etc.). Since VE is based primarily on the body movement of the individual, we used the same VE values for both activity types "transport" and "rest".

#### 3.2.6. Practical Application

To illustrate a possible IoT application [86,87] that can be developed using the proposed framework, we identified different travel routes and their corresponding forecasted PAPE values [60] that give the individual an opportunity to select a travel route that minimizes the risk of exposure to pollution.

As an example, we selected an entry in Table 5 for the time period 12:04:30 to 12:23:44 on 24 March 2017, in which the individual was outdoors and in transport mode. During this selected period, by using the starting and ending location data that the tracking app provided, we identified alternative routes using the ggmap package in R.

From this package, the estimated travel time and route locations (i.e., latitude and longitude) were obtained. Then, based on these specific time and location data, the corresponding pollution values were taken from the previously interpolated outdoor pollution values.


**Table 4.** Data from Tracking App.

**Table 5.** Integrated Indoor and Outdoor Personal Air Pollution Exposure (PAPE).


#### **4. Results and Discussion**

#### *4.1. Outdoor Pollution Model Performance*

With the adopted geostatistics-based approach [65,66], the hourly data [43] from the fixed network of pollution stations can be interpolated by using different techniques, so a criterion for technique selection is needed. Therefore, a cross validation with the leave out strategy was adopted. Based on the 5-fold cross validation performed for each of the three modeling techniques, among the 24 hourly datasets of PM2.5 captured on 24 March 2017, the Simple Kriging technique proved to be the best model with a selection occurrence of 13, followed by the Ordinary Kriging with 10, Co-kriging with 7, and IDW with 1, out of 24 datasets. Our results agree with previous studies such as [88], where Simple Kriging outperformed Co-Kriging, and [89], in which Simple Kriging turned out to be the best model for estimating NO2 and PM10.

It can be argued that some local effects like turbulence around buildings, roughness of constructions, and some other aspects impact the accuracy of the estimation. The techniques used in such dimensions can be those related to the integration of meteorological, chemical and transportation numerical modeling (WRF and CMAQ models), with the limitations of being able to precisely estimate the boundary conditions as well as to properly model the city configuration (buildings, trees, surface properties, etc.). When running with high spatial resolution, they produce good results, although the quality is slightly reduced and numerical stability becomes an issue [90]. Another potential contribution could be to use artificial-intelligence-based models to estimate pollution levels. In these fields, the authors have already made significant contributions. Actually, some papers [91] have shown the competitive advantage of these methods over those based on numerical simulations. However, to keep the implementation interoperable and extendable, interpolation was finally adopted, because it can easily be enriched with the data fusion option based on IoT-based, street level pollution sensors.

#### *4.2. Device Performance*

To validate the fully aggregated indoor pollution values (SZ(p)) obtained from the indoor monitor and e-beacon devices, they were matched against the pollution data that were measured simultaneously during the study period using a portable air pollution monitoring tool, Atmotube [92], that was carried by the individual.

Figure 5 shows the indoor VOC values measured from the Atmotube and the Foobot monitor on 2017-03-31. It can be seen that there is significant measurement variance between the two devices. Nevertheless, the measured values follow the same trend. There is no consistent Air Quality Index (AQI) provided for comparing the pollution values measured by each device. In agreement with [93], the AQI scales differ across countries, organizations, and devices, and this presents an obstacle for comparison and invalidates its usability, which emphasizes the need for a standardized awareness procedure.

**Figure 5.** Indoor VOC Values from Atmotube and Foobot for 2017-03-31.

The variance in measurements can be attributed to differences in the calibration and measurement methods used by these sensors (see Figure 5). This situation partly explains the observed difficulty in making people aware of the real importance of pollution, as different devices can report different figures for the same pollutants at the same place and time. It is another argument for a common framework such as the one proposed in this paper, because it fosters transparency and allows interpolated or modeled outdoor pollution values over time at a particular place to be compared with local, privately owned sensors at both outdoor and indoor locations. From such observations, to which different local sensors can contribute, a better understanding of outliers, commonalities, and trends can be derived.

#### *4.3. PAPE Values*

Figure 6 illustrates a color map of the average PM2.5 levels (μg/m3) for one day, in which the range of the specific values is presented on a color scale on the right. The location pins indicate the environment, activity type, time percentage (%), and the respective amount of PM2.5 (μg) that the individual was exposed to within the indicated time duration. It can be seen that the individual spent most of the day (62.78%) outdoors (i.e., outside the workplace) on the northwest side of the city where the highest daily average pollution level of 12 μg/m3 was concentrated, and this resulted in a total PM2.5 exposure of 52.7 μg.

Figure 7 shows the one day PM2.5 exposure levels by activity type. Based on this plot, the individual spent most of the day (88.13%) at rest and was exposed to approximately 70 μg of PM2.5 during this period. PM2.5 exposure values within a selected time period on the same day are also plotted in Figure 8, which shows that the individual had the highest pollution exposure at 15:32 in the afternoon during this selected period.

**Figure 6.** One Day PM2.5 Exposure Per Location, Activity Type, and Time Percentage.

From this analysis, the value of people being able to figure out the distribution of the total intensity of pollutants based on their activity becomes evident, as this method can make them aware of the real dimension of the problem and avoid classical myths, like the idea that most of the pollution is acquired outdoors (see Figure 6). While there are similar studies such as in [87], where the authors demonstrated the cleanest air routing algorithm for path navigation by calculating the PM2.5 exposure, they mainly focused on pollution acquired outdoors and not indoors.

Since information is key to making proper decisions, the advantage of a framework that integrates not only outdoor conditions but also indoor ones, when available, becomes more evident. This can have an impact not only at the individual level, by making everyone aware of the pollution levels they are exposed to, but at an aggregated level as well, because the public health dimension is affected when buildings are seen as actionable regarding their indoor conditions. Therefore, KPIs can be adopted by considering the gradient between outdoor and indoor levels per area of occupancy of the buildings. By having systematic monitoring inside, the management dimension can be adopted.

**Figure 7.** One Day PM2.5 Exposure by Activity Type Percentage.

**Figure 8.** Indoor PM2.5 Exposure Values Across Time. (DeltaT = 1 min).

#### *4.4. Alternative Travel Routes*

Similar to [87,94], another non-negligible dimension to consider is the impact on transportation decisions. Figure 9 shows different routes that one individual can take when moving from one location to another, and the corresponding aggregated pollution and exposure values are provided in Table 6. These values were predicted on the basis of the individual's activity data for 24 March 2017 from 12:04:30 to 12:23:44. The route most frequently adopted by the user was labeled "Actual", while the other potential routes were named A to C.

In this example, the best route for the individual is B, as it causes the least amount of PAPE at 0.769 μg, which is 22.75% lower than the actual exposure of 0.995 μg. However, the decision process can be more complex, because there will certainly be some uncertainty in the travel duration, which will consequently result in uncertainty about the total PAPE value of each alternative route.
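The comparison can be reproduced with a few lines of Python, using only the two exposure values quoted in the text. The values for routes A and C appear in Table 6 and are not repeated here. With these rounded inputs the reduction evaluates to roughly 22.7%, and the small difference from the reported 22.75% may come from rounding of the displayed values.

```python
def relative_reduction(actual, alternative):
    """Percentage by which `alternative` is lower than `actual`."""
    return (actual - alternative) / actual * 100.0

# Exposure values in micrograms, as quoted in the text
exposures_ug = {"Actual": 0.995, "B": 0.769}

best_route = min(exposures_ug, key=exposures_ug.get)
reduction = relative_reduction(exposures_ug["Actual"], exposures_ug[best_route])
```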


**Figure 9.** Alternative Travel Routes.

Although most of the tools that give routing solutions for transportation problems are based on duration, some of them are capable of filtering routes based on outdoor pollution exposure [87,94]. In terms of added value, this contribution enables alternatives to be ranked based on estimated pollution levels both outdoors and indoors, provided that pollution data are also available inside public transportation modes such as trains, buses, and subways. In these cases, as pollution forecasts are needed, machine-learning-based models that infer outdoor pollution values need to be used.


**Table 6.** Total PM2.5 and Exposure Values of the Different Routes.

## *4.5. Limitations*

Due to the lack of publicly available air quality information for other indoor areas such as shops, buses, cars, metros, etc., outdoor pollution data from the fixed-site outdoor monitors must be used in such cases. If more resources are available, additional monitoring IoT devices in other indoor areas will provide greater accuracy. In most cases, good results demand good inputs, and existing data are replaced whenever better data become available. Quality improvements can be expected from those actions. Smart city empowered data sharing platforms such as IOTA Tangle [95] would boost IoT-based indoor air quality resource availability.

Accuracy for outdoor pollution estimation is another known limitation, both because of the time resolution of the available data and because of the interpolation errors. It would be possible to implement numerical models such as Weather Research and Forecasting (WRF) and CMAQ. This decision requires significant effort, not only because an appropriate Digital Elevation Model (DEM) is required to represent the landscape and building configuration, which is a complex task, but also because the boundary conditions must be realistic. This means adopting pressure and wind speed conditions for all surfaces external to the volume of interest. These conditions need to be updated regularly throughout the day, as environmental conditions change. Numerical stability conditions must also be carefully managed in this case.

For future applications, the best solution will come from both the increasing deployment of dense (e.g., street-level) IoT-based air quality sensors and the growth of data sharing platforms, which can increase the available data and, consequently, the accuracy.
