Development of Real-Time IoT-Based Air Quality Forecasting System Using Machine Learning Approach

Yildiz, Onem; Sucuoglu, Hilmi Saygin

doi:10.3390/su17198531


by Onem Yildiz ¹ and Hilmi Saygin Sucuoglu ²,*

¹ Department of Electrical and Electronics Engineering, Aydin Adnan Menderes University, Aydin 09100, Türkiye
² Department of Mechanical Engineering, Aydin Adnan Menderes University, Aydin 09100, Türkiye
* Author to whom correspondence should be addressed.

Sustainability 2025, 17(19), 8531; https://doi.org/10.3390/su17198531

Submission received: 24 August 2025 / Revised: 19 September 2025 / Accepted: 19 September 2025 / Published: 23 September 2025

(This article belongs to the Special Issue Achieving Sustainability in New Product Development and Supply Chain)


Abstract

Air quality monitoring and forecasting have become increasingly critical in urban environments due to rising pollution levels and their impact on public health. Recent advances in Internet of Things (IoT) technology and machine learning offer promising alternatives to traditional monitoring stations, which are limited by high costs and sparse deployment. This paper presents the development of a real-time, low-cost air quality forecasting system that integrates IoT-based sensing units with predictive machine learning algorithms. The proposed system employs low-cost gas sensors and microcontroller-based hardware to monitor pollutants such as particulate matter, carbon monoxide, carbon dioxide and volatile organic compounds. A fully functional prototype device was designed and manufactured using Fused Deposition Modeling (FDM) with modular and scalable features. The data acquisition pipeline includes on-device adjustment, local smoothing, and cloud transfer for real-time storage and visualization. Advanced feature engineering and a multi-model training strategy were used to generate accurate short-term forecasts. Among the models tested, the GRU-based deep learning model yielded the highest performance, achieving $R^{2}$ values above 0.93 and maintaining latency below 130 ms, suitable for real-time use. The system also achieved over 91% accuracy in health-based AQI category predictions and demonstrated stable performance without sensor saturation under high-pollution conditions. This study demonstrates that combining embedded hardware, real-time analytics, and ML-driven forecasting enables robust and scalable air quality management solutions, contributing directly to sustainable development goals through enhanced environmental monitoring and public health responsiveness.

Keywords: air quality forecasting; environmental monitoring; Internet of Things (IoT); machine learning; prototyping; real-time prediction; sustainable development

1. Introduction

In recent years, deterioration in air quality has emerged as a serious environmental and health problem not only in highly industrialized countries, but also in many less developed countries [1]. Scientific studies clearly show that poor air quality triggers serious health problems in individuals from all segments of society—regardless of age and social status [2,3]. Air pollution not only negatively affects human health, but also the natural environment [4]. This pollution is a complex mixture of harmful particles and gases, sometimes exacerbated by increased levels of ultraviolet (UV) radiation [5].

Today, the Internet of Things (IoT) and cloud technologies, supported by extensive wireless sensor networks, play an important role in the construction of smart cities, which are seen as a key component of sustainable and livable cities. However, these technological solutions alone are not enough to eliminate environmental problems. In line with what Rickenbacker and colleagues define as “environmental awareness”, the effectiveness of these systems should be supported by coordinated social efforts to raise awareness in all segments of society [6].

The air in the atmosphere is polluted by a complex mixture of solid particles of various sizes, liquid droplets and chemical gases. The residence time of these pollutants in the atmosphere can vary from a few hours to several years. This duration is directly related to the chemical reactivity of the pollutant and its ability to interact with other components in the atmosphere. In addition, meteorological variables such as temperature, relative humidity, solar radiation, precipitation and wind speed also have a decisive influence on the distribution and solubility of pollutants [7].

Air pollutants may originate from both man-made and natural sources. Motor vehicles, industrial production facilities and fossil-fuel power plants are common anthropogenic sources, while forest fires, dust storms and volcanic eruptions are examples of natural ones. Primary pollutants are substances released directly from the source into the atmosphere, such as particulate matter (PM), carbon monoxide ($CO$), nitrogen dioxide ($NO_{2}$), sulfur dioxide ($SO_{2}$) and lead ($Pb$). In contrast, secondary pollutants are formed as a result of chemical reactions in the atmosphere and are usually detected downwind, away from the source. Examples include ozone ($O_{3}$) and secondary particulate matter such as sulfates, nitrates and organic aerosols.

Air pollution varies widely spatially and temporally, depending on pollutant type, emission source and meteorological conditions [5]. Therefore, continuous monitoring and recording of pollutant levels is critical to effectively combat air pollution. The United States Environmental Protection Agency (EPA) has identified six key pollutants as “criteria pollutants” for their health and environmental impacts [8]. These criteria pollutants are particulate matter (PM), carbon monoxide ($CO$), sulfur dioxide ($SO_{2}$), nitrogen dioxide ($NO_{2}$), lead ($Pb$) and ozone ($O_{3}$).

For public health communication, continuous AQI values are commonly mapped into health-based categories defined by CPCB/US-EPA breakpoints (Table 1). These categories (e.g., Good, Moderate, Unhealthy, Hazardous) provide an intuitive framework to assess air quality impacts.

In today’s rapidly urbanizing world, monitoring and predicting air quality is critical for public health. The integration of low-cost sensors and IoT technologies is becoming increasingly important in the development of air quality monitoring systems. Many studies have combined these technologies to develop effective, real-time air quality monitoring solutions for both indoor and outdoor conditions [10,11,12,13]. For example, portable systems that pair MQ-series sensors with embedded platforms such as Arduino or Raspberry Pi can monitor pollutants such as $CO$, $CO_{2}$ and PM2.5 instantaneously, display the data on LCD screens and transfer them to cloud platforms [14,15]. Some of these studies report that their systems were tested successfully in indoor spaces such as kitchens and in urban centers, and that they contribute to social awareness [16,17,18].

Similarly, studies on cloud-based systems have shown that PID-controlled dryers can make sensor data more reliable [18]. Sampling the measured parameters every second and transferring one-minute averages to the system has made high-resolution air quality analysis possible. The integration of these systems with platforms such as IBM Bluemix and ThingSpeak allows users to access data via mobile and web-based interfaces [15,19,20].

LoRa- and LoRaWAN-based systems, developed in response to the need for long-range communication in outdoor conditions, stand out for their low energy consumption and wide coverage. In systems developed using LoRaWAN, solar-powered sensor nodes monitor pollutants such as $NO_{2}$, $SO_{2}$ and $CO_{2}$ as well as environmental variables such as temperature and humidity [21,22]. Data from these systems, visualized through open platforms such as Grafana and ThingSpeak, show high accuracy when compared with measurements made with professional equipment [17,23]. In addition, IoT-based early warning systems used in forest areas provide effective solutions for the detection of gases (e.g., methane, hydrogen) that pose a fire risk [24].

In addition to these applications, some reviews in the literature focus on evaluating the performance and reliability of low-cost sensors. In particular, it has been reported that electrochemical and metal oxide-based sensors can be affected by environmental factors and can produce deviating results when used without calibration [25]. However, it is also emphasized that, with proper calibration and data processing techniques, these sensors can provide data close to the accuracy of reference systems [26].

Systematic reviews assessing indoor air quality show that IoT-based solutions are most commonly configured to monitor parameters such as temperature, humidity, $CO_{2}$, $CO$ and PM2.5 [23,27,28]. Although the majority of these systems are low-cost, it is noteworthy that a significant proportion do not provide calibration data. This indicates that users need to be more aware of sensor selection and system reliability [27].

Air quality monitoring and prediction have made significant progress in recent years thanks to the integrated use of machine learning (ML) techniques with environmental data. In numerous studies, different algorithms have been tested to predict Air Quality Index (AQI) values using air pollutants such as PM2.5, $NO_{2}$, $CO$ and $O_{3}$ together with meteorological variables. In particular, Support Vector Regression (SVR) [29,30], Random Forest (RF) [31,32] and ensemble-based methods [33] have been frequently preferred in AQI prediction tasks.

The SVR algorithm has been frequently used in regression-based estimation of AQI and has yielded successful results in various regions. In California [29] and New Delhi [30], the RBF (Radial Basis Function) kernel function provided the highest accuracy. However, Liang et al. [33] reported that the performance of Support Vector Machine (SVM) degrades especially for long-term (24 h) forecasts. Recent work with Quantum SVM (QSVM) provides accuracies beyond classical SVM and reduces computation time [34].

The Random Forest algorithm shows high performance, especially when used with satellite data. In the study for the Amman-Zarqa region in Jordan, $CO$ and $NO_{2}$ prediction was achieved with an $R^{2}$ of over 94% [32]. Similarly, the Random Forest model has also achieved high success in $NO_{x}$ forecasting in Italy [31].

Extensive comparisons show that ensemble models (especially stacking and AdaBoost) offer stable and high performance in short-term AQI forecasts. Liang et al. [33] reported that stacking gives the best $R^{2}$ and root mean square error (RMSE) values, while AdaBoost provides the best results for mean absolute error (MAE).

Among deep learning models, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)-based architectures are prominent in time-series forecasting. The LSTM model applied in Chennai provided high accuracy in short-term AQI forecasts when fed with attributes obtained from the Gray Level Co-occurrence Matrix [35]. In studies on the prediction of PM2.5 and PM10 levels with LSTM, hyperparameter optimization with meta-heuristic algorithms such as Grey Wolf Optimization (GWO), Genetic Algorithm (GA) and Whale Optimization Algorithm (WOA) significantly reduced prediction errors [36]. Moreover, the Autoregressive Integrated Moving Average (ARIMA) + GRU-based hybrid model [37] obtained better results compared to classical models.

Some studies have also focused on removing anomalies and deficiencies in the datasets used in the forecasting process. The AIrSense system developed by Rollo et al. repairs sensor data with three different anomaly detection algorithms and improves data quality before calibration [38]. In the classification of unbalanced data classes, scalable kernel-based SVM and Adjusting Kernel Scaling (AKS) methods were used to achieve over 99% accuracy [39].

All these studies reveal that the performance of ML models developed for AQI prediction varies depending on the target region, data type, selected attributes and model configuration. Moreover, the development of IoT systems and data collection infrastructure paves the way for these models to be used more widely in real-time applications [40,41]. Recent advances in intelligent gas detection demonstrate the synergy of nanostructured transducers and machine learning algorithms.

Yang et al. reported a wafer-scale hydrogen sensor that pairs a vertical thermal conduction architecture with a neural network predictor to achieve a 0.4 s response while consuming only 20 mW [42]. Complementing this device-level breakthrough, Zong et al. reviewed the integration of AI, flexible substrates and IoT connectivity in “smart gas sensors”, outlining challenges and opportunities for real-time environmental monitoring [43]. These studies underline the field’s transition toward data-driven, self-calibrating architectures, a direction also pursued in the present work.

In this study, we propose a real-time, low-cost and highly accurate air quality monitoring and forecasting system based on IoT and machine learning, which are frequently highlighted in the literature. The developed system is equipped with sensors that continuously monitor temperature, humidity and multiple pollutants such as particulate matter (PM), carbon dioxide ($CO_{2}$), carbon monoxide ($CO$) and volatile organic compounds (VOCs). The microcontroller-based hardware collects the data, stores them on the cloud platform and analyzes them in line with time-series features. In the forecasting phase, short-term AQI forecasts are produced with high accuracy using ensemble methods and deep learning models. Compared to traditional statistical methods, the proposed architecture provided similar or higher accuracy at lower computational cost and showed significant potential for environmental monitoring, public health warning systems and smart city applications, especially in urban environments. In this respect, this study demonstrates the applicability of environmental IoT systems and offers a holistic approach to air quality prediction.

Earlier studies have moved the needle either on sensor longevity—e.g., drift-hardened PM or MOX arrays that survive months in the field [13,14]—or on data-driven forecasting, where neural or tree-based models predict pollutant concentrations or AQI a few hours ahead [25,26,27]. Yet, to our knowledge, no work unifies these strands into a single, self-adjusting node that delivers categorical AQI forecasts and is validated for a humid Mediterranean coastal setting. This paper fills that void by introducing a low-power multi-gas platform with on-device adjustment, coupling it to an end-to-end, nightly retrainable machine learning pipeline, and demonstrating > 90% class-level accuracy over a year-long deployment in İzmir, Türkiye. The resulting dataset and open pipeline extend the geographic and methodological reach of intelligent AQI monitoring.

Despite significant advances in AQI prediction, most existing studies emphasize long-term forecasting or rely solely on regulatory-grade instruments, leaving a gap in short-term, real-time applications based on calibrated low-cost sensors. This study addresses that gap by combining sensor calibration with hybrid ensemble and recurrent learning models, thereby enabling robust short-horizon AQI forecasting with reduced latency.

The remainder of this paper is organized as follows: Section 2 describes the overall system architecture, including hardware design, sensor configuration, data acquisition mechanisms and the predictive model development process. Section 3 presents experimental results, including forecast accuracy metrics, performance analysis and operational evaluation of the prototype device. Finally, Section 4 concludes the paper with key takeaways and outlines potential directions for future research and deployment.

2. Materials and Methods

2.1. System Architecture of Air Quality Monitoring and Forecasting Device

The developed system is designed as a fully functional Internet of Things (IoT)-based architecture, enabling real-time environmental data acquisition, processing and transmission. The device integrates multiple low-cost gas and climate sensors with a microcontroller unit (Raspberry Pi 5), forming the edge layer of the IoT stack. Sensor data are collected at regular intervals, pre-processed locally and transmitted via Wi-Fi to a cloud-based server for storage, visualization and analysis. This architecture ensures seamless connectivity between hardware components, cloud infrastructure and forecasting algorithms, enabling real-time and scalable air quality monitoring aligned with IoT principles. The term “low-cost IoT sensors” refers to devices that are compact, commercially accessible, widely adopted in IoT-based environmental monitoring and capable of providing reliable measurements after calibration despite their simplicity and low power consumption. These attributes distinguish them from reference-grade sensors, while making them practical for scalable real-time deployments. The selected sensors are widely used in IoT-based air quality monitoring and operate within ranges sufficient to cover all relevant AQI brackets. Their nominal specifications, including measurement spans and operating limits, are summarized in Table 2.

In this study, the selection of sensors for pollutant measurement followed literature-supported decision processes. In particular, pollutants that have a direct impact on human health, such as PM2.5, $CO$, $CO_{2}$ and VOCs, were taken into consideration. The GP2Y10 dust sensor module, which is among the widely preferred sensors in the literature, was used for particulate matter measurement. It stands out for its pre-calibrated structure, low cost and high accuracy in the measurement range of 0–600 µg/m³. It also generates data in accordance with EPA standards for monitoring particles smaller than 2.5 µm. Although the Sharp GP2Y10 is an LED-based sensor, it was retained for its cost advantage; a site-specific two-point calibration against the reference station reduced RMSE to 6 µg/m³ in the 0–50 µg/m³ band, which is adequate for the AQI categories considered here. Thanks to these features, it is considered a suitable alternative for both indoor and outdoor air quality monitoring [48,49].

As gas sensors, the MQ-7 was selected for carbon monoxide and the Gravity ENS160 for multi-gas detection. The MQ-7 sensor is widely used in the literature for detecting $CO$. Its low cost, compact design and adaptability to a wide range of operating conditions make it suitable for urban and rural areas. However, in situ calibration is recommended for MQ-series sensors because the manufacturer’s data do not clearly state their accuracy, which has a direct impact on system performance [50,51]. For this reason, the MQ-7 sensor used in this study was calibrated in an open-air environment to increase its accuracy under field conditions. The Gravity ENS160 is a digital multi-gas sensor designed specifically for indoor air quality monitoring. It measures VOCs and e$CO_{2}$ (equivalent $CO_{2}$), which shows highly similar dynamics to NDIR (nondispersive infrared) $CO_{2}$. Volatile organic compounds (VOCs) are much more common in indoor environments than in outdoor environments. This is due to the abundance of indoor sources such as odors from respiration, sweating or human metabolism, and emissions from building materials and furniture. e$CO_{2}$ values also change in proportion to VOC levels and play an important role in determining the air quality index, especially in indoor environments. For this reason, the ENS160 was selected to obtain a more accurate output.

The DHT11 sensor was used to monitor temperature and humidity parameters. DHT11 can operate in the temperature range of 0–50 °C and relative humidity range of 20–90% and provides sufficient sensitivity for monitoring basic indoor comfort parameters. Although alternatives such as DHT22 or SHT21 have been proposed for applications requiring higher sensitivity [52,53,54,55], DHT11 was preferred in this study considering cost-effectiveness and system simplicity.

The Air Quality Index (AQI) is calculated in accordance with the guidelines provided by both the Indian Central Pollution Control Board (CPCB) and the United States Environmental Protection Agency (US-EPA). The AQI is determined individually for each pollutant, and the highest (i.e., the worst) sub-index among them represents the overall AQI value. Each sub-index is computed using a linear interpolation based on the pollutant’s concentration within defined breakpoint intervals.

The sub-index $I_{p}$ for a given pollutant $p$ is calculated using Equation (1):

$$I_{p} = \frac{I_{Hi} - I_{Lo}}{BP_{Hi} - BP_{Lo}} \left(C_{p} - BP_{Lo}\right) + I_{Lo} \tag{1}$$

where:

I_p: Sub-index value for pollutant p;
C_p: Observed concentration of pollutant p;
BP_Hi: The breakpoint concentration greater than or equal to C_p;
BP_Lo: The breakpoint concentration less than or equal to C_p;
I_Hi: AQI value corresponding to BP_Hi;
I_Lo: AQI value corresponding to BP_Lo.
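Equation (1) and the "worst sub-index wins" rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the breakpoint rows are the pre-2024 US-EPA 24 h PM2.5 values, used here only as placeholders for the breakpoints in Table 1.

```python
from typing import Dict, Sequence, Tuple

# (BP_Lo, BP_Hi, I_Lo, I_Hi); illustrative PM2.5 rows in µg/m³ (assumption:
# pre-2024 US-EPA values, to be replaced by the breakpoints of Table 1).
Breakpoint = Tuple[float, float, int, int]
PM25_BREAKPOINTS: Sequence[Breakpoint] = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
]


def sub_index(c_p: float, table: Sequence[Breakpoint]) -> float:
    """Linear interpolation of Equation (1) for one pollutant."""
    if c_p < 0:
        raise ValueError("concentration must be non-negative")
    for bp_lo, bp_hi, i_lo, i_hi in table:
        # The first row whose upper breakpoint covers c_p is used; a value
        # that falls in the small gap between two rows is assigned to the
        # upper row.
        if c_p <= bp_hi:
            return (i_hi - i_lo) * (c_p - bp_lo) / (bp_hi - bp_lo) + i_lo
    raise ValueError("concentration exceeds the breakpoint table")


def overall_aqi(sub_indices: Dict[str, float]) -> float:
    """The overall AQI is the highest (worst) pollutant sub-index."""
    return max(sub_indices.values())
```

For example, `overall_aqi({"pm25": sub_index(23.75, PM25_BREAKPOINTS), "co": 40.0})` returns the PM2.5 sub-index, because it is the larger of the two values.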

The first step in the designed air quality monitoring system was to connect the sensors to the Raspberry Pi 5 microcontroller. The DHT11 and ENS160 sensors were linked directly to the microcontroller, while the MQ-7 and GP2Y10 sensors were connected to the Raspberry Pi via the ADS1115 ADC module. Figure 1 provides a concise view of our air quality monitoring and forecasting system: Figure 1a outlines the complete hardware stack, Figure 1b details the individual components and their on-board connections and Figure 1c traces the end-to-end data pipeline from raw sensing to cloud-based AQI forecasting.

For establishing reliable ground truth values, all sensor measurements were paired with a certified reference station. In this study, we used an EN 15267-certified TSM-17 air quality monitoring station located in İzmir, Türkiye [56]. This station continuously records pollutant concentrations (PM2.5, $CO$, $CO_{2}$, among others) under European Union certification standards and is widely recognized for its accuracy and long-term stability. The Air Quality Index (AQI) target variable employed in our machine learning models was derived directly from synchronous measurements of this reference station, thereby ensuring that training and evaluation were based on standardized and validated pollutant concentrations.

2.2. Structure Design of the Air Quality Monitoring and Forecasting System Device

The air quality monitoring and forecasting system was designed using Parametric Solid Modeling techniques. The design process encompassed the development of enclosure components, hardware sub-assemblies and structural support elements. This included the systematic creation of sub-assemblies, the accurate placement of electronic and mechanical components and the formulation of the complete assembly model. The adopted methodology served as an effective framework for subsequent prototyping and physical integration stages.

Special attention was given to the constraints and requirements of Fused Deposition Modeling (FDM), particularly in terms of dimensional tolerances, friction fit considerations and printability of the designed components. Upon completion of the modeling phase, the essential technical documentation (part list, exploded view and detailed engineering drawings) was generated. Figure 2 illustrates the final assembly model, the corresponding part list, the exploded view and the engineering drawing.

2.3. Prototyping and Assembly of the Device

The enclosure components, hardware mounting parts and connection elements of the air quality monitoring and forecasting device were produced using the Fused Deposition Modeling (FDM) additive manufacturing method. To ensure adequate structural stiffness and mechanical strength, a 50% infill density was employed during the 3D printing process. The parts were printed using Hyper PLA material, selected for its fast-processing capability and suitability for prototyping.

A layer height of 0.2 mm was chosen to achieve a balance between printing speed and surface quality. Upon completion of the printing phase, all mechanical and electronic components were assembled using detachable connection mechanisms, allowing for modularity, ease of maintenance and potential reconfiguration. The final prototype of the device is illustrated in Figure 3.

In addition to the physical prototype, a graphical user interface (GUI) was developed to facilitate real-time visualization and user interaction with the air quality monitoring system. As shown in Figure 4, the GUI provides intuitive dashboards displaying live sensor readings, historical AQI trends and pollutant-specific indicators; the overall air-quality status is shown as a numeric code—a compact UBA-style [57] labeling that corresponds one-to-one with the EPA breakpoints [9] and is explained in detail in Section 3.2. This interface enhances system accessibility, supports non-technical users and enables more informed environmental awareness by translating complex data into actionable insights. The GUI is integrated with the cloud-based database and updates dynamically in synchrony with the data transmission pipeline.

2.4. Data Acquisition

This section details the architecture and processes involved in sensor data acquisition, transmission and pre-processing, which are essential for ensuring reliable and high-quality inputs to the forecasting models.

All measurements were collected in İzmir, Türkiye (38.45° N, 27.10° E) using four identical sensor nodes installed 10–25 m above ground. The nodes were distributed along a 4 km west–east transect at (i) a university rooftop, (ii) a six-lane arterial roadside, (iii) a low-traffic residential street and (iv) a seafront park edge, and logged continuously over a 12-month period (1 January–31 December 2024; 8700 hourly records per node). One node was also placed indoors for a 48 h office validation to cross-check the ENS160 e$CO_{2}$ output against an NDIR reference. All error statistics in Section 3 are calculated by pairing each node’s hourly reading with the synchronous measurement from the collocated EN 15267-certified TSM-17 reference station.

2.4.1. Sensor Network and Sampling

The prototype integrates four low-cost transducers that jointly capture the parameters most frequently highlighted in recent IAQ/ambient-AQ studies: GP2Y10 for PM2.5, MQ-7 for $CO$, Gravity ENS160 for TVOCs/e$CO_{2}$ and DHT11 for temperature–humidity monitoring. All sensors are polled once every 60 s by a Raspberry Pi 5; the MQ-7 and GP2Y10 outputs are digitized through an ADS1115 16-bit ADC to increase resolution, while ENS160 and DHT11 communicate via I²C and GPIO, respectively. The 1 min cadence provides a balance between temporal granularity (needed for short-term forecasts) and on-board power-compute constraints. The VOC and e$CO_{2}$ outputs, intended primarily for indoor ventilation assessment, are retained in the outdoor data stream only for completeness; they are excluded from AQI calculations and serve solely as advisory flags when extreme excursions (e.g., solvent release in an enclosed bus bay) occur.
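As a small illustration of the digitization step, the ADS1115 reports each conversion as a signed 16-bit code whose full scale corresponds to the programmable-gain range. The sketch below converts such a code to volts; the ±4.096 V range is an assumption (the gain setting used on the prototype is not stated), and the function name is hypothetical.

```python
ADS1115_FULL_SCALE_CODE = 32768  # 2**15: magnitude of the signed 16-bit range


def counts_to_volts(counts: int, fsr_volts: float = 4.096) -> float:
    """Convert a signed ADS1115 output code to volts.

    fsr_volts is the full-scale range selected by the programmable-gain
    amplifier; 4.096 V is an assumed setting, giving 125 µV per count.
    """
    if not -32768 <= counts <= 32767:
        raise ValueError("code outside the signed 16-bit range")
    return counts * fsr_volts / ADS1115_FULL_SCALE_CODE
```

With the assumed range, one count corresponds to 125 µV, which is the resolution available to the analog MQ-7 and GP2Y10 outputs before calibration.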

Independent field campaigns have confirmed that each transducer in the proposed node remains decision-grade over multi-month deployments. A ten-month rooftop study involving 16 Sharp GP2Y1010 units reported a reversible mean bias drift of +4 µg/m³ after routine compressed-air cleaning, well within the U.S. EPA Class III equivalence margin for PM2.5 sensors [58]. For $CO$ measurement, Kobbekaduwa et al. showed that MQ-7 elements retained ≥95% sensitivity and <±5% span drift after 180 h high-temperature cycling, provided a two-point recalibration was performed quarterly [59]. Finally, recent MOX arrays that embed ageing-compensation algorithms, exemplified by the BME/ENS-class TVOC sensors, maintained concentration error below ±2% during a 12-week indoor trial [60]. These convergent findings support the assumption that, under the maintenance schedule adopted here, the sensor suite can supply reliable AQI data for year-scale deployments.

2.4.2. Edge-Cloud Data Path

Figure 1 depicts the hardware topology. On the software side, raw readings are time-stamped in JSON and pushed over Wi-Fi to a Docker-hosted InfluxDB instance. Each record is tagged with sensor-id, location and firmware revision to simplify versioning and downstream analytics.
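A record of this kind could be assembled as follows. The sketch only builds and serializes the JSON document with the Python standard library; the key names (`time`, `tags`, `fields`, `calibration`) and the function name are illustrative assumptions, and the actual transfer to InfluxDB is deliberately left out.

```python
import json
from datetime import datetime, timezone
from typing import Mapping, Optional


def build_payload(
    sensor_id: str,
    location: str,
    firmware: str,
    readings: Mapping[str, float],
    calibration: Optional[Mapping[str, float]] = None,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one time-stamped, tagged sensor record as JSON."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "time": timestamp,
        # Tags named in the text: sensor-id, location, firmware revision.
        "tags": {"sensor_id": sensor_id, "location": location, "firmware": firmware},
        "fields": dict(readings),
    }
    if calibration is not None:
        # Calibration coefficients travel with every payload (Section 2.4.3).
        record["calibration"] = dict(calibration)
    return json.dumps(record, sort_keys=True)
```

Keeping the tags separate from the measured fields mirrors the tag/field distinction of time-series databases and makes it straightforward to filter records by node or firmware revision later.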

2.4.3. On-Device Calibration

To mitigate well-documented drifts in metal oxide and optical sensors, an in situ routine precedes deployment:

MQ-7: 24 h outdoor exposure, followed by two-point $CO$ calibration (0 ppm and 50 ppm span);
GP2Y10: zero-check with HEPA-filtered air and span verification at 200 µg/m³ synthetic dust;
ENS160: factory-supplied calibration curves; cross-checked against a benchtop NDIR $CO_{2}$ meter.

Calibration coefficients are stored in non-volatile memory and appended to every payload.

Detailed calibration curves, coefficients and comparative pre/post plots for the MQ-7 and GP2Y10 sensors are provided in Figure 5 and Figure 6, demonstrating the effectiveness of the two-point adjustment against the EN 15267-certified TSM-17 reference station.
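The two-point adjustment can be illustrated with a simple gain/offset fit through the zero and span points. This sketch assumes a linear response between the two points, which the text does not state explicitly (MQ-series sensors in particular have a nonlinear characteristic, so the linear form is only an approximation); the function names and the example voltages are hypothetical.

```python
from typing import Tuple


def two_point_coefficients(
    raw_zero: float, raw_span: float, ref_zero: float, ref_span: float
) -> Tuple[float, float]:
    """Return (gain, offset) mapping raw readings onto reference values."""
    if raw_span == raw_zero:
        raise ValueError("zero and span readings must differ")
    gain = (ref_span - ref_zero) / (raw_span - raw_zero)
    offset = ref_zero - gain * raw_zero
    return gain, offset


def apply_calibration(raw: float, gain: float, offset: float) -> float:
    """Apply the stored coefficients to a new raw reading."""
    return gain * raw + offset


# Hypothetical MQ-7 example: 0.4 V in clean air (0 ppm), 2.4 V at the 50 ppm span.
gain, offset = two_point_coefficients(0.4, 2.4, 0.0, 50.0)
mid_scale_ppm = apply_calibration(1.4, gain, offset)  # reading halfway between the points
```

Storing only `gain` and `offset` keeps the per-payload overhead small, which fits the statement that the coefficients are appended to every record.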

2.4.4. Pre-Processing Pipeline

The pre-processing pipeline comprises four steps:

Missing values: forward-fill (LOCF) for ≤5 consecutive samples; longer gaps are flagged;
Outlier removal: Z-score filtering (|z| > 3);
Smoothing: 5 min centered moving average to attenuate sensor noise;
Feature synthesis: pollutant-specific sub-indices computed with the CPCB/US-EPA break-point formula; the maximum sub-index becomes the target AQI label.
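The first three steps can be sketched with pandas as shown below. This is an illustrative reading of the list rather than the authors' code: it assumes a 1 min sampled series (so a five-sample window spans 5 min), treats "removal" of an outlier as replacing it with NaN, and uses hypothetical function names.

```python
import pandas as pd


def fill_short_gaps(s: pd.Series, max_gap: int = 5):
    """Forward-fill NaN runs of at most max_gap samples; flag longer runs."""
    is_na = s.isna()
    run_id = (is_na != is_na.shift()).cumsum()        # label consecutive runs
    run_len = is_na.groupby(run_id).transform("sum")  # NaN count within each run
    short = is_na & (run_len <= max_gap)
    long_gap = is_na & (run_len > max_gap)
    # Only short gaps take the carried-forward value; long gaps stay NaN.
    return s.where(~short, s.ffill()), long_gap


def remove_outliers(s: pd.Series, z_max: float = 3.0) -> pd.Series:
    """Replace samples with |z| > z_max by NaN."""
    z = (s - s.mean()) / s.std()
    return s.mask(z.abs() > z_max)


def smooth(s: pd.Series, window: int = 5) -> pd.Series:
    """Centered moving average over `window` samples."""
    return s.rolling(window, center=True, min_periods=1).mean()


def preprocess(s: pd.Series):
    """Chain gap filling, outlier removal and smoothing for one channel."""
    filled, long_gap = fill_short_gaps(s)
    return smooth(remove_outliers(filled)), long_gap
```

Note that a plain `ffill(limit=5)` would also fill the first five samples of a longer gap, so the run-length mask is used to leave long gaps untouched and flag them instead.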

2.5. Training Strategy

In this section, the methodological framework for model training is described, encompassing feature engineering, model selection and optimization procedures. In this study, AQI was first predicted as a continuous variable using regression models. These continuous outputs were then discretized into standard AQI categories (Good, Moderate, Unhealthy, etc.) according to CPCB/US-EPA breakpoints. Classification performance metrics were therefore derived from discretized regression outputs rather than from a separately trained categorical model.
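The discretization step, and the category-level accuracy derived from it, can be sketched as follows. The category edges and labels follow the commonly used US-EPA AQI scale and are an assumption here, since Table 1 defines the breakpoints actually applied; the function names are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence

# Assumed upper edges of the AQI categories (US-EPA style scale).
AQI_UPPER_EDGES = [50, 100, 150, 200, 300]
AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_category(aqi: float) -> str:
    """Map a continuous AQI value to its health-based category."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    # bisect_left keeps a value equal to an edge (e.g. 50) in the lower category.
    return AQI_LABELS[bisect_left(AQI_UPPER_EDGES, aqi)]


def category_accuracy(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Share of samples whose predicted and reference categories match."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(aqi_category(p) == aqi_category(r) for p, r in zip(predicted, reference))
    return hits / len(predicted)
```

Because the categories are derived from the regression output, a small numerical error near a breakpoint can flip the category, so class-level accuracy and regression error are related but not interchangeable metrics.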

Hourly time-aligned streams of PM2.5, $CO$, $CO_{2}$, TVOC, temperature and relative humidity are first cleaned and gap-filled (Section 2.4.4). For each record, we construct an eight-feature vector (the six raw measurements plus hour-of-day and day-of-week flags), while the AQI category derived from co-located reference concentrations serves as the label. The dataset is split chronologically into 70% training, 15% validation and 15% test subsets to prevent future-to-past leakage. Inputs are z-score-scaled for tree/boosting models and Min-Max-scaled for recurrent networks. Hyper-parameters are tuned via a five-fold, blocked time-series cross-validation wrapped in an Optuna Bayesian search (~300 trials per model class) that jointly minimizes validation RMSE and maximizes $R^{2}$. The best configuration is promoted to production, and all artefacts (parameter grids, trained weights, scalers and metrics) are version-controlled in MLflow, enabling fully reproducible nightly retraining.

2.5.1. Feature Engineering

For each pollutant and meteorological channel, we derive the following:

Lag terms t − 1, t − 3, t − 6 h (capturing short-range persistence);
Rolling statistics over 1 h (mean, median, variance);
Day-of-week and hour-of-day one-hot encodings (capturing periodicity).
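
A minimal sketch of these three feature families is given below. It assumes an hourly series for the lag terms and expresses the rolling window in samples, since a 1 h window only carries information when the raw stream is sampled more often than hourly. The trailing (past-only) window is also an assumption, chosen so that no future samples enter a feature. Function names are hypothetical.

```python
from datetime import datetime
from statistics import mean, median, pvariance


def lag_features(series, lags=(1, 3, 6)):
    """One dict per time step; lags reaching before the start of the series are None."""
    return [
        {f"lag_{k}": (series[t - k] if t - k >= 0 else None) for k in lags}
        for t in range(len(series))
    ]


def rolling_stats(series, window):
    """Trailing-window mean, median and (population) variance; window is in samples."""
    rows = []
    for t in range(len(series)):
        chunk = series[max(0, t - window + 1): t + 1]
        rows.append({
            "mean": mean(chunk),
            "median": median(chunk),
            "var": pvariance(chunk),
        })
    return rows


def one_hot_time(ts):
    """24 hour-of-day flags followed by 7 day-of-week flags (Monday first)."""
    hour = [0] * 24
    dow = [0] * 7
    hour[ts.hour] = 1
    dow[ts.weekday()] = 1
    return hour + dow
```

For example, `one_hot_time(datetime(2025, 1, 6, 8))` returns a 31-element vector with exactly two ones: position 8 (08:00) and position 24 (Monday).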

2.5.2. Model Portfolio

To capture both the nonlinear tabular relationships revealed by engineered features and the temporal correlations inherent in raw sequences, we selected a balanced quartet of algorithms. Two belong to the tree/boosting family, offering high interpretability and fast inference on edge hardware, while the other two are recurrent neural networks capable of learning long-range dependencies directly from time-ordered data. This mix reflects the methods most frequently reported as top AQI forecasters in the recent literature and provides a clear basis for ablation between classical ensembles and deep learning approaches. Table 3 summarizes the four candidate learners evaluated in this study—two tree-based ensembles (Stacking RF + SVR + XGB and AdaBoost.R2) and two recurrent networks (LSTM and GRU). For each model, we list the underlying rationale and the Optuna-defined hyper-parameter ranges explored during the five-fold blocked time-series search. These search spaces produced, on average, 300 trials per learner; the configurations that minimized validation RMSE are reported in the Results Section. All implementations rely on scikit-learn 1.5 or Keras/TensorFlow 2.19 and are version-controlled in MLflow, ensuring full reproducibility.

2.5.3. Training Pipeline

To transform the cleaned sensor streams into deployable forecasters, we adopt a six-stage, fully automated workflow. Each supervised instance is first defined by a one-hour-lagged feature vector (raw PM2.5, CO, CO₂, TVOC, temperature, relative humidity and the hour-of-day and day-of-week flags), while the prediction target is the AQI category at the forecast hour, obtained from co-located reference concentrations via the breakpoint scheme described in Section 2.1; previously computed AQI values are deliberately excluded to avoid label leakage. First, the time-ordered dataset is divided chronologically into 70% training, 15% validation and 15% test segments so that no future information leaks into the past. All features are then normalized (tree/boosting models receive z-score scaling, whereas the recurrent networks are fed Min-Max-scaled inputs) to keep gradient magnitudes and feature importances comparable across channels.
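
The chronological split and the two scaling schemes can be sketched as follows. Fitting the scaler statistics on the training segment only is an assumption made here (it is the usual way to keep validation and test information out of the inputs); the article does not spell this detail out. Function names are hypothetical.

```python
from statistics import mean, pstdev


def chronological_split(rows, train_pct=70, val_pct=15):
    """Split time-ordered rows into train/validation/test without shuffling."""
    n = len(rows)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return rows[:i], rows[i:j], rows[j:]


def fit_zscore(train_column):
    """Return a z-score scaler whose mean and std come from the training column only."""
    mu, sigma = mean(train_column), pstdev(train_column)

    def scale(x):
        return (x - mu) / sigma if sigma else 0.0

    return scale


def fit_minmax(train_column):
    """Return a Min-Max scaler whose bounds come from the training column only."""
    lo, hi = min(train_column), max(train_column)

    def scale(x):
        return (x - lo) / (hi - lo) if hi > lo else 0.0

    return scale
```

With 8700 hourly records per site, this split yields 6090 training, 1305 validation and 1305 test rows. Because the Min-Max bounds come from the training data, later values outside that range map outside [0, 1].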

Next comes hyper-parameter exploration: a five-fold, blocked time-series cross-validation loop is wrapped in an Optuna Bayesian search that typically evaluates 300–400 trials per model class. Throughout this loop, we optimize a composite objective that minimizes validation RMSE while simultaneously maximizing R². Once the search converges, the candidate achieving the lowest RMSE (and, in a tie, the highest R²) is promoted to the production slot. All artefacts (parameter grids, trained weights, scaler objects and performance metrics) are version-controlled in MLflow, giving the pipeline nightly re-training capability without sacrificing reproducibility or auditability. For clarity, Table 4 provides a complete summary of the input features used for training the machine learning models, along with the target label (AQI category) derived from the certified reference station. The ENS160 sensor was not included in the AQI calculation, as its composite output does not correspond to pollutant-specific breakpoints defined by US-EPA. Instead, it was employed as an auxiliary alarm channel to detect sudden changes in volatile compounds, thereby complementing the AQI framework without altering its regulatory definition.
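
The fold construction and the promotion rule can be sketched without any search library. The version below is one common reading of "blocked time-series cross-validation" (contiguous blocks, with each validation block strictly after its training data); the exact fold layout used in the study is not given, so this layout is an assumption. The Bayesian search itself is not shown, and the function names are hypothetical.

```python
def blocked_time_series_folds(n_samples, n_folds=5):
    """Yield (train_indices, val_indices); validation always follows training in time."""
    block = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The last fold absorbs any remainder left by the integer division.
        val_end = n_samples if k == n_folds else (k + 1) * block
        yield range(0, train_end), range(train_end, val_end)


def pick_best(candidates):
    """Promotion rule from the text: lowest RMSE wins, ties go to the highest R^2.

    Each candidate is a dict with at least the keys 'rmse' and 'r2'.
    """
    return min(candidates, key=lambda c: (c["rmse"], -c["r2"]))
```

A trial's score would then be the mean validation RMSE (and R²) over the five folds, with `pick_best` applied to the finished trials.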

3. Results and Discussion

3.1. Forecast Accuracy on Short Horizons

Table 5 reports the mean absolute error (MAE), root mean square error (RMSE) and coefficient of determination (R²) achieved by the four candidate models for 30 and 60 min look-ahead horizons. The recurrent models (LSTM, GRU) edged out the classical ensembles by approximately 2–3% on MAE while remaining comfortably within the 200 ms inference-time budget described in Section 2.5.3.
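
For reference, the three metrics can be computed as in the sketch below, using their standard definitions. The sample values in the usage note are invented for illustration and are not results from this study.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

With the invented values `y_true = [50, 60, 70, 80]` and `y_pred = [52, 58, 71, 79]`, the MAE is 1.5, the RMSE is about 1.58 and R² is 0.98.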

The forecasting matrix contains 34,800 hourly records gathered from four sensor nodes (8700 samples × 4 sites). Each record comprises eight predictors (raw PM2.5, CO, CO₂, TVOC, temperature, relative humidity, hour-of-day and day-of-week) and one categorical label with five AQI classes. Class counts are Good (15,660; 45%), Moderate (9396; 27%), Unhealthy-for-Sensitive-Groups (5220; 15%), Unhealthy (3480; 10%) and ≥Very Unhealthy (1044; 3%). Point-biserial correlations with the integer-encoded AQI label indicate that PM2.5 is the strongest single predictor (ρ = 0.71), followed by CO (ρ = 0.46) and TVOC (ρ = 0.38), while meteorological variables contribute mainly through interactions with the temporal flags. These statistics set the dimensionality, class balance and predictor relevance context for the forecasting results reported in Table 5. Unless stated otherwise, we report metrics on the pooled test set, formed by concatenating the held-out hourly node–reference pairs from all four sites. Training/validation/test splits remain strictly chronological within each site; pooling is applied only for reporting so that all locations contribute simultaneously to the final figures.

In addition to the statistical metrics in Table 5, Figure 7 presents scatter plots of actual vs. predicted AQI values at the 60 min horizon for all four candidate models. Each plot includes a regression line (y = α + βx) and the 1:1 reference. The slopes (β) highlight that the GRU and LSTM models remain close to unbiased, while ensemble methods tend to underestimate AQI.
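
The regression line drawn on such a scatter plot is an ordinary least-squares fit of predicted on actual values. A minimal sketch, with a hypothetical function name and invented sample data:

```python
def fit_line(actual, predicted):
    """Ordinary least-squares fit of predicted = alpha + beta * actual."""
    n = len(actual)
    mean_x = sum(actual) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in actual)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, predicted))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta
```

For the invented pairs `actual = [0, 50, 100, 150]` and `predicted = [5, 50, 95, 140]`, the fit gives α ≈ 5 and β ≈ 0.9. A slope below 1 of this kind means that high AQI values are progressively underestimated, which is the pattern the text attributes to the ensemble methods, whereas a slope near 1 with a small intercept indicates a nearly unbiased model.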

In addition to scatter plots, Figure 8 displays time-series traces of actual and predicted AQI over a representative 7-day window at the 60 min horizon. The recurrent models (LSTM/GRU) closely track diurnal peaks and troughs, whereas the ensemble methods exhibit reduced peak amplitude and mild phase lag, especially during morning/evening traffic hours. These observations are consistent with the numerical results in Table 5 and further support the GRU model’s superior temporal fidelity.

For ventilation control and early warning—tasks that rarely require predictions beyond 60 min—the GRU strikes the best balance between error, latency and model size (≈1.4 MB). The superior performance of the GRU can be attributed to its gated recurrent architecture, which captures temporal dependencies more efficiently than classical ensemble learners while requiring fewer parameters than LSTM networks. This efficiency reduces overfitting risk and enhances learning of short- to medium-horizon patterns, which are particularly important for AQI prediction. It should be noted that the validation presented here was performed intra-site, with training/validation/test splits confined to the same station. Broader cross-site or multi-regional evaluations remain as future work to confirm generalizability across heterogeneous environments.
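
The parameter argument can be made concrete with the textbook GRU update. The equations below use one common sign convention for the update gate, and the parameter counts follow the standard single-bias formulation; some implementations add a second bias vector per gate, which changes the totals slightly. The input width d = 8 matches the eight predictors used here, while the hidden size h = 64 is a hypothetical value chosen only for the arithmetic.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)}\\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)}\\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new state)}\\[6pt]
N_{\mathrm{GRU}} &= 3\left(hd + h^{2} + h\right), \qquad
N_{\mathrm{LSTM}} = 4\left(hd + h^{2} + h\right)\\
d = 8,\ h = 64:\quad N_{\mathrm{GRU}} &= 3 \times 4672 = 14{,}016, \qquad
N_{\mathrm{LSTM}} = 4 \times 4672 = 18{,}688
\end{aligned}
```

For equal layer sizes, a GRU layer therefore carries three quarters of the recurrent parameters of an LSTM layer, because it has three weight blocks instead of four.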

3.2. Agreement with Health-Based AQI Categories

To gauge public-facing utility, predictions were discretized into the six health breakpoints listed in Table 1. Categorical accuracy exceeded 91% for “Good” and “Moderate” bands and 88% for “Unhealthy” bands. Crucially, false negatives in the “Hazardous” band never exceeded 2%—well below the 5% threshold recommended by US-EPA guidance. This demonstrates that the model does not underestimate severe events, a critical safety requirement.

In addition to the class-wise metrics reported in Table 6, Figure 9 shows the confusion matrix of the GRU model. The diagonal dominance confirms that most predictions fall into the correct AQI category, with only minor confusion between adjacent classes such as “Unhealthy for Sensitive Groups” and “Unhealthy”.

Class-wise performance metrics are summarized in Table 6; the macro-F1 (0.83) and weighted-F1 (0.90) confirm that the overall 91% accuracy is not driven solely by majority classes. Values are averaged over the 5-fold cross-validation.
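
The difference between the two averages can be seen in a short sketch that computes per-class F1 from true and predicted labels. Function names and the sample labels in the usage note are hypothetical and are not data from this study.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 per class, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Macro-F1 weights classes equally; weighted-F1 weights them by support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / len(y_true)
    return macro, weighted
```

With eight "Good" samples all predicted correctly and two "Moderate" samples of which one is predicted as "Good", the per-class F1 values are 16/17 and 2/3, giving a macro-F1 of about 0.80 and a weighted-F1 of about 0.89. A macro value well below the weighted one is the usual sign that minority classes are predicted less reliably than majority classes.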

3.3. Influence of Sensor Selection

The manufacturer ranges in Table 2 show that all transducers span at least two full AQI brackets beyond local regulatory limits. That margin, combined with the on-device two-point calibration (Section 2.4.3), prevents hard saturation during high-pollution spikes—an issue that hampered Kurnia et al. [61], who used an SDS011 optical counter capped at 500 µg/m³ and reported 23% clipping in “Very Unhealthy” episodes. No clipping was observed in our trials, eliminating the need for synthetic data augmentation. This absence of clipping is consistent with the manufacturer-specified ranges listed in Table 2, which ensured that all sensors operated within their linear domain throughout the deployment.

3.4. Comparison with Recent Literature

Despite using commodity sensors and an embedded CPU, the proposed pipeline matches or outperforms workstation-grade models when evaluated at the same 60 min horizon. The gain stems from the richer multi-gas feature set (CO, CO₂ and VOC sub-indices from Table 1 and Table 7) and Bayesian hyper-parameter search (Section 2.5.3) that yielded deeper yet less over-fitted forests and optimally sized recurrent layers.

Table 8 underscores how the proposed pipeline remains competitive, often superior, when placed side-by-side with the latest peer-reviewed studies. Das et al. [62] reported a 1 h RMSE of 15.2 µg/m³ using Random Forests on fixed monitoring stations; our system delivers 13.8 µg/m³ despite running on a low-cost, mobile sensor node—demonstrating that hardware constraints do not inevitably degrade forecast accuracy. Zhou et al. [63] achieved 12.7 µg/m³ with an attention-augmented LSTM, but only under GPU acceleration; by contrast, our GRU/LSTM models achieve <130 ms edge-latency on a Raspberry Pi 5, making real-time deployment practical. Kurnia et al. [61] suffered 23% data clipping because their SDS011 sensor saturates above 500 µg/m³; thanks to the broader operating ranges listed in Table 2, no saturation was observed in our experiments.

These comparisons highlight how careful model–hardware co-design—lightweight seq-to-one RNNs converted to TensorFlow-Lite, an enriched multi-gas feature set, and Bayesian hyper-parameter optimization—reproduces lab-grade error levels on resource-constrained devices, and in some cases improves upon them. The result is both a cost-effective platform and a blueprint for scalable, real-time early-warning networks that can be rolled out across heterogeneous urban micro-climates.

3.5. Major Contributions and Insights

To provide a comprehensive evaluation of the system’s deployment feasibility, this section outlines the key operational strengths and practical limitations observed during prototyping and real-time testing.

Real-time operability—Edge latency (<130 ms) enables minute-by-minute updates without throttling the Raspberry Pi 5, meeting WHO dashboard guidelines;
Hardware resilience—Sensor ranges (Table 2) comfortably exceed worst-case urban episodes, preventing saturation and false reassurance during wild-fire smoke events;
Energy footprint—GRU inference adds <150 mW to the 4 W node budget, permitting solar-cell or battery operation;
Scalability—Because only a 1.4 MB TFLite artefact is hot-swapped nightly, dozens of nodes can be updated simultaneously over LoRaWAN back-haul.

Two caveats remain. First, accuracy degrades beyond 60 min, echoing the finding of Ulpiani et al. [64] that local meteorology dominates longer horizons. Second, the model was trained in a temperate coastal micro-climate; domain-adaptation studies are under way to ensure portability to arid and high-altitude regions. Season-specific evaluation shows that the Random Forest forecaster maintains MAE ≈ 5 AQI units in winter, spring, summer and autumn (ΔMAE < 1 unit), indicating year-round robustness.

In summary, the proposed end-to-end stack (low-cost sensors with adequate dynamic range, rigorous calibration, automated feature engineering and lightweight seq-to-one learners) delivers state-of-the-art short-term AQI forecasts while remaining fully deployable on commodity edge hardware, closing a key gap between lab-grade algorithms and real-world early-warning systems. The next hardware revision will integrate low-power electrochemical cells for NO₂ and O₃, targeting sub-20 ppm detection limits. Adding these oxidizing species will extend the current framework to a more comprehensive AQI calculation and address the health-relevant gaps identified in this study. In addition to these operational benefits, the system’s methodological framework, spanning rigorous calibration, structured data pre-processing and optimized model selection, provides a reproducible template that strengthens its academic contribution to the field of air quality forecasting.

4. Conclusions

This study presents the development and evaluation of a real-time, low-cost and IoT-based air quality monitoring and forecasting system that integrates embedded hardware, wireless communication and machine learning models. The prototype, fabricated via Fused Deposition Modeling (FDM), features calibrated gas and climate sensors and operates through a microcontroller-based architecture supported by edge-cloud data transmission. A GRU-based deep learning model was found to offer the best forecasting performance, achieving an R² value above 0.93 with inference latency below 130 ms, and successfully classified AQI categories with over 91% accuracy. Some limitations of the system should be acknowledged. Calibration procedures may require adaptation to varying environmental conditions, and the long-term stability and accuracy of low-cost sensors may be affected by sensor drift or external factors. In particular, the MQ-7 sensor used for CO measurement presents inherent challenges: its nominal detection threshold lies above levels already hazardous for human exposure, and individual units often display distinct response curves. Consequently, even simple sensor replacement may necessitate recalibration or retraining of the forecasting models. Moreover, the system is currently optimized for short-term AQI forecasting and does not address long-term pollution trends or emission source identification. Broader geographical testing and integration of additional environmental parameters, as well as cross-site validation strategies such as leave-one-site-out, will be valuable for future studies to further assess robustness across heterogeneous urban micro-environments. A subsequent hardware revision will replace the MQ-7 with a low-power electro-chemical CO cell (target LOD < 20 ppm) to provide accurate baseline measurements while preserving the existing power budget. Given the inherent limitations of low-cost sensors, such as sensitivity to humidity, temperature extremes and drift, sustained real-world accuracy will depend on periodic recalibration and validation across diverse environmental contexts. Despite these limitations, the outcomes of this work underscore the feasibility and effectiveness of integrating predictive analytics with embedded IoT systems to deliver scalable, sustainable and intelligent air quality forecasting solutions.
Beyond its engineering feasibility, this study makes a scholarly contribution by bridging low-cost IoT sensing with advanced machine learning forecasting in a unified, reproducible framework. It extends the literature by demonstrating long-term, real-world validation in a Mediterranean urban micro-climate, an underrepresented context, and by providing a methodological template—combining calibration protocols, data pipelines and lightweight deep learning—that can inform and guide future academic research in environmental informatics and urban sustainability planning. The system offers strong potential for deployment in smart city applications, contributing to public health protection and environmental sustainability through affordable, modular and energy-efficient technology.

This study has certain limitations that should be acknowledged. Validation was performed on an intra-site basis, which may constrain the generalizability of the results to other environments. In addition, the sensor network covered only a limited number of locations, and the pollutant set was restricted to PM2.5, CO, CO₂ and TVOC. While these choices reflect practical constraints of low-cost sensing, future studies could expand the evaluation to multi-site and multi-regional deployments and include additional pollutants such as NO₂ or O₃ for a more comprehensive assessment. Furthermore, although the models demonstrate strong accuracy for short horizons (30–60 min), extending the approach to longer-term predictions and assessing its performance under large-scale real-world deployments remain important directions for further research. While the present results confirm robust intra-site performance, future studies will extend validation across multiple sites and diverse regions to strengthen the transferability of the proposed approach.

Author Contributions

Conceptualization, H.S.S. and O.Y.; methodology, O.Y.; software, O.Y.; validation, H.S.S.; investigation, H.S.S.; data curation, O.Y.; writing—original draft preparation, O.Y.; writing—review and editing, H.S.S.; visualization, H.S.S.; supervision, O.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in this article. Further inquiries can be directed at the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kumar, P.; Rivas, I.; Singh, A.P.; Ganesh, V.J.; Ananya, M.; Frey, H.C. Dynamics of Coarse and Fine Particle Exposure in Transport Microenvironments. NPJ Clim. Atmos. Sci. 2018, 1, 11.
2. Ghorani-Azam, A.; Riahi-Zanjani, B.; Balali-Mood, M. Effects of Air Pollution on Human Health and Practical Measures for Prevention in Iran. J. Res. Med. Sci. 2016, 21, 65.
3. He, G.; Fan, M.; Zhou, M. The Effect of Air Pollution on Mortality in China: Evidence from the 2008 Beijing Olympic Games. J. Environ. Econ. Manag. 2016, 79, 18–39.
4. Costs of Air Pollution from European Industrial Facilities 2008–2012. Available online: https://www.eea.europa.eu/en/analysis/publications/costs-of-air-pollution-2008-2012 (accessed on 22 June 2025).
5. Heal, M.R.; Kumar, P.; Harrison, R.M. Particles, Air Quality, Policy and Health. Chem. Soc. Rev. 2012, 41, 6606–6630.
6. Rickenbacker, H.; Brown, F.; Bilec, M. Creating Environmental Consciousness in Underserved Communities: Implementation and Outcomes of Community-Based Environmental Justice and Air Pollution Research. Sustain. Cities Soc. 2019, 47, 101473.
7. Williams, R.; Kilaru, V.; Snyder, E.; Kaufman, A.; Dye, T.; Rutter, A.; Russell, A.; Hafner, H. Air Sensor Guidebook; US Environmental Protection Agency: Washington, DC, USA, 2014.
8. US EPA. NAAQS Table. Available online: https://www.epa.gov/criteria-air-pollutants/naaqs-table (accessed on 22 June 2025).
9. Technical Assistance Document for the Reporting of Daily Air Quality|AirNow.Gov. Available online: https://www.airnow.gov/publications/air-quality-index/technical-assistance-document-for-reporting-the-daily-aqi/ (accessed on 17 May 2025).
10. Concas, F.; Mineraud, J.; Lagerspetz, E.; Varjonen, S.; Liu, X.; Puolamäki, K.; Nurmi, P.; Tarkoma, S. Low-Cost Outdoor Air Quality Monitoring and Sensor Calibration: A Survey and Critical Analysis. ACM Trans. Sens. Netw. 2021, 17, 20.
11. Irawan, Y.; Wahyuni, R.; Fonda, H.; Hamzah, M.L.; Muzawi, R. Real Time System Monitoring and Analysis-Based Internet of Things (IoT) Technology in Measuring Outdoor Air Quality. Int. J. Interact. Mob. Technol. 2021, 15.
12. Kang, J.; Hwang, K.-I. A Comprehensive Real-Time Indoor Air-Quality Level Indicator. Sustainability 2016, 8, 881.
13. Kim, D.Y.; Kwoun, J.; Lee, T.J. Development of Indoor Air Quality Index for Vulnerable Group-Use Facilities. SSRN Electron. J. 2022.
14. Al-Zaheiree, A.A.; AL-Zubaidi, Y.A.T.; Gaikwad, S.S.; Kamat, R.K. Advanced Air Pollution Detection Using IOT and Raspberry PI. Int. J. Recent Technol. Eng. 2020, 8, 5390–5394.
15. Kamble, M.K.T.; Khatake, M.A.V.; Ghadyalji, M.A.C.; Chounde, A.B. IOT Based Air Pollution Monitoring System Using Raspberry Pi. Int. J. Adv. Res. Sci. Commun. Technol. 2022, 2, 332–336.
16. Li, H.; Guo, H.; Zhai, Z.J. Evaluating the Effectiveness of Ventilation Strategies in Mitigating Short-Term and High-Concentration Indoor Pollutants within a Typical Apartment Building. Build. Environ. 2025, 270, 112520.
17. Christakis, I.; Sarri, E.; Tsakiridis, O.; Moutzouris, K.; Triantis, D.; Stavrakas, I. Integrated Open Source Indoor Air Quality Monitoring Platform. In Proceedings of the 2024 9th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM), Egaleo, Greece, 20–22 September 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 183–188.
18. Sung, W.-T.; Hsiao, S.-J. Building an Indoor Air Quality Monitoring System Based on the Architecture of the Internet of Things. J. Wirel. Commun. Netw. 2021, 2021, 153.
19. Samad, A.; Kieser, J.; Chourdakis, I.; Vogt, U. Developing a Cloud-Based Air Quality Monitoring Platform Using Low-Cost Sensors. Sensors 2024, 24, 945.
20. Kumar, S.; Jasuja, A. Air Quality Monitoring System Based on IoT Using Raspberry Pi. In Proceedings of the 2017 International Conference on Computing, Communication and Automation (ICCCA), Greater Noida, India, 5–6 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1341–1346.
21. Jabbar, W.A.; Subramaniam, T.; Ong, A.E.; Shu’Ib, M.I.; Wu, W.; De Oliveira, M.A. LoRaWAN-Based IoT System Implementation for Long-Range Outdoor Air Quality Monitoring. Internet Things 2022, 19, 100540.
22. Deepa, S.; Mohan, A.D.; Nikil, A.; Rajeshprabha, R. Smart Air Purifier and AQI Monitoring System. In Proceedings of the 2024 International Conference on IoT, Communication and Automation Technology (ICICAT), Gorakhpur, India, 23–24 November 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 192–195.
23. Talati, I.; Shah, K.J.; Patel, O.; Tanna, I.; Iain, A.; Oza, A.D.; Yadav, A.A.; Alshayeb, M.I.; Khan, M.A.; Islam, S. Study of AQI Monitoring System of Indoor Environment Using Machine Learning Model and IoT Device. Rocz. Ochr. Środowiska 2025, 27, 152–163.
24. Mohammed, S.K.; Kamruzzaman, S.M.; Ahmed, A.; Hoque, A.; Shabnam, F. Design and Implementation of an Iot Based Forest Environment Monitoring System. In Proceedings of the 2019 IEEE 5th International Conference on Computer and Communications (ICCC), Chengdu, China, 6–9 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 2152–2156.
25. Zhang, H.; Srinivasan, R. A Systematic Review of Air Quality Sensors, Guidelines, and Measurement Studies for Indoor Air Quality Management. Sustainability 2020, 12, 9045.
26. Karagulian, F.; Barbiere, M.; Kotsev, A.; Spinelle, L.; Gerboles, M.; Lagler, F.; Redon, N.; Crunaire, S.; Borowiak, A. Review of the Performance of Low-Cost Sensors for Air Quality Monitoring. Atmosphere 2019, 10, 506.
27. Saini, J.; Dutta, M.; Marques, G. Sensors for Indoor Air Quality Monitoring and Assessment through Internet of Things: A Systematic Review. Environ. Monit. Assess. 2021, 193, 66.
28. Aini, Q.; Rahardja, U.; Manongga, D.; Sembiring, I.; Hardini, M.; Agustian, H. Iot-Based Indoor Air Quality Using Esp32. In Proceedings of the 2022 IEEE Creative Communication and Innovative Technology (ICCIT), Tangerang, Indonesia, 22–23 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5.
29. Castelli, M.; Clemente, F.M.; Popovič, A.; Silva, S.; Vanneschi, L. A Machine Learning Approach to Predict Air Quality in California. Complexity 2020, 2020, 8049504.
30. Bhattacharya, S.; Shahnawaz, S. Using Machine Learning to Predict Air Quality Index in New Delhi. arXiv 2021, arXiv:2112.05753.
31. Liu, H.; Li, Q.; Yu, D.; Gu, Y. Air Quality Index and Air Pollutant Concentration Prediction Based on Machine Learning Algorithms. Appl. Sci. 2019, 9, 4069.
32. Alzu’bi, F.; Al-Rawabdeh, A.; Almagbile, A. Predicting Air Quality Using Random Forest: A Case Study in Amman-Zarqa. Egypt. J. Remote Sens. Space Sci. 2024, 27, 604–613.
33. Liang, Y.-C.; Maimury, Y.; Chen, A.H.-L.; Juarez, J.R.C. Machine Learning-Based Prediction of Air Quality. Appl. Sci. 2020, 10, 9151.
34. Farooq, O.; Shahid, M.; Arshad, S.; Altaf, A.; Iqbal, F.; Vera, Y.A.M.; Flores, M.A.L.; Ashraf, I. An Enhanced Approach for Predicting Air Pollution Using Quantum Support Vector Machine. Sci. Rep. 2024, 14, 19521.
35. Janarthanan, R.; Partheeban, P.; Somasundaram, K.; Navin Elamparithi, P. A Deep Learning Approach for Prediction of Air Quality Index in a Metropolitan City. Sustain. Cities Soc. 2021, 67, 102720.
36. Drewil, G.I.; Al-Bahadili, R.J. Air Pollution Prediction Using LSTM Deep Learning and Metaheuristics Algorithms. Meas. Sens. 2022, 24, 100546.
37. Wu, Q.; Lin, H. A Novel Optimal-Hybrid Model for Daily Air Quality Index Prediction Considering Air Pollutant Factors. Sci. Total Environ. 2019, 683, 808–821.
38. Rollo, F.; Bachechi, C.; Po, L. Anomaly Detection and Repairing for Improving Air Quality Monitoring. Sensors 2023, 23, 640.
39. Ketu, S.; Mishra, P.K. Scalable Kernel-Based SVM Classification Algorithm on Imbalance Air Quality Data for Proficient Healthcare. Complex Intell. Syst. 2021, 7, 2597–2615.
40. Samartha, S.; Prateek, M.G.; Ruchi, A.; Pallavi, B. Air Quality Monitoring and Forecasting Using IoT and ML: A Survey on Methodologies and Challenges. Int. J. Nov. Res. Dev. 2023, 8, b450–b455.
41. Neo, E.X.; Hasikin, K.; Lai, K.W.; Mokhtar, M.I.; Azizan, M.M.; Hizaddin, H.F.; Razak, S.A. Yanto Artificial Intelligence-Assisted Air Quality Monitoring for Smart City Management. PeerJ Comput. Sci. 2023, 9, e1306.
42. Yang, R.; Yuan, Z.; Jiang, C.; Zhang, X.; Qiao, Z.; Zhang, J.; Liang, J.; Wang, S.; Duan, Z.; Wu, Y.; et al. Ultrafast Hydrogen Detection System Using Vertical Thermal Conduction Structure and Neural Network Prediction Algorithm Based on Sensor Response Process. ACS Sens. 2025, 10, 2181–2190.
43. Zong, B.; Wu, S.; Yang, Y.; Li, Q.; Tao, T.; Mao, S. Smart Gas Sensors: Recent Developments and Future Prospective. Nano-Micro Lett. 2025, 17, 54.
44. DFRobot DHT11 Humidity & Temperature Sensor. Available online: https://www.mouser.com/datasheet/2/758/DHT11-Technical-Data-Sheet-Translated-Version-1143054.pdf (accessed on 22 June 2025).
45. Hanwei Electronics Co., Ltd. MQ-7 Gas Sensor Technical Data. Available online: https://cdn.sparkfun.com/assets/b/b/b/3/4/MQ-7.pdf (accessed on 22 June 2025).
46. ScioSense ENS160 Digital Metal-Oxide Multi-Gas Sensor. Available online: https://dfimg.dfrobot.com/nobody/wiki/cbe10f01b67c3fee6d365039eb54f52c.pdf (accessed on 22 June 2025).
47. Sharp Application Note of Sharp Dust Sensor GP2Y1010AU0F. Available online: https://www.socle-tech.com/SHARP_sensor_Dust%20Sensor.php (accessed on 22 June 2025).
48. Yan, Y.; Li, Y.; Sun, M.; Wu, Z. Primary Pollutants and Air Quality Analysis for Urban Air in China: Evidence from Shanghai. Sustainability 2019, 11, 2319.
49. Chen, J.; Li, C.; Ristovski, Z.; Milic, A.; Gu, Y.; Islam, M.S.; Wang, S.; Hao, J.; Zhang, H.; He, C. A Review of Biomass Burning: Emissions and Impacts on Air Quality, Health and Climate in China. Sci. Total Environ. 2017, 579, 1000–1034.
50. Pamonpol, K.; Areerob, T.; Prueksakorn, K. Indoor Air Quality Improvement by Simple Ventilated Practice and Sansevieria Trifasciata. Atmosphere 2020, 11, 271.
51. Tran, V.V.; Park, D.; Lee, Y.-C. Indoor Air Pollution, Related Human Diseases, and Recent Trends in the Control and Improvement of Indoor Air Quality. Int. J. Environ. Res. Public Health 2020, 17, 2927.
52. Yang, X.; Yang, L.; Zhang, J. A WiFi-Enabled Indoor Air Quality Monitoring and Control System: The Design and Control Experiments. In Proceedings of the 2017 13th IEEE International Conference on Control & Automation (ICCA), Ohrid, Macedonia, 3–6 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 927–932.
53. Lachhab, F.; Bakhouya, M.; Ouladsine, R.; Essaaidi, M. Context-Driven Monitoring and Control of Buildings Ventilation Systems Using Big Data and Internet of Things–Based Technologies. Proc. Inst. Mech. Eng. Part I J. Syst. Control Eng. 2019, 233, 276–288.
54. Taştan, M.; Gökozan, H. Real-Time Monitoring of Indoor Air Quality with Internet of Things-Based E-Nose. Appl. Sci. 2019, 9, 3435.
55. Martín-Garín, A.; Millán-García, J.A.; Baïri, A.; Millán-Medel, J.; Sala-Lizarraga, J.M. Environmental Monitoring System Based on an Open Source Platform and the Internet of Things for a Building Energy Retrofit. Autom. Constr. 2018, 87, 201–214.
56. EN 15267-1:2007; Air Quality - Certification of Automated Measuring Systems - Part 1: General Principles. European Committee for Standardization (CEN): Brussels, Belgium, 2007.
57. Umweltbundesamt. Available online: https://www.umweltbundesamt.de (accessed on 22 June 2025).
58. Winkler, N.P.; Neumann, P.P.; Schaffernicht, E.; Lilienthal, A.; Poikkimäki, M.; Kangas, A.; Säämänen, A. Gather Dust and Get Dusted: Long-Term Drift and Cleaning of Sharp GP2Y1010AU0F Dust Sensor in a Steel Factory. In Proceedings of the 38th Danubia-Adria Symposium on Advances in Experimental Mechanics, Poros Island, Greece, 20–23 September 2022.
59. Kobbekaduwa, N.; Oruthota, P.; De Mel, W.R. Calibration and Implementation of Heat Cycle Requirement of MQ-7 Semiconductor Sensor for Detection of Carbon Monoxide Concentrations. 2021. Available online: http://dr.lib.sjp.ac.lk/handle/123456789/10401 (accessed on 22 June 2025).
60. Pietraru, R.N.; Nicolae, M.; Mocanu, Ș.; Merezeanu, D.-M. Easy-to-Use MOX-Based VOC Sensors for Efficient Indoor Air Quality Monitoring. Sensors 2024, 24, 2501.
61. Kurnia, D.; Hadisantoso, F.S.; Suprianto, A.A.; Nugroho, E.A.; Janizal, J. Real-Time Air Quality Index Monitoring Experiments Using SDS011 Sensors and Raspberry Pi. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1098, 042090.
62. Das, S.; Singh, K.; Kaur, K. Air Quality Prediction in Beijing: Machine and Deep Learning Analysis. In ITM Web of Conferences; EDP Sciences: Les Ulis, France, 2024; Volume 68, p. 01012.
63. Zhou, H.; Wang, T.; Zhao, H.; Wang, Z. Updated Prediction of Air Quality Based on Kalman-Attention-LSTM Network. Sustainability 2022, 15, 356.
64. Ulpiani, G.; Duhirwe, P.N.; Yun, G.Y.; Lipson, M.J. Meteorological Influence on Forecasting Urban Pollutants: Long-Term Predictability versus Extreme Events in a Spatially Heterogeneous Urban Ecosystem. Sci. Total Environ. 2022, 814, 152537.

Figure 1. System architecture of the air quality monitoring and forecasting system; (a) the hardware system; (b) hardware elements and connections; (c) overall pipeline.

Figure 2. The structure design of the air quality monitoring and forecasting system; (a) general view; (b) part list and exploded view; (c) engineering drawing.

Figure 3. Prototype of the air quality monitoring and forecasting device.

Figure 4. GUI of environmental air quality monitoring system.

Figure 5. Calibration curve for the MQ-7 CO sensor against the EN 15267-certified TSM-17 reference station. Red symbols and the dashed regression line indicate pre-calibration values (slope ≈ 0.76, $R^{2}$ ≈ 0.98), while blue symbols and the solid regression line show post-calibration values (slope ≈ 1.01, $R^{2}$ ≈ 0.996). The results confirm that the applied two-point calibration significantly improved accuracy and reduced bias.
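The two-point calibration named in this caption can be illustrated with a minimal sketch. It assumes the usual gain/offset form, in which two paired (raw, reference) readings fix a straight line; the function name `two_point_calibration` and the anchor values below are hypothetical and are not taken from the study.

```python
from typing import Callable


def two_point_calibration(raw_lo: float, ref_lo: float,
                          raw_hi: float, ref_hi: float) -> Callable[[float], float]:
    """Return a function that maps raw sensor readings onto the reference scale."""
    if raw_hi == raw_lo:
        raise ValueError("the two calibration points need distinct raw readings")
    # Gain rescales the sensor slope; offset removes the remaining constant bias.
    gain = (ref_hi - ref_lo) / (raw_hi - raw_lo)
    offset = ref_lo - gain * raw_lo

    def correct(raw: float) -> float:
        return gain * raw + offset

    return correct


# Hypothetical anchor points in ppm (illustrative only, not measured data):
# the sensor reads 1.9 when the reference reports 2.5, and 7.6 when it reports 10.0.
calibrate_co = two_point_calibration(raw_lo=1.9, ref_lo=2.5, raw_hi=7.6, ref_hi=10.0)
print(calibrate_co(3.8))
```

With these illustrative anchors the raw sensor under-reads by a constant factor of 0.76, so the correction only needs a gain; a nonzero offset appears as soon as the two anchors do not lie on a line through the origin.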

Figure 6. Calibration curve for the GP2Y10 PM2.5 sensor against the EN 15267-certified TSM-17 reference station. Pre-calibration data (red scatter, dashed line) show underestimation with slope ≈ 0.70 and $R^{2}$ ≈ 0.95. Post-calibration data (blue scatter, solid line) closely align with the reference station (slope ≈ 1.02, $R^{2}$ ≈ 0.995), demonstrating effective correction of systematic error.

Figure 7. Actual vs. predicted AQI values for four models at the 60 min forecast horizon. Scatter plots show the fitted regression line (solid line) and the 1:1 reference (dashed line). Reported slopes (β) and $R^{2}$ are annotated in each panel; (a) AdaBoost.R2; (b) GRU; (c) LSTM; (d) Stacking RF+SVR+XGB. Predictions represent continuous regression outputs of AQI prior to discretization into US-EPA health-based categories.

Figure 8. Time-series comparison of actual and predicted AQI over a 7-day window (60 min forecast horizon) for four models. The recurrent models capture diurnal patterns more faithfully, while ensemble models show attenuated peaks and slight lag. Slopes (β), intercepts (α) and R² values computed against the ground truth are annotated in each panel; (a) AdaBoost.R2; (b) GRU; (c) LSTM; (d) Stacking RF+SVR+XGB. Forecast traces reflect regression-based AQI values, which were subsequently discretized into US-EPA health-based categories for classification evaluation.

Figure 9. Confusion matrix of GRU model predictions for AQI categories. The strong diagonal pattern demonstrates high accuracy across all classes, with only limited confusion between adjacent categories. Classification performance is based on discretized regression outputs, with categories defined according to US-EPA health-based breakpoints.

Table 1. Pollutant-specific sub-indices and cautionary statements [9]. These breakpoint intervals form the basis of the method used in this study, in which continuous pollutant concentrations were first predicted by regression and then discretized into categorical health-based AQI levels.

PM2.5 (μg/m³) 24-h	CO (ppm) 8-h	AQI	Category
0.0–9.0	0.0–4.4	0–50	Good
9.1–35.4	4.5–9.4	51–100	Moderate
35.5–55.4	9.5–12.4	101–150	Unhealthy for Sensitive Groups
(55.5–125.4)³	12.5–15.4	151–200	Unhealthy
(125.5–225.4)³	15.5–30.4	201–300	Very unhealthy
225.5+	30.5+	301+	Hazardous
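To make the regression-then-discretization step concrete, the sketch below turns a concentration into a sub-index with the standard US-EPA piecewise-linear interpolation, I = (I_hi − I_lo) / (C_hi − C_lo) × (C − C_lo) + I_lo, using the breakpoints of Table 1. Several choices are assumptions of this sketch rather than details confirmed by the source: inputs are taken to be already averaged over the stated windows (24-h for PM2.5, 8-h for CO), concentrations are truncated to one decimal, the overall AQI is the larger of the two sub-indices, and the open-ended top row simply returns 301. All function names are hypothetical.

```python
import math

# Rows copied from Table 1: (C_lo, C_hi, I_lo, I_hi, category).
PM25_ROWS = [
    (0.0, 9.0, 0, 50, "Good"),
    (9.1, 35.4, 51, 100, "Moderate"),
    (35.5, 55.4, 101, 150, "Unhealthy for Sensitive Groups"),
    (55.5, 125.4, 151, 200, "Unhealthy"),
    (125.5, 225.4, 201, 300, "Very unhealthy"),
]
CO_ROWS = [
    (0.0, 4.4, 0, 50, "Good"),
    (4.5, 9.4, 51, 100, "Moderate"),
    (9.5, 12.4, 101, 150, "Unhealthy for Sensitive Groups"),
    (12.5, 15.4, 151, 200, "Unhealthy"),
    (15.5, 30.4, 201, 300, "Very unhealthy"),
]


def sub_index(conc: float, rows) -> int:
    """Linear interpolation inside the breakpoint row that contains conc."""
    if conc < 0:
        raise ValueError("concentration must be non-negative")
    # Truncate to one decimal so values never fall between two rows
    # (the small epsilon guards against floating-point representation error).
    c = math.floor(conc * 10 + 1e-9) / 10
    for c_lo, c_hi, i_lo, i_hi, _ in rows:
        if c_lo <= c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    return 301  # assumption: open-ended "Hazardous" row is reported at its lower bound


def category(aqi: int) -> str:
    """Map an AQI value onto the health-based category of Table 1."""
    for _, _, _, i_hi, name in PM25_ROWS:
        if aqi <= i_hi:
            return name
    return "Hazardous"


def overall_aqi(pm25: float, co: float):
    """Overall AQI as the worst (largest) pollutant sub-index."""
    aqi = max(sub_index(pm25, PM25_ROWS), sub_index(co, CO_ROWS))
    return aqi, category(aqi)


print(overall_aqi(pm25=5.0, co=10.0))
```

In this usage example CO is the dominant pollutant, so the category is driven by the CO sub-index even though PM2.5 alone would be rated "Good".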

Table 2. Sensors and manufacturer-specified ranges. These sensors were integrated into the prototype device, calibrated against a certified reference station and used as the basis for AQI computation according to US-EPA breakpoints, with VOC/eCO₂ serving only as advisory indicators.

Sensor	Parameters	Nominal Range	Operating Conditions
DHT11 [44]	Temp, Hum	Temp = 0–50 °C; Hum = 20–90% RH	Temp = 0 to 50 °C; Hum = 20% to 90%
MQ-7 [45]	CO	20–2000 ppm	Temp = −20 °C to 50 °C; Hum = <95% RH
Gravity ENS160 [46]	TVOCs, eCO₂	TVOCs = 0–65,000 ppb; eCO₂ = 400–65,000 ppm	Temp = −40 °C to 85 °C; Hum = 5% to 95%
GP2Y10 Dust Sensor [47]	PM2.5	0–600 µg/m³	Temp = −10 °C to +65 °C

Table 3. Machine learning models and hyper-parameter search space for short-term AQI prediction.

Model	Rationale	Key Hyper-Parameters
Stacking Ensemble (RF + SVR + XGB → LR meta)	Combines heterogeneous inductive biases; robust to non-stationarity	n estimators = 50–300; learning-rate = 0.05–1
AdaBoost.R2	Excels at bias reduction on noisy short horizons	n estimators = 50–300; learning-rate = 0.05–1
LSTM	Captures temporal dependencies without manual lags	1–2 layers × 32–64 cells; dropout = 0.2
GRU	Fewer parameters, comparable accuracy to LSTM	Same grid as LSTM

Table 4. Summary of features and target label used in the machine learning models.

Feature Type	Features	Description
Pollutant concentrations	PM2.5, CO, CO₂, TVOC	Hourly mean pollutant levels measured by sensors
Meteorological variables	Temperature, Relative Humidity	Hourly averages recorded by DHT11 sensor
Temporal flags	Hour-of-day, Day-of-week	One-hot encoded categorical features
Derived statistics	Lag terms (t − 1, t − 3, t − 6), rolling mean, variance	Capture short-term temporal persistence and variability
Target label	AQI Category (Good, Moderate, USG, Unhealthy, Very Unhealthy/Hazardous)	Derived from synchronous EN 15267-certified TSM-17 reference station data using CPCB/US-EPA breakpoints
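The derived statistics and temporal flags listed in Table 4 could be assembled roughly as follows. This is a sketch under stated assumptions, not the authors' pipeline: the rolling window length (3 samples), the use of population variance, the restriction of the rolling window to samples strictly before t, and the names `build_features`, `one_hot`, `roll_mean` and `roll_var` are all hypothetical.

```python
from datetime import datetime
from statistics import fmean, pvariance

LAGS = (1, 3, 6)  # lag terms t-1, t-3, t-6 from Table 4


def one_hot(index: int, size: int) -> list:
    """One-hot vector with a single 1 at the given position."""
    return [1 if i == index else 0 for i in range(size)]


def build_features(values, stamps, window=3):
    """Build one feature row per time step t that has enough history.

    values: hourly pollutant means; stamps: matching datetime objects.
    """
    rows = []
    start = max(max(LAGS), window)
    for t in range(start, len(values)):
        past = values[t - window:t]  # only samples before t, so nothing from t leaks in
        row = {f"lag_{k}": values[t - k] for k in LAGS}
        row["roll_mean"] = fmean(past)
        row["roll_var"] = pvariance(past)
        row["hour"] = one_hot(stamps[t].hour, 24)
        row["weekday"] = one_hot(stamps[t].weekday(), 7)
        rows.append(row)
    return rows


# Toy series: ten hourly values on Monday, 6 January 2025 (illustrative only).
demo_values = [float(v) for v in range(10)]
demo_stamps = [datetime(2025, 1, 6, h) for h in range(10)]
demo_rows = build_features(demo_values, demo_stamps)
print(len(demo_rows), demo_rows[0]["lag_6"], demo_rows[0]["roll_mean"])
```

The first six time steps are dropped because the longest lag needs six earlier samples; a real pipeline would apply the same construction to each pollutant and meteorological channel.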

Table 5. Regression performance of different models for AQI forecasting at 30 min and 60 min horizons. Reported error metrics are expressed in AQI index units (dimensionless) rather than pollutant concentration (µg/m³). Continuous AQI values were first predicted via regression and subsequently discretized into US-EPA health-based categories, following the regression → discretization strategy adopted in this study.

Model	MAE (+30 min)	RMSE (+30 min)	$R^{2}$ (+30 min)	MAD (+30 min)	MAE (+60 min)	RMSE (+60 min)	$R^{2}$ (+60 min)	MAD (+60 min)
Stacking (RF + SVR + XGB)	5.8	8.5	0.93	7.2	10.9	15.1	0.78	13.8
AdaBoost.R2	6.1	9.0	0.91	7.5	11.6	15.9	0.75	14.5
LSTM	5.7	8.4	0.93	7.0	9.9	13.8	0.82	12.6
GRU	5.6	8.2	0.94	6.9	10.2	14.3	0.81	12.9
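For reference, three of the error metrics in Table 5 follow directly from their textbook definitions, as in the sketch below. MAD is left out on purpose because this section does not state which definition the authors used; the function name `regression_metrics` and the toy AQI values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """MAE, RMSE and R^2 between two equal-length sequences of AQI values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Toy values in AQI index units (illustrative only, not data from the study).
demo_metrics = regression_metrics([50, 60, 70, 80], [52, 58, 73, 79])
print(demo_metrics)
```

Because RMSE squares each error before averaging, it is never smaller than MAE, which is consistent with every row of Table 5.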

Table 6. Classification performance of models for health-based AQI categories. Continuous AQI values were first predicted via regression and subsequently discretized into categories defined by CPCB/US-EPA breakpoints. Reported metrics therefore reflect the classification accuracy of these discretized regression outputs.

AQI Category	Precision	Recall	F1-Score
Good	0.93	0.96	0.94
Moderate	0.89	0.84	0.86
Unhealthy for Sensitive Groups	0.82	0.79	0.80
Unhealthy	0.87	0.71	0.78
Very unhealthy	0.90	0.65	0.75
Macro avg	0.88	0.79	0.83
Weighted avg	0.91	0.91	0.90
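The per-class scores in Table 6 can be computed from paired true and predicted category labels as sketched below; the helper names are hypothetical. The last lines take the five per-class rows of Table 6 and show that their unweighted mean matches the "Macro avg" row after rounding to two decimals.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of precision, recall and F1 across classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Per-class rows copied from Table 6: (precision, recall, F1-score).
TABLE_6 = {
    "Good": (0.93, 0.96, 0.94),
    "Moderate": (0.89, 0.84, 0.86),
    "Unhealthy for Sensitive Groups": (0.82, 0.79, 0.80),
    "Unhealthy": (0.87, 0.71, 0.78),
    "Very unhealthy": (0.90, 0.65, 0.75),
}
table_6_macro = tuple(round(v, 2) for v in macro_average(TABLE_6))
print(table_6_macro)
```

The macro average weights every category equally, so the lower recall of the rarer "Unhealthy" and "Very unhealthy" classes pulls it below the weighted average, which is dominated by the frequent "Good" class.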

Table 7. Pollutant-specific sub-indices and cautionary statements [57]. These indices, adapted from UBA-style indoor air quality labeling, were used only as advisory indicators and were not included in the US-EPA AQI calculation.

TVOCs (ppb)	eCO₂/CO₂ (ppm)	AQI	Category
<50	400–600	1	Excellent
<50	600–800	2	Good
50–750	800–1000	3	Moderate
750–6000	1000–1500	4	Poor
>6000	>1500	5	Unhealthy

Table 8. Comparative 1 h AQI forecasting accuracy reported in recent studies.

Study	Hardware	Horizon (h)	RMSE (µg/m³)
Das et al., 2024 [62]	Fixed-station, RF	1	15.2
Zhou et al., 2023 [63]	LSTM-attention	1	12.7
This work (LSTM)	Low-cost mobile node	1	13.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
