Article

Towards Sustainable Urban Mobility: Leveraging Machine Learning Methods for QA of Meteorological Measurements in the Urban Area

Department of Military Geography and Meteorology, Faculty of Military Technology, University of Defence, 662 10 Brno, Czech Republic
*
Author to whom correspondence should be addressed.
Sustainability 2024, 16(13), 5713; https://doi.org/10.3390/su16135713
Submission received: 30 April 2024 / Revised: 5 June 2024 / Accepted: 2 July 2024 / Published: 4 July 2024

Abstract

Non-professional measurement networks offer vast data sources within urban areas that could significantly contribute to urban environment mapping and improve weather prediction in the cities. However, their full potential remains unused due to uncertainties surrounding their positioning, measurement quality, and reliability. This study investigates the potential of machine learning (ML) methods serving as a parallel quality control system, using data from amateur and professional weather stations in Brno, Czech Republic. The research aims to establish a quality control framework for measurement accuracy and assess ML methods for measurement labelling. Utilizing global model data as its main feature, the study examines the effectiveness of ML models in predicting temperature and wind speed, highlighting the challenges and limitations of utilizing such data. Results indicate that while ML models can effectively predict temperature with minimal computational demands, predicting wind speed presents greater complexity due to the higher spatial variability. Hyperparameter tuning does not significantly influence model performance, with changes primarily driven by feature engineering. Despite the improved performance observed in certain models and stations, no model demonstrates superiority in capturing changes not readily apparent in the data. The proposed ensemble approach, coupled with a control ML classification model, offers a potential solution for assessing station quality and enhancing prediction accuracy. However, challenges remain in evaluating individual steps and addressing limitations such as the use of global models and basic feature encoding. Future research aims to apply these methods to larger datasets and automate the evaluation process for scalability and efficiency to enhance monitoring capabilities in urban areas.

1. Introduction

In the contemporary era of meteorological research, the utilization of unconventional data sources, including amateur weather stations and crowdsourced information, has emerged as a pivotal area of investigation. This study is poised at the intersection of machine learning (ML) methodologies and the estimation of unreliable measurements from amateur stations, specifically focusing on the domains of air temperature and wind speed. The GFS and ECMWF global models serve as essential prediction data for ML modelling.
Our study aims to explore and utilize machine learning methods to control the quality of non-professional weather measurements, particularly focusing on the estimation of air temperature and wind from the non-professional stations within the Brno area. The main objective of this research is to leverage machine learning techniques to improve the accuracy and reliability of weather data collected from non-professional weather stations. By doing so, we aim to establish a comprehensive urban weather observation and nowcasting system that can provide reliable and error-free weather information for a range of applications. Acknowledging the limitations of the input data, we aim to develop methods that can detect and label abnormal observations, fill in missing values, and ensure the temporal and spatial consistency of the measured weather data.
The paradigm shift towards incorporating data from non-traditional sources is evident in the growing interest in citizen science, amateur weather stations, and the broader context of crowdsourcing [1,2,3]. Crowdsourcing techniques have become increasingly vital, especially in densely populated areas or regions lacking comprehensive meteorological networks [3,4,5,6,7,8,9]. As urbanization continues to rise, the need for diverse, high-resolution data sources becomes imperative.
Amateur meteorological stations, often operated on semi-professional levels, contribute valuable data that, while abundant, may not conform to established WMO standards [10]. The challenges lie in the heterogeneity of measuring conditions (e.g., position of the station) and variations in instrument quality [11,12,13]. A notable increase in the number of stations, particularly in proximity to population centers, can be attributed to the development and affordability of mobile applications [14]. These developments align with the principles of crowdsourcing, where individuals actively participate in data collection and dissemination.
Addressing the unique characteristics of amateur stations, quality control becomes a crucial phase in data processing. The identification of missing data periods, error detection in measurements, and subsequent series harmonization are essential steps [6,15]. The wide scope of climatological analyses necessitates diverse statistical tests and instruments [15,16,17]. The significance of this phase is underscored by the need to supplement missing or erroneous values, ensuring the completeness of records.
Temperature itself is, of course, at the forefront of scientific interest, whether in global surface temperature predicted by empirical methods [18], research on global-scale temperature correlations [19], or methods for predicting global temperature within climate models [20].
To address the complexity of the statistical apparatus, a research team from Google DeepMind developed a hybrid approach combining machine learning (ML) systems with global models [21]. This methodology enables the thorough evaluation and interpretation of the 0.25° resolution global models used in this work. Previous work from the Savannah River National Laboratory in South Carolina assessed the accuracy of the Global Forecast System (GFS) and identified surface errors [22]. The analysis revealed that significant errors are primarily linked to mountainous regions and areas with steep topography, with additional influence from land–sea contrasts. Decomposing the error into bias, variance, and correlation components, it was observed that initial errors for 0 h forecasts were primarily driven by a substantial forecast bias, persisting across longer forecast horizons, alongside notable errors in forecast correlation.
The feasibility of using global climate models to assess the quality of global observational datasets was tested by Massonnet et al. [23]. The study compares four observational datasets of sea surface temperature with a comprehensive multimodel climate forecast ensemble. The analysis reveals evidence that models consistently perform better when evaluated against the most recent, advanced, and independent observational dataset. The findings underscore the need for standardized procedures in comparing models with observations.
Overall, machine learning techniques can help to accurately predict temperatures based on a set of input features that may include, but are not limited to, previous values of temperature, relative humidity, solar radiation, rain, and wind speed measurements. The review shows that deep learning strategies exhibit smaller errors. Globally, Support Vector Machines are preferred based on a good trade-off between simplicity and accuracy. In addition, the accuracy of the methods described in this paper was found to depend on the combination of inputs, architecture, and learning algorithms [24].
The nature of the phenomenon itself makes it an interesting situation for wind forecasting. It is essential for numerous services and safety and has significantly improved in accuracy due to machine learning advancements [25]. Most works employed neural networks, focusing recently on deep learning models [26,27]. Among the reported performance metrics, the most prevalent were mean absolute error, mean squared error, and mean absolute percentage error. The results underscore the novel effectiveness of machine learning in predicting wind conditions using high-resolution time data and demonstrate that deep learning models surpass traditional methods, improving the accuracy of wind speed and direction forecasts. Moreover, it was found that the inclusion of non-wind weather variables does not benefit the model’s overall performance. Further studies are recommended to predict both wind speed and direction using diverse spatial data points, and high-resolution data are recommended along with the use of deep learning models.
An interesting modification that has been suitably applied to the wind energy sector is the attempt to predict the wind speed prediction error [28,29]. The idea of the work is to use one model, e.g., ARIMA and then apply another model, e.g., Random Forest [28], to estimate the error of the model. Also, the authors of the study presenting the wind farms case study [30] suggest using combinations or hierarchy of the models to predict and evaluate wind fields.
However, the complexity of the urban environment is well captured by work on building-scale modelling [31]. The authors use advanced CFD modelling and thermodynamic computation methods and identify the complexity of modelling meteorological features in urban environments with realistic surface properties. Modelling with the Partial Least Squares Regression (PLSR) algorithm in urban areas has already been tested on the urbanization of Yerevan, Armenia [32]. The authors recommend that further development should include the incorporation of additional weather parameters from weather stations, such as precipitation and wind speed, as well as the use of non-parametric ML techniques.
Our research represents the intersection of two growing trends in meteorology: the use of unconventional data sources like citizen science and amateur weather stations, and the application of machine learning for weather data analysis.
While traditional quality control methods rely on statistical tests, our work explores the potential of machine learning models to improve the accuracy and reliability of data collected from non-professional stations, particularly focusing on air temperature and wind speed in the Brno urban area. This approach complements existing research on quality control for crowdsourced weather data while also contributing to the broader field of applying machine learning to urban weather monitoring and nowcasting.
Our work deals exclusively with data from the urban area of Brno, in the Czech Republic, where a number of studies have already attempted to capture heat island properties [33,34] or even the quality of amateur observations themselves [16,17].
While high-resolution local models offer advantages in capturing fine-grained meteorological variations, this study explores the feasibility of using a globally applicable approach for quality assurance (QA) of amateur weather station data in urban environments. Leveraging forecast data from the GFS and ECMWF models offers wider applicability than localized models, and these global datasets are readily available for various regions.
Our research questions are designed to investigate the effectiveness of this approach:
  • Effectiveness of ML models with global data: How effectively do machine learning (ML) models predict temperature and wind speed at amateur stations using forecast data from GFS and ECMWF models?
  • Model Performance and Biases: Which ML models demonstrate superior performance in this context, and are there any notable biases exhibited by specific models?
  • Model Behavior under Different Conditions: Under what specific conditions do certain models excel or struggle compared to others?
  • Predictor Importance: What are the key factors (predictors) influencing the prediction of measured values by the models?
By addressing these questions, we aim to validate the concept that a robust predictive system, built upon a globally applicable dataset, can serve as an effective tool for detecting anomalies in amateur weather station data. These anomalies might indicate sensor calibration issues or shifts in ambient conditions.
The following sections will focus on the details of our research methodology. We will discuss the data sources, including both observations and predictions, and how they are framed within a data science framework (targets and features). We will explore various machine learning methods employed in the study, including linear regression, Bayesian Ridge, Random Forest, and Gradient Boosting. A subset of these models will be chosen for further fine-tuning and detailed analysis. The results section will present both fundamental accuracy metrics and the importance of individual predictors within the chosen models. We will compare the performance of calibrated models to assess the impact of hyperparameter tuning. Additionally, we will explore the capabilities of ensemble evaluation for detecting model inaccuracies and utilize machine learning models for outlier detection. Finally, we will investigate the potential of probability evaluation for identifying outliers based on uncertainty levels in predictions from tree-based models.
This approach has the potential to improve sensitivity in the evaluation process, allowing for the identification of data points with specific probability thresholds. Overall, our research aims to contribute to the utilization of underutilized weather data for urban environmental mapping. This can lead to a better understanding of the complexities of urban environments and ultimately support sustainable urban development.

2. Data and Methods

In this chapter, we outline the sources of data utilized for prediction and analysis within our study. These data sources are categorized into two main groups:
  • Observations:
    • Data collected from amateur weather stations serve as target values for model predictions.
    • Observations obtained from the reference station, specifically Brno-Tuřany airport station (ICAO LKTB), offer a benchmark for comparison.
  • Numerical Weather Prediction (NWP) Models:
    • Global Forecast System (GFS) with a grid resolution of 0.25 degrees,
    • European Centre for Medium-Range Weather Forecasts (ECMWF) with a grid resolution of 0.08 degrees (approximately 9 km).
The merging of these data sources provides a foundation for training, testing, and interpreting the performance of individual machine learning models. In addition, predictors are also used to re-analyze the outputs of ML models and to find underlying correlations.

2.1. Observations

The data utilized in this study comprise observations obtained from January 2016 to December 2023 from one reference professional station (Brno-Tuřany airport station), alongside two test stations: Brno Hvězdárna and Štýřice (Figure 1).
These primarily amateur-operated stations provide the main target data. At the Štýřice station, data are collected by a private individual, while the Brno Hvězdárna station is managed by professional astronomers. Neither station conforms entirely to standard sensor placement guidelines [10]. The Hvězdárna (observatory) station is positioned atop a hill, adjacent to a building, sidewalks, and trees (Figure 2). The Štýřice station is located in the garden of a residential property in the southern Štýřice district of Brno, at an altitude of 205 m above sea level [35].
At the reference station (Brno-Tuřany airport station, ICAO LKTB), we also used information about relative humidity, temperature, phenomena, or wind. These were retrieved from the METAR aviation reports provided at half-hourly intervals within ICAO standards [36].
The described data were used to create sets of predictors and target values. Over the 8 years, the records span about 70,000 h, sampled at 30, 10, and 4 min steps (Airport, Štýřice, and Hvězdárna, respectively). The abbreviations used in figures and graphs are presented in Table 1.
The aggregation and parameterization of the data are detailed in the following subsections. For the NWP data, we predominantly utilized the GFS model due to its archive dating back to 2015, allowing us to leverage the full dataset. Nonetheless, we recognize that the ECMWF’s higher resolution could significantly impact the application.

2.2. Global Models Data

Two globally accessible models, GFS and ECMWF, were employed for value prediction. GFS operates on a grid of 0.25 degrees, while ECMWF HRES utilizes a finer grid of 0.08 degrees. Temperature and wind predictions at standard measurement heights (2 m for temperature and 10 m for wind) were utilized. The analysis utilized the u and v wind components, defined as:
w_s = √(u² + v²)        (1)
thus, for the wind direction φ:
u = w_s · cos φ        (2)
v = w_s · sin φ        (3)
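The relations above between wind speed, wind direction, and the u and v components can be sketched as follows. This is a minimal illustration of Equations (1)–(3) as written in the text; function names are ours, not from the paper.

```python
import math

def wind_speed(u: float, v: float) -> float:
    """Wind speed from the u and v components: sqrt(u^2 + v^2), Equation (1)."""
    return math.hypot(u, v)

def components(ws: float, phi: float) -> tuple[float, float]:
    """u and v components from speed ws and angle phi (radians),
    following Equations (2) and (3)."""
    return ws * math.cos(phi), ws * math.sin(phi)

# Round-trip check: decompose a 5 m/s wind and recombine the components.
u, v = components(5.0, math.pi / 3)
print(round(wind_speed(u, v), 6))  # 5.0
```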
For the completeness of the analysis, we also tried to include a lifted index (LI) quantifying the instability of the atmosphere [37]. This should quantify, albeit in a limited way, the mixing of layers and overall stratification. Although we acknowledge the suitability of other indexes, LI was present in all archive extents and presents a very clear method of calculation. It is defined as the temperature difference between the particle and the environment in the layer (usually used at 500 hPa).

2.3. Data Engineering

Overall, the classification of phenomena is an interesting methodological question for this task. The assumption is that phenomena can have a significant impact on the measured quantities. However, it is not always easy to express them well enough for ML algorithms to interpret them. For example, for fog classification, it is difficult to say whether the fog is only a consequence of precipitation or was already present before it. Likewise, the distinct effects of thunderstorms, rain, or snowfall are erased by our classification, though they certainly have different consequences (Table 2). Therefore, parameterization is identified later as an essential factor for improving the prediction.
Cloud cover is listed in METAR reports in up to three layers, giving coverage amounts and base heights in hundreds of feet. We considered only BKN or OVC layers as significant and chose a simple parameterization (Table 3).
These cloud cover classes help us to easily detect any significant cloud cover effects without putting a significant burden on the computational side of the process. Using two predictors for each layer (height and cloud cover) would reduce their importance, and we would be required to encode the cloud cover.
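A sketch of this kind of parameterization is shown below. The class boundaries here are purely illustrative stand-ins, not the actual thresholds of Table 3, and the function name is ours; the point is only to show how the lowest significant (BKN/OVC) layer can be mapped to a single small ordinal class.

```python
def cloud_class(layers: list[tuple[str, int]]) -> int:
    """Map METAR cloud layers to one ordinal class.

    layers: (coverage code, base height in hundreds of feet), e.g. ("BKN", 8).
    Only BKN/OVC layers are treated as significant, as in the text.
    Returns 0 when no significant layer is present. Thresholds are
    illustrative, not the paper's Table 3.
    """
    significant = [h for cover, h in layers if cover in ("BKN", "OVC")]
    if not significant:
        return 0
    base = min(significant)          # lowest significant layer
    if base < 10:                    # below 1000 ft
        return 3
    if base < 50:                    # below 5000 ft
        return 2
    return 1                         # high significant cloud

print(cloud_class([("FEW", 25), ("BKN", 8)]))   # 3 -- low broken layer
print(cloud_class([("SCT", 30)]))               # 0 -- no significant layer
```

Encoding each layer as one small integer keeps the feature count low, which matches the computational argument made above.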
We also opted to use goniometric (sine and cosine) functions of the day of the year and the hour of the day, because ML models can then correctly interpret that, for example, the 360th day of the year is close in conditions to day 1, not to day 200, as expressed in Equations (4) and (5):
day_cos = cos(2π × day / 365)        (4)
day_sin = sin(2π × day / 365)        (5)
These equations lead to the curves (Figure 3) that could separately correspond to different atmospheric parameters. For instance, the cosine of the diurnal cycle is the shifted, inverted value of the incoming solar radiation during a cloudless day.
The aforementioned methods of interpreting predictors can help facilitate processing by ML algorithms and increase their importance in prediction.
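The cyclical encoding of Equations (4) and (5) can be sketched in a few lines; the function name is ours. The check below confirms the motivation stated above: day 365 and day 1 land at almost the same point on the unit circle, while day 200 is far away.

```python
import math

def cyclical_encode(day_of_year: int, period: int = 365) -> tuple[float, float]:
    """Sine/cosine encoding of a cyclical feature, Equations (4) and (5)."""
    angle = 2.0 * math.pi * day_of_year / period
    return math.cos(angle), math.sin(angle)

def dist(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Day 365 and day 1 are neighbours on the circle; day 200 is distant.
print(dist(cyclical_encode(365), cyclical_encode(1)))    # small (~0.017)
print(dist(cyclical_encode(1), cyclical_encode(200)))    # large (~2)
```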

2.4. Tested Machine Learning Methods

For testing ML regression methods, we used commonly applied algorithms (Table 4) available in the Python library scikit-learn [38].
After fitting the model to the training data, cross-validation was performed on the test set to estimate the regression or classification accuracy. The classification was utilized to a lesser extent, primarily for estimating the magnitude of ML system errors. The following metrics were applied during cross-validation:
  • Regression:
    • Mean Absolute Error (MAE)
    • Maximum Error
  • Classification:
    • Accuracy
    • Precision
    • Recall
    • F1 Score
While other metrics such as R-squared (R2), Root Mean Squared Error (RMSE), or Area Under the Curve (AUC) could provide more comprehensive validation, we opted for simplicity and interpretability by focusing on the aforementioned metrics, which are considered sufficiently informative.
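The metrics listed above are all available directly in scikit-learn; a minimal sketch on toy numbers (ours, not the study's data) illustrates how they would be computed during validation.

```python
from sklearn.metrics import (mean_absolute_error, max_error,
                             accuracy_score, f1_score)

# Toy regression example: observed vs. predicted temperatures (degrees C).
y_true = [10.0, 12.5, 8.0, 15.0]
y_pred = [10.5, 12.0, 9.0, 14.0]
print("MAE:", mean_absolute_error(y_true, y_pred))   # 0.75
print("MaxError:", max_error(y_true, y_pred))        # 1.0

# Toy classification example: flagged (1) vs. unflagged (0) measurements.
c_true = [1, 0, 1, 1, 0]
c_pred = [1, 0, 0, 1, 0]
print("Accuracy:", accuracy_score(c_true, c_pred))   # 0.8
print("F1:", f1_score(c_true, c_pred))               # 0.8
```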

2.5. Gradient-Boosting Multioutput Regression

The Gradient-Boosting technique, as described in prior research [39], offers a unique approach to estimating functions. Unlike traditional methods that estimate parameters directly, it optimizes the functions themselves. The method introduces gradient-boosting algorithms tailored for different fitting criteria, such as least squares and least absolute deviation. Applied to regression trees, it produces robust, easy-to-understand procedures suitable for both regression and classification tasks, and it is especially useful for analyzing noisy datasets. As for hyperparameter tuning of the GB algorithm, we can try various combinations of parameters with anticipated contributions: the number of estimators, which controls the number of weak learners in the ensemble, and the learning rate, which regulates the contribution of each learner [38]. Additionally, tree-specific parameters such as max depth, min samples split, min samples leaf, and max features influence the complexity and generalization ability of individual decision trees. The subsample parameter determines the fraction of samples used for training each learner, introducing randomness to prevent overfitting. The choice of loss function, such as mean squared error for regression or logistic loss for classification, depends on the problem type and desired prediction characteristics. Finally, regularization parameters like alpha and lambda control the regularization strength, promoting sparsity and discouraging large coefficients to mitigate overfitting.
Furthermore, this approach is enhanced by a multi-target stacking variant, which extends the technique to handle multi-target regression scenarios [40]. Additionally, a new formulation of gradient boosting for regression tasks is proposed, enabling iterative enhancement of a non-constant model. This iterative process allows the integration of prior knowledge or insights, thereby boosting performance even with limited training examples.
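A multi-target variant of gradient boosting can be sketched with scikit-learn's MultiOutputRegressor, which fits one boosted ensemble per target. This is an illustrative configuration on synthetic data, not the paper's exact model; hyperparameter values are placeholders of the kind discussed above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic stand-in data: 4 features, 2 targets (e.g. temperature and
# wind speed predicted jointly from a shared feature set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.column_stack([
    2.0 * X[:, 0] + rng.normal(scale=0.1, size=200),  # target 1
    X[:, 1] - X[:, 2],                                # target 2
])

model = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                              max_depth=3, subsample=0.8, random_state=0))
model.fit(X, y)
pred = model.predict(X[:5])
print(pred.shape)  # (5, 2) -- one column per target
```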

2.6. Exploratory Analysis

The aim of the exploratory analysis was to find out the distribution of values, how they correspond to each other, and whether some values are significantly shifted. First, we viewed histograms of temperature and wind speed measurements at the stations and the NWP models (Figure 4).
The temperature histograms do not show any significant differences in distribution. Only the airport reference station (METAR) shows a noticeable artefact of temperature being reported at 1 °C resolution. The slight shift in GFS temperatures relative to the others suggests that the GFS predicts fewer extremes and somewhat cooler temperatures, as its grid point is located outside the city, to the northeast.
For wind data, the GFS and ECMWF are quite similar, though the GFS generally predicts higher wind speeds. Interestingly, the target wind speed measured at the Hvězdárna often indicates low to no wind. This discrepancy might be attributed to the type of sensor used or its placement at a non-standard height and in a wind lee. We will further explore how machine learning models can account for this situation.
Further, the Kendall correlation was chosen as another exploratory method. This could indicate the suitability of the individual predictors (features) for modelling (Figure 5).
The correlation indicates expected high values, particularly in temperatures themselves. Observed temperatures correlate with each other around 0.88–0.92. The correlation with key predictors, namely Temperature GFS and Temperature ECMWF, is also relatively high, 0.85–0.93 for ECMWF and 0.74–0.84 for the GFS. A negative correlation of −0.5 is expected with the Lifted Index (LI), where high temperatures often result in very low LI values (indicating higher atmospheric instability). As for other variables, a high negative correlation is observed with the cosine of the day of the year. While other variables show correlations around 0.2–0.3, given their high correlations with NWP model temperatures, they may not be significantly utilized by ML models.
Correlation was also calculated for wind speed values (Figure 6) at the Hvězdárna Station (Ws Hve) and at the reference station Brno Airport (sknt—wind speed and drct—wind direction). Much lower values were expected there, as neither the selection of predictors nor the nature of the diurnal wind speed cycle is as continuous and dependent as the temperature. Moreover, its local variability may be much higher, especially in urbanized areas.
From the correlation table (Figure 6), we expect the wind modelling to be quite complex, relying mainly on the weaker predictors. The highest contribution might be from the NWP model predictions, the previous value from the reference station (sknt_sh), and the hour of the day (Hour_cos), which represents the diurnal cycle. Naturally, it also depends on the correlation and relationships of each predictor, so their importance for some methods (e.g., Random Forest) may be eliminated with each other.
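The Kendall correlation step described above is a one-liner in pandas; the sketch below uses a toy frame with made-up values and illustrative column names, not the study's actual series.

```python
import pandas as pd

# Toy frame: an observed temperature, a model temperature with the same
# rank ordering, and an unrelated wind speed column.
df = pd.DataFrame({
    "T_obs":  [1.0, 3.0, 2.0, 5.0, 4.0],
    "T_gfs":  [1.2, 2.9, 2.1, 4.8, 4.2],
    "ws_obs": [5.0, 1.0, 4.0, 2.0, 3.0],
})
corr = df.corr(method="kendall")  # Kendall rank-correlation matrix
print(corr.loc["T_obs", "T_gfs"])   # 1.0 -- identical rank ordering
print(corr.loc["T_obs", "ws_obs"])  # negative -- mostly opposite ordering
```

Because Kendall's tau is rank-based, it is robust to the monotone calibration offsets between stations and models noted in the histograms.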

2.7. Baseline Models

Initial analysis focused on comparing the performance of numerical models (NWP) with observed values, establishing a baseline against which we aim to enhance the performance of our ML algorithms. Subsequently, we employed basic uncalibrated forecast models to gauge their accuracy in responding to measurements across different stations, referring to these models as baseline models.
The fundamental benchmark for accuracy is set by the predictions of the numerical models. Therefore, we utilize the forecasts generated by the global GFS and ECMWF models, along with their Mean Absolute Error (MAE) and Maximum Error (MaxE), as the reference standard (Table 5).
For the wind forecast, we tested only the Hvězdárna and reference stations. Using the model grid points and their values directly yields the mean and maximum errors given in Table 6.
Comparing the errors in Table 5 and Table 6 alone does not provide sufficient evidence to determine the overall suitability for ML modelling. Our objective is to uncover and replicate underlying patterns; the errors observed merely indicate significant disparities between the models and the observations.
Firstly, uncalibrated models for temperature regression were tested. The predictors (features) comprised all the values listed in Table 1, including:
  • Temperature, U and V wind components from ECMWF, and GFS predicted 6 h prior to observation.
  • Lifted index from GFS model predicted 6 h prior to observation.
  • Reported temperature, relative humidity, wind speed, and wind direction from 6 h prior at the reference station Brno Airport,
  • Hour of the day and day of the year.
In this case, because all the predictors shown in Figure 6 were used, some models may have been hindered by the unreduced predictor space: many predictors were irrelevant or highly correlated. While the performance of, e.g., Decision Trees is among the worst at both stations, a big performance contrast can be seen for ElasticNet and Lasso. These perform poorly at the Štýřice station but are among the better performers at the Hvězdárna station. Among effects such as hyperparameter tuning and other external factors, we select two that we assume to be the main drivers of this difference:
  • Data characteristics: Hvězdárna measurements could fit better to the assumptions of the ElasticNet and Lasso regressions (namely linearity and sparsity).
  • Feature Importance: Elastic Net and Lasso use feature selection techniques that penalize less important features by shrinking their coefficients towards zero. Thus, it seems like the station Hvězdárna has fewer irrelevant or redundant features, therefore these models may perform better by effectively selecting the most relevant predictors.
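The shrinkage behaviour described in the second point can be demonstrated directly: on synthetic data where only one feature matters, Lasso drives the coefficients of the irrelevant predictors to (near-)zero. Data and the alpha value are illustrative, not fitted to the stations' measurements.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 5 features, but only feature 0 drives the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 3))
# Expect a large coefficient for feature 0 and (near-)zero for the rest --
# the built-in feature selection discussed in the text.
```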
Since the number of relevant predictors can be an important factor, for the following example (Figure 7) we reduced the set of predictors to: Lifted Index, Temperature GFS, Temperature ECMWF, Temperature from the reference station (T-6 h), Dewpoint from the reference station (T-6 h), and Day_of_the_year_sin.
After the reduction of the feature space, two different situations occurred (Figure 8). At the Štýřice station, the performance of the models equalized, or worsened for the previously well-performing models (GB). For the Brno-Hvězdárna station, the results generally improved slightly, but the ranking remained more or less unchanged. Here, we identify several possible reasons:
  • Data quality: the data at the Hvězdárna station may be of better quality, with fewer errors compared to the first station.
  • Feature relevance: although the feature space is reduced to only relevant features at both stations, the relevance of these features at the two stations may differ.
  • Underlying patterns: This may be due to factors such as more consistent weather patterns, less variability in environmental conditions, or better instrumentation.
For the wind speed, we tested the same set of models (Figure 9). We predicted mean wind speed from the last measurement interval (4 min for Hvězdárna). A reduced set of predictors was used, limited only to U and V components of wind from GFS and ECMWF, GFS Lifted index, Hour_sin, Hour_cos, Day_of_year_sin, Day_of_year_cos, skyl1, Temperature ECMWF, Wind speed and direction from GFS and ECMWF.
Figure 8. Comparison of the regression model based on the reduced space of predictors: predicting temperature at (a) Brno-Štýřice; (b) Brno-Hvězdárna.
The findings demonstrate generally accurate predictions, as all models achieved a Mean Absolute Error (MAE) of about 1 m/s. However, maximum errors ranging from 3 to 8 m/s indicate room for improvement. This discrepancy is likely attributable to abrupt wind speed changes that the global model struggles to capture. In the Results section, we delve deeper into reducing these high errors.
In terms of method comparison, Gradient Boosting and Random Forest exhibited notably stable performance. We plan to further explore these models by visualizing feature importance, which will elucidate the variables that significantly influenced each model’s predictions.

3. Results

In this section, we illustrate how the ML models performed and what features they were based on. We also explore the potential of the Gradient Boosting Multi-output model and discuss its predictions. Finally, we attempt to use another ML model to explore the potential of the model’s predictive inaccuracies.

3.1. Comparison of the Calibrated Models

For further calibration, the grid-search cross-validation algorithm was employed to search for optimal model hyperparameters within the defined grid. The most accurate model configuration was then selected based on the chosen accuracy metric.
The selected models, including Linear Regression, Gradient Boosting, Random Forest, and Bayesian Ridge Regression, were applied to the reduced predictor set established for the baseline models. Calibration was performed using grid search with the MAE and MaxError metrics, resulting in two configurations (Table 7). This approach assesses the extent to which model performance can be improved to minimize errors.
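As an illustration, the grid-search calibration described above can be reproduced with scikit-learn's GridSearchCV. The data and hyperparameter grid below are synthetic placeholders, not the study's actual configuration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the station predictor matrix.
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

# Hypothetical grid; the search space used in the study may differ.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

# One search per metric: "neg_mean_absolute_error" for the MAE configuration,
# "max_error" for the maximum-error configuration.
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Cross-validated MAE:", -search.best_score_)
```

Running the same search twice, once per scoring metric, yields the two configurations compared in Table 7.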
The table shows that the calibrated models predict relatively similarly across stations. Interestingly, the linear models (Linear Regression and Bayesian Ridge) can outperform GB and RF in maximum error at the reference (METAR) station and at the Štýřice station, where the tree-based models cannot fully exploit the limited predictor space. However, GB and RF perform better at the Hvězdárna station, which generally showed lower correlations between predictors and predicted values and which we therefore expected to be harder to predict. This expectation was confirmed for the linear models, but not for the tree-based ones.
Upon comparison with the uncalibrated models (Figure 6 and Figure 7), it becomes evident that hyperparameter tuning may not be indispensable for the methods used within this dataset (Table 8). Instead, a more significant improvement in model performance was achieved through better predictor selection.
While investigating wind prediction, it was confirmed that hyperparameter tuning holds significance and warrants attention, although it may not be indispensable. Instead, the selection of predictors and the quality of the underlying data serve as pivotal factors. Nonetheless, the importance of hyperparameter tuning might escalate when utilizing alternative predictors. This scenario arises when predictors offer better quality and effectively capture the details of wind.

3.2. Ensemble Evaluation and GB Multi-Output Model

As a subsequent step, we employed an ensemble approach to evaluate the collective performance of all models. We assessed the frequency of occurrences where all models collectively predicted values with a difference greater than 2 °C, cases where none of the models made such errors, instances where only some models deviated, and so forth (Figure 10). This analysis enables us to leverage model agreement as a means to flag values based on the accuracy of their predictions. We operate under the assumption that any sudden station failure or malfunction would adversely affect this agreement, even under otherwise normal conditions.
The graphical analysis reveals that employing eight models (excluding Decision Tree, recognized as the most inconsistent) allows for the identification of reliable measurements in approximately 60% of cases. Even with the acceptance of one or two deviating models, the flagged measurements extend to nearly 70–75%. Conversely, approximately 10% of measurements exhibit errors across all models, which, under normal circumstances, might be attributed to inherent inaccuracies in ML models or predictors. However, in scenarios involving convection or inversion, such errors may occur despite accurate measurements.
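The agreement statistics above can be computed by counting, per measurement, how many ensemble members deviate by more than the tolerance. A minimal sketch (function name and toy data are illustrative):

```python
import numpy as np

def deviation_counts(predictions, observed, threshold=2.0):
    """Per sample, count how many ensemble members miss the observation
    by more than `threshold` (here in degrees C), and return the share of
    samples for each possible count (0 .. n_models)."""
    errs = np.stack([np.abs(np.asarray(p) - observed) > threshold
                     for p in predictions.values()])  # (n_models, n_samples)
    n_dev = errs.sum(axis=0)
    shares = np.bincount(n_dev, minlength=len(predictions) + 1) / len(observed)
    return n_dev, shares

# Toy example with three "models" and four measurements:
obs = np.array([10.0, 12.0, 15.0, 20.0])
preds = {"gb": obs + 0.5, "rf": obs - 0.3,
         "lr": np.array([10.2, 15.5, 14.8, 23.0])}
n_dev, shares = deviation_counts(preds, obs)
```

With counts in hand, `shares[0]` is the fraction of fully reliable measurements (no deviating model), and `shares[-1]` the fraction where all models fail together.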
Subsequently, we developed an ensemble summary to quantify instances where all ML models exhibit a bias exceeding 2 °C at each station. Among the 372 samples examined, the reference station, Brno Airport, emerges as the most accurately predicted. Additionally, we identify cases termed ‘Unique Errors’, where all models correctly predict values at two stations but fail collectively at one (Table 9).
This characteristic offers an ensemble evaluation of station prediction quality and its consistency with others. When applied to a larger dataset encompassing dozens of stations, we anticipate observing robust model agreement across most stations, with unique errors potentially pinpointing faulty ones.
However, training a substantial number of models for each station individually may prove unnecessarily laborious, particularly for multiple stations in close proximity. To address this challenge, we investigate the application of a single Gradient Boosting (GB) multi-output regression model, which achieved an even lower MAE than the single-output GB model (Table 10). Multi-output regression offers the advantage of training only one model and simplifying interpretation, streamlining the process despite the dataset's scale and geographic clustering.
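In scikit-learn, GradientBoostingRegressor does not handle multiple targets natively; one common formulation, sketched below, wraps it in MultiOutputRegressor, which fits one booster per target behind a single interface. Whether the authors used this wrapper or another multi-output formulation is our assumption; the data here are synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Three targets stand in for the three stations (Airport, Štýřice, Hvězdárna).
X, Y = make_regression(n_samples=300, n_features=10, n_targets=3,
                       noise=5.0, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X_tr, Y_tr)

# Per-station MAE from a single fitted object.
mae_per_station = mean_absolute_error(Y_te, model.predict(X_te),
                                      multioutput="raw_values")
print(mae_per_station)
```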
Gradient Boosting multi-output regression performed quite well, even outperforming the single-output model. However, it showed a very high maximal error compared with the calibrated models in the previous section. The results suggest that multi-output regression could be a suitable choice for quality control systems. Overall Kendall correlations were then calculated for the GB multi-output model predictions (Table 11).
According to Kendall’s correlation of the prediction and measurement results, it would appear that the professional station at the airport is the most predictable, followed by Štýřice and then Hvězdárna. However, the differences are relatively small. What is important is the improvement in correlations over the prediction models. Further, we analyzed the Kendall correlation of the prediction errors from the multi-output Gradient Boosting model with the complete set of predictors (Figure 11).
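The error-predictor correlations can be computed with scipy's kendalltau. The sketch below uses synthetic values standing in for the model's absolute errors and one predictor (skyl1, height of the lowest cloud layer); the numbers are illustrative only:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
# Hypothetical absolute prediction errors and a weakly related predictor.
abs_error = rng.gamma(shape=2.0, scale=0.5, size=200)
skyl1 = 2000.0 + 50.0 * abs_error + rng.normal(0.0, 300.0, size=200)

# Kendall's tau is rank-based, so it is robust to the skewed error distribution.
tau, p_value = kendalltau(abs_error, skyl1)
print(f"tau = {tau:.2f}, p = {p_value:.3f}")
```

Repeating this for each predictor column yields the correlation matrix summarized in Figure 11.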
The results exhibit very low correlation coefficients, indicating that for these predictors and parameters, we can deduce the following:
  • For the Štýřice station, modeling mixing in the boundary layer may be crucial, given the correlation of 0.12 with skyl1 (height of the lowest cloud layer), −0.11 with relh, and −0.10 with LI. However, the change in RH at the reference station (relh_change) correlates with neither station by more than 0.08.
  • At both amateur stations, it seems that higher error slightly correlates with higher temperatures of the predictive numerical models GFS and ECMWF.
  • None of the predicted values correlate with the difference between the model (ECMWF or GFS) and the measured value.
An analysis of these results suggests that modeling at the reference station is the most independent, whereas there is still room for post-analysis of the results at the amateur stations.

3.3. ML Outliers Analysis

We investigate whether the single-output GB model error correlates with specific temperature conditions in the ECMWF model, using data from the Hvězdárna station. We define high-error forecasts as those differing by more than 2 °C from observations, a criterion aligned with ICAO aviation forecasts [36].
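This high-error criterion reduces to a simple binary label per sample. A minimal sketch (function name is illustrative):

```python
import numpy as np

def label_high_errors(predicted, observed, threshold=2.0):
    """Return 1 where the prediction misses the observation by more than
    `threshold` degrees C (the ICAO-aligned tolerance), else 0."""
    diff = np.abs(np.asarray(predicted, float) - np.asarray(observed, float))
    return (diff > threshold).astype(int)

labels = label_high_errors([10.1, 14.0, 20.5], [10.0, 11.5, 18.0])
```

These labels serve as the classification target for the outlier-detection models in the next step.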
The correlation values alone suggest a potential relationship between changes in relative humidity and temperature prediction accuracy. Specifically, the Kendall correlation coefficient between relative humidity at the airport and temperature at Štýřice was −0.33, whereas relative humidity six hours prior showed no correlation (0.0). This prompts the question: can a significant change in relative humidity lead to substantial errors in ML prediction?
Figure 12 indicates a potential trend wherein the GB model exhibits higher errors at higher temperatures. However, discerning any correlation in the histograms proves challenging. In practical terms, leveraging this knowledge for prediction necessitates prior knowledge of temperature changes. Given the apparent complexity of the patterns depicted in the graphs (Figure 12), we aim to employ a machine learning model to detect outliers, recognizing that the relationships may be multifaceted and not solely dependent on one variable. Nevertheless, for machine analysis, it is imperative to define and present this relationship as a predictor for the ML models.
Hence, we attempted to implement Gradient Boosting and Random Forest classification models to identify deviations from the ML model in advance. This approach would establish an additional control layer capable of detecting inaccuracies in ML models through predictions. Introducing an algorithm capable of estimating ML methods’ errors would enable the integration of another control mechanism. Essentially, such a model would indicate whether the predicted situation aligns with typical expectations or deviates significantly. The accuracy metrics of these models are summarized in Table 12.
Based on the feature importance that can be calculated in tree-based models, we can shed light on what the detection models relied on. Importance represents how much the model's performance would be reduced if a particular predictor were removed. Naturally, the nature of both models plays a role: GB can utilize weak learners, while RF predominantly uses strong learners, i.e., fully developed decision trees. Also, owing to its sequential nature, GB can capture interactions between features more efficiently than RF. This can lead to differences in the order of feature importance (Table 13), especially when important interactions exist that RF cannot capture due to its independent tree-building process.
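In scikit-learn, these importances are exposed as the fitted model's feature_importances_ attribute. The sketch below ranks predictors for an illustrative Random Forest classifier; the feature names and data are placeholders, not the study's actual set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical predictor names; the study's feature set is larger.
names = ["T_ECMWF", "LI", "relh", "skyl1", "Hour_sin"]
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances sum to 1; higher values indicate predictors
# whose removal would degrade the model more.
for i in np.argsort(clf.feature_importances_)[::-1]:
    print(f"{names[i]:>8s}: {clf.feature_importances_[i]:.3f}")
```

The same attribute exists on GradientBoostingClassifier, allowing a side-by-side comparison of the two rankings as in Table 13.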
The assessment of feature importance revealed a common emphasis placed by both models on ECMWF temperature, followed by the lifted index, albeit with varying degrees of importance attributed to each. This analysis offers insights into the primary determinants of model discrepancies. However, it is crucial to consider the detection success rate, particularly concerning higher errors, which proved to be relatively low. Hence, while the foundation for outlier detection is understood, the priority lies in enhancing the predictive model first, followed by refining the evaluation model. Only upon achieving satisfactory accuracy can the full significance of feature importance be realized.

3.4. Probability of Outlier Prediction

Our last experiment involved utilizing the forecasted probability of significant errors (instances where the model predicted a value differing by more than 2 °C from the sensor measurement). Several methods offer predictions of this probability, among which we selected Random Forest due to its demonstrated capacity for distinct probability distributions in successful and unsuccessful predictions.
In Figure 13, we can see the distribution of the predicted probabilities. In the ideal case, we would see the highest probabilities for Hits and the lowest for Misses. This could draw the line that could divide reliable and certain classifications from the hard-to-predict situations.
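The Hit/Miss probability distributions can be obtained from predict_proba of the fitted classifier. A minimal sketch on synthetic, imbalanced data (the minority class plays the "high error" label; the setup is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, weights=[0.85, 0.15],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]      # predicted P(high error)
pred = (proba >= 0.5).astype(int)

hit_probs = proba[pred == y_te]            # correctly classified (Hits)
miss_probs = proba[pred != y_te]           # misclassified (Misses)
print("Hit probability quartiles:", np.percentile(hit_probs, [25, 50, 75]))
```

Well-separated Hit and Miss distributions would allow a probability threshold to divide reliable classifications from hard-to-predict situations, as discussed for Figure 13.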
We expected the tuned Decision Tree model to make more diverse predictions based on the data, potentially deviating from the average values. However, both successful and unsuccessful predictions fell within a narrow range, with quartiles (25%, 50%, 75%) differing by only 1%. This indicates poor performance for Decision Trees.
Gradient Boosting showed the opposite behavior: surprisingly, it assigned even higher probabilities to incorrect (Miss) predictions than to correct (Hit) ones.

4. Discussion

This research explored the potential of machine learning (ML) methods for quality control and outlier identification in amateur weather station data. We utilized data from Brno Štýřice and Brno Hvězdárna (amateur stations) alongside Brno Airport (professional station) to train and evaluate various ML models. Our primary objective was twofold:
  • To assess the viability of ML models for predicting measured values and establishing a quality control framework.
  • To evaluate ML methods for outlier identification and understanding the causes of higher model errors.
Our research yielded several key findings regarding the application of ML for quality control in amateur weather station data:
  • Effectiveness of Simple ML Models: The study demonstrates that even simple ML models, trained on data from professional stations and global models (GFS and ECMWF), can effectively predict temperature for amateur stations. This establishes a framework for quality control, with Mean Absolute Error (MAE) within 2 °C for 6 h temperature predictions, meeting ICAO forecasting requirements.
  • Feature Engineering over Hyperparameter Tuning: Interestingly, feature engineering, particularly simplifying the predictor space (reducing the number of variables), proved more impactful than hyperparameter tuning for the simpler models, such as Decision Trees. This suggests that focusing on selecting the most relevant predictors can enhance model performance.
  • Importance of Model Diversity and Ensemble Approach: No single model emerged as demonstrably superior. This highlights the importance of a diverse ensemble approach for cross-validation and overall robustness. Up to 50% agreement was observed among eight models for temperature prediction, increasing to 75% with a 2 °C tolerance. This ensemble approach provides a more reliable picture of data quality.
  • Station Quality Assessment through Predictability: The performance of machine learning models in predicting station data can serve as an indicator of station quality. Stations with complex relationships in their data or excessive noise may exhibit lower predictability using ML methods. This opens up the possibility of using ML as a tool for station evaluation.
  • Proposed 7-Step Quality Control Process: To address quality control and outlier identification, we propose a novel 7-step process utilizing various models and techniques for data quality labeling (Table 14). This process can potentially serve as a foundation for a robust statistical control framework, enabling the flagging of potentially erroneous data points.
By adopting and modifying this approach, we can derive additional measurement characteristics, enabling enhanced flagging of anomalous observations that fall outside the typical error distribution predicted by the ML models.
While this study offers promising results, there are limitations to consider:
  • Global Model Reliance and Incomplete Predictors: The current approach relies on data from global models, which may not capture all local phenomena. Future work will incorporate data from local NWP models when they become available, potentially leading to improved accuracy and novel correlations with measured variables. Additionally, advanced parameterization techniques for cloud cover and other phenomena will be explored.
  • Scalability and Automation: The current evaluation process requires significant manual effort. We plan to test these procedures on much larger datasets, with a focus on automating calibration and post-analysis.
  • Deep Learning Exploration: While this study primarily focused on simpler ML methods, we acknowledge the potential of deep learning approaches. Future research will explore the application of deep learning techniques alongside testing the entire system’s labeling capabilities.
  • Noisy Data Testing: Machine learning models can struggle with long-term performance when encountering extremely noisy or low-quality data that may be difficult to identify in real-world scenarios. To address this, we plan to test our trained models’ ability to detect synthetically generated noisy data in further evaluations. This will help assess the models’ robustness and identify potential weaknesses in handling data quality issues.
Our findings on the effectiveness of ML models for QC of data from non-professional weather stations within Brno lay the groundwork for potential policy considerations. Firstly, the demonstrably improved accuracy and reliability of urban weather data through ML-based QC suggests the merit of incorporating these techniques into standardized data quality control procedures for urban weather monitoring systems. This could involve establishing guidelines or best practices for deploying ML models for QC with consideration of the varying quality of the sensors [41].
Secondly, the research paves the way for the development of standardized data quality metrics specifically tailored to the complexities of urban environments. These metrics could consider factors like station siting [42], sensor quality [41], and the other challenges associated with urban heat islands. By establishing such standards, policymakers can ensure consistent and reliable data collection across various urban weather monitoring networks.
It is necessary to conduct further research to explore the cost-effectiveness and scalability of implementing these ML-based procedures across diverse urban settings. Additionally, considerations around data privacy and ownership, particularly for crowdsourced data, would need to be addressed within any policy framework.
In conclusion, this research demonstrates the potential of machine learning methods for quality control and outlier identification in amateur weather station data, even on global data with a long lead time. The proposed 7-step process offers a framework for robust data quality labelling. Future work will focus on incorporating local models, advanced parameterization, and exploring deep learning approaches to further enhance the system’s accuracy and scalability. Overall, this study contributes to the field of urban weather monitoring by leveraging the rich data available from amateur stations while ensuring data quality through machine learning-based quality control.

Author Contributions

Conceptualization, D.S. and V.T.; methodology, D.S.; software, D.S.; validation, L.M. and D.S.; formal analysis, L.M.; resources, V.T.; data curation, D.S.; writing—original draft preparation, D.S.; writing—review and editing, L.M.; visualization, D.S. and L.M.; supervision, V.T.; funding acquisition, V.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the defense research project DZRO VAROPS managed by the University of Defence, Brno, NATO—STO Support Project (CZE-AVT-380) Ground Vehicle Ride Quality Testing and Analysis with Complex Terrain and specific research project 2024–26 SV24-210/2 managed by the University of Defence, Brno.

Data Availability Statement

Meteorological data were derived from the following resources available in the public domain at: National Centers for Environmental Prediction/National Weather Service/NOAA/U.S. Department of Commerce, updated daily. NCEP GFS 0.25 Degree Global Forecast Grids Historical Archive. Research Data Archive at the National Center for Atmospheric Research, Computational and Information Systems Laboratory. 2015, https://doi.org/10.5065/D65D8PWK, accessed on 14 January 2024. Measurement data can be provided by the authors upon request and providers’ consent.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhu, Y.; Zhang, S.; Li, Y.; Lu, H.; Shi, K.; Niu, Z. Social Weather: A Review of Crowdsourcing-Assisted Meteorological Knowledge Services through Social Cyberspace. Geosci. Data J. 2020, 7, 61–79. [Google Scholar] [CrossRef]
  2. Meier, F.; Fenner, D.; Grassmann, T.; Otto, M.; Scherer, D. Crowdsourcing Air Temperature from Citizen Weather Stations for Urban Climate Research. Urban Clim. 2017, 19, 170–191. [Google Scholar] [CrossRef]
  3. Nipen, T.N.; Seierstad, I.A.; Lussana, C.; Kristiansen, J.; Hov, Ø. Adopting Citizen Observations in Operational Weather Prediction. Bull. Am. Meteorol. Soc. 2020, 101, E43–E57. [Google Scholar] [CrossRef]
  4. Muller, C.L.; Chapman, L.; Johnston, S.; Kidd, C.; Illingworth, S.; Foody, G.; Overeem, A.; Leigh, R.R. Crowdsourcing for Climate and Atmospheric Sciences: Current Status and Future Potential. Int. J. Climatol. 2015, 35, 3185–3203. [Google Scholar] [CrossRef]
  5. Chapman, L.; Bell, C.; Bell, S. Can the Crowdsourcing Data Paradigm Take Atmospheric Science to a New Level? A Case Study of the Urban Heat Island of London Quantified Using Netatmo Weather Stations. Int. J. Climatol. 2017, 37, 3597–3605. [Google Scholar] [CrossRef]
  6. Hubbard, K.G.; Guttman, N.B.; You, J.; Chen, Z. An Improved Qc Process for Temperature in the Daily Cooperative Weather Observations. J. Atmos. Ocean. Technol. 2007, 24, 206–213. [Google Scholar] [CrossRef]
  7. Napoly, A.; Grassmann, T.; Meier, F.; Fenner, D. Development and Application of a Statistically-Based Quality Control for Crowdsourced Air Temperature Data. Front. Earth Sci. 2018, 6, 118. [Google Scholar] [CrossRef]
  8. Mateo, M.A.F.; Leung, C.K.-S. Design and Development of a Prototype System for Detecting Abnormal Weather Observations. In Proceedings of the 2008 C3S2E Conference on—C3S2E ’08, Montreal, QC, Canada, 12–13 May 2008; pp. 45–59. [Google Scholar] [CrossRef]
  9. Bruns, J.; Riesterer, J.; Wang, B.; Riedel, T.; Beigl, M. Automated Quality Assessment of (Citizen) Weather Stations. J. Geogr. Inf. Sci. 2018, 1, 65–81. [Google Scholar] [CrossRef]
  10. World Meteorological Organization. Guide to Meteorological Instruments and Methods of Observation, 7th ed.; World Meteorological Organization: Geneva, Switzerland, 2008; p. 716. [Google Scholar]
  11. Varentsov, M.I.; Samsonov, T.E.; Kargashin, P.E.; Korosteleva, P.A.; Varentsov, A.I.; Perkhurova, A.A.; Konstantinov, P.I. Citizen Weather Stations Data for Monitoring Applications and Urban Climate Research: An Example of Moscow Megacity. IOP Conf. Ser. Earth Environ. Sci. 2020, 611, 012055. [Google Scholar] [CrossRef]
  12. Gharesifard, M.; Wehn, U.; van der Zaag, P. Towards Benchmarking Citizen Observatories: Features and Functioning of Online Amateur Weather Networks. J. Environ. Manag. 2017, 193, 381–393. [Google Scholar] [CrossRef]
  13. Liu, T.; Zhao, G.; Wang, H.; Hou, X.; Qian, X.; Hou, T. Finding Optimal Meteorological Observation Locations by Multi-Source Urban Big Data Analysis. In Proceedings of the 2016 7th International Conference on Cloud Computing and Big Data (CCBD), Macau, China, 16–18 November 2016; pp. 175–180. [Google Scholar] [CrossRef]
  14. Sun, Q.; Miao, C.; Duan, Q.; Ashouri, H.; Sorooshian, S.; Hsu, K.-L. A Review of Global Precipitation Data Sets: Data Sources, Estimation, and Intercomparisons. Rev. Geophys. 2018, 56, 79–107. [Google Scholar] [CrossRef]
  15. Estévez, J.; Gavilán, P.; Giráldez, J.V. Guidelines on Validation Procedures for Meteorological Data from Automatic Weather Stations. J. Hydrol. 2011, 402, 144–154. [Google Scholar] [CrossRef]
  16. Dejmal, K.; Kolar, P.; Novotny, J.; Roubalova, A. The Potential of Utilizing Air Temperature Datasets from Non-Professional Meteorological Stations in Brno and Surrounding Area. Sensors 2019, 19, 4172. [Google Scholar] [CrossRef] [PubMed]
  17. Štěpánek, P.; Řezníčková, L.; Brázdil, R. Homogenization of daily air pressure and temperature series for Brno (Czech Republic) in the period 1848–2005. In Proceedings of the Fifth Seminar for Homogenization and Quality Control in Climatological Databases, Geneva, Switzerland, 29 May–2 June 2006; pp. 107–121. [Google Scholar]
  18. Newman, M. An Empirical Benchmark for Decadal Forecasts of Global Surface Temperature Anomalies. J. Clim. 2013, 26, 5260–5269. [Google Scholar] [CrossRef]
  19. Bartos, I.; Jánosi, I.M. Nonlinear Correlations of Daily Temperature Records over Land. Nonlinear Process. Geophys. 2006, 13, 571–576. [Google Scholar] [CrossRef]
  20. Smith, D.M.; Cusack, S.; Colman, A.W.; Folland, C.K.; Harris, G.R.; Murphy, J.M. Improved Surface Temperature Prediction for the Coming Decade from a Global Climate Model. Science 2007, 317, 796–799. [Google Scholar] [CrossRef] [PubMed]
  21. Lam, R.; Sanchez-Gonzalez, A.; Willson, M.; Wirnsberger, P.; Fortunato, M.; Alet, F.; Ravuri, S.; Ewalds, T.; Eaton-Rosen, Z.; Hu, W.; et al. Learning Skillful Medium-Range Global Weather Forecasting. Science 2023, 382, 1416–1421. [Google Scholar] [CrossRef] [PubMed]
  22. Werth, D.; Garrett, A. Patterns of Land Surface Errors and Biases in the Global Forecast System. Mon. Weather. Rev. 2011, 139, 1569–1582. [Google Scholar] [CrossRef]
  23. Massonnet, F.; Bellprat, O.; Guemas, V.; Doblas-Reyes, F.J. Using Climate Models to Estimate the Quality of Global Observational Data Sets. Science 2016, 354, 452–455. [Google Scholar] [CrossRef] [PubMed]
  24. Cifuentes, J.; Marulanda, G.; Bello, A.; Reneses, J. Air Temperature Forecasting Using Machine Learning Techniques: A Review. Energies 2020, 13, 4215. [Google Scholar] [CrossRef]
  25. Alves, D.; Mendonça, F.; Mostafa, S.S.; Morgado-Dias, F. The Potential of Machine Learning for Wind Speed and Direction Short-Term Forecasting: A Systematic Review. Computers 2023, 12, 206. [Google Scholar] [CrossRef]
  26. Khosravi, A.; Koury, R.N.N.; Machado, L.; Pabon, J.J.G. Prediction of Wind Speed and Wind Direction Using Artificial Neural Network, Support Vector Regression and Adaptive Neuro-Fuzzy Inference System. Sustain. Energy Technol. Assess. 2018, 25, 146–160. [Google Scholar] [CrossRef]
  27. Chen, Y.; Dong, Z.; Wang, Y.; Su, J.; Han, Z.; Zhou, D.; Zhang, K.; Zhao, Y.; Bao, Y. Short-Term Wind Speed Predicting Framework Based on Eemd-Ga-Lstm Method under Large Scaled Wind History. Energy Convers. Manag. 2021, 227, 113559. [Google Scholar] [CrossRef]
  28. Vassallo, D.; Krishnamurthy, R.; Fernando, H. Utilizing Physics-Based Input Features within a Machine Learning Model to Predict Wind Speed Forecasting Error. Wind. Energy Sci. Discuss. 2021, 6, 295–309. [Google Scholar] [CrossRef]
  29. Wang, L.; Li, X.; Bai, Y. Short-Term Wind Speed Prediction Using an Extreme Learning Machine Model with Error Correction. Energy Convers. Manag. 2018, 162, 239–250. [Google Scholar] [CrossRef]
  30. Spiliotis, E.; Petropoulos, F.; Nikolopoulos, K. The Impact of Imperfect Weather Forecasts on Wind Power Forecasting Performance: Evidence from Two Wind Farms in Greece. Energies 2020, 13, 1880. [Google Scholar] [CrossRef]
  31. Kim, D.-J.; Lee, D.-I.; Kim, J.-J.; Park, M.-S.; Lee, S.-H. Development of a Building-Scale Meteorological Prediction System Including a Realistic Surface Heating. Atmosphere 2020, 11, 67. [Google Scholar] [CrossRef]
  32. Tepanosyan, G.; Asmaryan, S.; Muradyan, V.; Avetisyan, R.; Hovsepyan, A.; Khlghatyan, A.; Ayvazyan, G.; Dell’Acqua, F. Machine Learning-Based Modeling of Air Temperature in the Complex Environment of Yerevan City, Armenia. Remote Sens. 2023, 15, 2795. [Google Scholar] [CrossRef]
  33. Dobrovolný, P. The Surface Urban Heat Island in the City of Brno (Czech Republic) Derived from Land Surface Temperatures and Selected Reasons for Its Spatial Variability. Theor. Appl. Climatol. 2013, 112, 89–98. [Google Scholar] [CrossRef]
  34. Dobrovolný, P.; Krahula, L. The Spatial Variability of Air Temperature and Nocturnal Urban Heat Island Intensity in the City of Brno, Czech Republic. Morav. Geogr. Rep. 2015, 23, 8–16. [Google Scholar] [CrossRef]
  35. Amatérská Meteorologická Stanice. Available online: https://www.meteo-styrice.cz/ (accessed on 16 April 2024).
  36. International Civil Aviation Organization (ICAO). Annex 3 to the Convention on International Civil Aviation: Meteorological Service for International Air Navigation, 20th ed.; International Civil Aviation Organization: Montréal, QC, Canada, 2018. [Google Scholar]
  37. Galway, J.G. The lifted index as a predictor of latent instability. Bull. Am. Meteorol. Soc. 1956, 37, 528–529. [Google Scholar] [CrossRef]
  38. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Louppe, G.; Prettenhofer, P.; Weiss, R.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  39. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  40. Wozniakowski, A.; Thompson, J.; Gu, M.; Binder, F.C. A New Formulation of Gradient Boosting. Mach. Learn. Sci. Technol. 2021, 2, 045022. [Google Scholar] [CrossRef]
  41. Lefevre, A.; Malet-Damour, B.; Boyer, H.; Rivière, G. Advancing Urban Microclimate Monitoring: The Development of an Environmental Data Measurement Station Using a Low-Tech Approach. Sustainability 2024, 16, 3093. [Google Scholar] [CrossRef]
  42. Sládek, D. Applicability of Python-based geospatial analysis for operational assessment of weather station deployment. In Proceedings of the 2023 International Conference on Military Technologies (ICMT), Brno, Czech Republic, 23–26 May 2023; pp. 1–8. [Google Scholar] [CrossRef]
Figure 1. Position of measurement stations (red), GFS model point, and used ECMWF model point.
Figure 2. Hvězdárna station sensors for wind (red frame) and temperature and humidity (orange frame) measurements.
Figure 3. Cos and Sin functions applied to the day of the year, generating a continuous parameter.
Figure 4. Histograms of (a) Temperature measured on stations and predicted by the models GFS and ECMWF; (b) Wind speed measured at the station Hvězdárna and Brno-Airport (reference) and wind speed predicted by the models GFS and ECMWF.
Figure 5. Kendall correlation of temperature at the Hvězdárna Station (T Hve), Štýřice (T STY), and Brno Airport (tmpc), and variables from global models (LI—Lifted Index GFS, T GFS—Temperature U_/V_wind, Ws—Wind Speed, Wd—Wind Direction, WdDeg—Wind Direction in Degrees), or at the reference station Brno Airport (dwpc—Dewpoint, relh—Relative Humidity, drct—Wind Direction, sknt—Wind Speed, CLD—Ceiling, and all these variables were also taken 6 h prior with the suffix ‘_sh’).
Figure 6. Kendall correlation of the predictors and observed values. Abbreviations correspond to Table 1: values from the airport reference station: tmpc (Temperature), dwpc (dewpoint), relh (relative humidity), drct (wind direction), sknt (wind speed), CLD (ceiling class).
Figure 7. Comparison of the regression models for predicting temperature at (a) Brno-Štýřice; (b) Brno-Hvězdárna.
Figure 9. Comparison of the machine learning models predicting wind speed at station (a) Hvězdárna; (b) Airport Reference station.
Figure 10. Comparison of the calibrated models agreement using (a) 8 models; (b) 7 most accurate models. Predictions of the temperature at Štýřice station.
Figure 11. Kendall correlations of the absolute error of the multioutput GB model and reference station measurements (METAR_diff), Hvězdárna (Hve_diff) and Štýřice (Sty_diff).
Figure 12. Histograms of (a) ECMWF temperatures overall and for single-output GB model predictions with an absolute error higher than 2 °C; (b) RH change in the 6 h before measurements, overall and for cases with high temperature-modelling errors.
Figure 13. Predicted probabilities for (a) Hits—temperature prediction error was classified correctly; (b) Misses—temperature difference was classified incorrectly.
Table 1. Overview of predictors, abbreviations, and their role in Figures and Tables, including classification as Predictor, Target Value, or adjusted/aggregated Value for prediction. METAR represents airport data source.
| Name [Unit] | Abb. | Role |
|---|---|---|
| Temperature Štýřice [°C] | Obs_T0 | Target |
| Temperature Hvězdárna [°C] | Obs_T0_H | Target |
| Wind Speed Hvězdárna [m/s] | Ws_H | Target |
| METAR Wind direction [°] | drct 1 | Predictor, value 6 h before |
| METAR Wind speed [kt] | sknt 1 | Predictor, value 6 h before |
| METAR temperature [°C] | tmpc 1 | Predictor, value 6 h before |
| METAR dewpoint [°C] | dwpc 1 | Predictor, value 6 h before |
| METAR relative humidity [%] | relh 1 | Predictor, value 6 h before |
| METAR phenomena | wx codes | Predictor, aggregated value 6 h before |
| METAR Cloud Cover, layer 1–3 [Abb.] 2 | skyc1,2,3 | Predictor, aggregated value 6 h before |
| METAR Cloud Base, layer 1–3 [ft] | skyl1,2,3 | Predictor, aggregated value 6 h before |
| Temperature in 2 m GFS [°C] | Temperature_6C_GFS | Predictor |
| Temperature in 2 m ECMWF [°C] | Temperature_6C_ECMWF | Predictor |
| Lifted Index (GFS) | LI | Predictor |
| U-component of wind (GFS, ECMWF) | U_component_of_wind | Predictor |
| V-component of wind (GFS, ECMWF) | V_component_of_wind | Predictor |
1 METAR values with the suffix “_sh” represent the value taken 6 h earlier. 2 Aviation codes FEW, SCT, BKN, OVC, CAVOK, or NSC were used.
Table 2. Classification of the phenomena into three categories based on the assumed influence on temperature.
| Category | Name | Example |
|---|---|---|
| 0 | No significant weather | NSW 1, CAVOK 2 |
| 1 | Precipitation | RA, SN, DZ |
| 2 | Obscuration | FG, BR (without precipitation) |
1 NSW—No significant weather. 2 CAVOK—Ceiling and visibility OK, which implies, among other things, NSW.
Table 3. Classification of the ceiling into three categories based on the assumed influence on temperature.
| Category | Name | Example |
|---|---|---|
| 0 | Vertical visibility or clouds under 1000 ft | VV001, BKN003 |
| 1 | Clouds between 1000 and 5000 ft | BKN015, OVC030 |
| 2 | Clouds over 5000 ft or CAVOK | BKN120, CAVOK, NSC |
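The categorical encodings of Tables 2 and 3 can be expressed as simple mapping functions. The sketch below is an illustrative reconstruction, not the study’s implementation: the function names are hypothetical, and any precipitation/obscuration codes beyond the table examples are assumptions.

```python
def encode_wx(wx):
    """Map a METAR present-weather code to the three classes of Table 2.

    0 = no significant weather, 1 = precipitation, 2 = obscuration.
    Precipitation is checked first, so e.g. fog accompanied by rain
    still counts as precipitation (Table 2 excludes it from class 2).
    """
    precipitation = ("RA", "SN", "DZ")   # examples from Table 2
    obscuration = ("FG", "BR")           # examples from Table 2
    if not wx or wx in ("NSW", "CAVOK"):
        return 0
    if any(code in wx for code in precipitation):
        return 1
    if any(code in wx for code in obscuration):
        return 2
    return 0


def encode_ceiling(report, base_ft):
    """Map a METAR cloud/ceiling report to the three classes of Table 3.

    0 = vertical visibility or clouds under 1000 ft,
    1 = clouds between 1000 and 5000 ft,
    2 = clouds over 5000 ft, CAVOK, or NSC.
    """
    if report in ("CAVOK", "NSC"):
        return 2
    if report.startswith("VV") or (base_ft is not None and base_ft < 1000):
        return 0
    if base_ft is not None and base_ft <= 5000:
        return 1
    return 2
```

With this encoding, the examples from the tables map as expected, e.g. `encode_wx("RA")` yields 1 and `encode_ceiling("BKN015", 1500)` yields 1.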
Table 4. ML regression methods and their anticipated value used for initial prediction of the temperature and wind speed.
| Purpose | Method | Anticipated Value |
|---|---|---|
| Regression | Linear | Simplicity, Interpretability |
| | Ada Boost | Ensemble, Adaptive |
| | ElasticNet | Regularization, Flexibility |
| | Gradient Boosting | Robustness, Accuracy |
| | Decision Trees | Intuitive, Nonlinear |
| | Random Forest | Versatile, Robust |
| | K-nearest neighbors | Non-parametric, Locality |
| | Lasso | Sparsity, Feature selection |
| | Bayesian Ridge | Robust, Uncertainty-aware |
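All nine methods in Table 4 are available in scikit-learn, so the benchmark loop can be sketched compactly. The snippet below is a minimal illustration on synthetic data with default hyperparameters; it does not reproduce the study’s features, splits, or tuning.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, ElasticNet, Lasso, BayesianRidge
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, max_error

# The nine regression methods of Table 4 (default settings for illustration).
models = {
    "Linear": LinearRegression(),
    "AdaBoost": AdaBoostRegressor(random_state=0),
    "ElasticNet": ElasticNet(),
    "GB": GradientBoostingRegressor(random_state=0),
    "DT": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
    "Lasso": Lasso(),
    "BR": BayesianRidge(),
}

# Synthetic stand-in for the predictor matrix and target (e.g. temperature).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Fit on the first 150 samples, score MAE and Max Error on the rest,
# mirroring the two metrics reported in Tables 5-8.
scores = {}
for name, model in models.items():
    model.fit(X[:150], y[:150])
    pred = model.predict(X[150:])
    scores[name] = (mean_absolute_error(y[150:], pred), max_error(y[150:], pred))
```

Each entry of `scores` then holds the (MAE, MaxE) pair for one method, which is the layout of Tables 7 and 8.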
Table 5. MAE and Max Error of predicting temperature [°C] by NWP models.
| Predicted | ECMWF MAE | ECMWF MaxE | GFS MAE | GFS MaxE |
|---|---|---|---|---|
| Temp. Štýřice | 1.44 | 8.16 | 3.05 | 14.25 |
| Temp. Hvězdárna | 1.90 | 8.35 | 3.59 | 14.75 |
| Temp. METAR | 0.96 | 6.11 | 2.55 | 12.85 |
Table 6. MAE and Max Error of predicting wind speed [m/s] by NWP models.
| Predicted | ECMWF MAE | ECMWF MaxE | GFS MAE | GFS MaxE |
|---|---|---|---|---|
| Ws Hvězdárna | 2.18 | 9.62 | 4.77 | 20.32 |
| Ws METAR | 3.45 | 21.81 | 2.65 | 17.40 |
Table 7. MAE and MaxE of the selected models tuned with two hyperparameter optimization metrics (MAE and MaxE), predicting temperature at three stations.
| Predicted | Linear MAE | Linear MaxE | GB MAE | GB MaxE | RF MAE | RF MaxE | BR MAE | BR MaxE |
|---|---|---|---|---|---|---|---|---|
| Temp. Štýřice | 1.22 | 6.63 | 1.13 | 7.22 | 1.18 | 6.93 | 1.22 | 6.63 |
| Temp. Hvězdárna | 1.48 | 6.92 | 1.36 | 6.18 | 1.31 | 5.86 | 1.48 | 6.93 |
| Temp. METAR | 0.80 | 5.89 | 0.78 | 6.42 | 0.84 | 6.38 | 0.80 | 5.88 |
Table 8. Comparison of the calibrated models’ MAE and Max Error when predicting wind speed at the Hvězdárna and Brno Airport stations.
| Predicted | Linear MAE | Linear MaxE | GB MAE | GB MaxE | RF MAE | RF MaxE | BR MAE | BR MaxE |
|---|---|---|---|---|---|---|---|---|
| Hve | 0.54 | 4.62 | 0.42 | 3.71 | 0.51 | 4.16 | 0.53 | 4.61 |
| METAR | 0.92 | 4.80 | 0.72 | 4.33 | 0.82 | 4.27 | 0.91 | 4.77 |
Table 9. Number of cases out of 372 test cases in which all models predicted with an error larger than 2 °C (Count), and number of unique errors, i.e., cases in which all models deviated by more than 2 °C at the given station but by less than 2 °C at the others.
| Predicted | Count | Unique Errors |
|---|---|---|
| Temp Štýřice | 20 | 3 |
| Temp Hvězdárna | 21 | 2 |
| Temp METAR | 6 | 0 |
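The counts of Table 9 amount to flagging test cases where every model in the ensemble misses the observation by more than 2 °C. A minimal vectorized sketch of that criterion on synthetic data (the arrays, noise level, and the 8-model ensemble size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
obs = rng.normal(loc=10.0, scale=5.0, size=372)        # observed temperatures, 372 test cases
preds = obs + rng.normal(scale=1.2, size=(8, 372))     # predictions from 8 calibrated models

abs_err = np.abs(preds - obs)                          # shape (models, cases)
all_miss = (abs_err > 2.0).all(axis=0)                 # True where EVERY model is off by >2 °C
count = int(all_miss.sum())                            # the "Count" column of Table 9
```

The “Unique Errors” column additionally requires the analogous `all_miss` masks for the other two stations, keeping only cases flagged at this station and at no other.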
Table 10. Mean Absolute Error (MAE) and Maximum Error for Gradient Boosting Multi-output Temperature Regression.
| Predicted | GB MAE | GB MaxE |
|---|---|---|
| STY | 0.98 | 6.98 |
| Hve | 1.17 | 8.45 |
| METAR | 0.80 | 5.89 |
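Gradient boosting in scikit-learn does not natively handle several targets at once; one way to obtain a multi-output model like that of Table 10 is the `MultiOutputRegressor` wrapper, which fits an independent GB model per station. The data below are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))  # stand-in for the NWP + METAR predictors

# Three correlated temperature targets: Štýřice, Hvězdárna, METAR reference.
Y = np.column_stack([
    X[:, 0] + rng.normal(scale=0.2, size=300),
    X[:, 0] + 0.5 + rng.normal(scale=0.2, size=300),
    X[:, 0] - 0.3 + rng.normal(scale=0.2, size=300),
])

# One GB regressor is fitted per target column.
multi_gb = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
multi_gb.fit(X[:250], Y[:250])
pred = multi_gb.predict(X[250:])   # shape (50, 3): one column per station
```

Because the wrapper trains the targets independently, the per-station errors in Table 10 can be compared directly with the single-output results in Table 7.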
Table 11. Kendall correlations of measured temperatures and their correlation with GB multioutput model and GFS and ECMWF predictors.
| Predicted | Predictions—Corr. | GFS Corr. | ECMWF Corr. |
|---|---|---|---|
| Temp Štýřice | 0.92 | 0.81 | 0.85 |
| Temp Hvězdárna | 0.91 | 0.75 | 0.90 |
| Temp METAR | 0.94 | 0.84 | 0.93 |
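Kendall’s tau, used throughout Figures 5, 6 and 11 and in Table 11, is available in scipy. A minimal sketch on synthetic temperature data (the arrays and noise level are illustrative, not the study’s measurements):

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(2)
observed = rng.normal(loc=15.0, scale=5.0, size=100)      # measured temperature [°C]
predicted = observed + rng.normal(scale=1.0, size=100)    # model output with noise

# Rank-based correlation: robust to monotone, nonlinear relationships,
# which is why it suits mixed meteorological variables better than Pearson.
tau, p_value = kendalltau(observed, predicted)
```

For a full correlation matrix such as Figure 5, `pandas.DataFrame.corr(method="kendall")` applies the same statistic pairwise across all columns.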
Table 12. Accuracy metrics of Random Forest and Gradient Boosting models to classify temperature deviations of the ML models.
| Metrics | GB | RF |
|---|---|---|
| Accuracy | 0.82 | 0.83 |
| Precision (correct) | 0.86 | 0.86 |
| Recall (correct) | 0.96 | 0.96 |
| F1-score (correct) | 0.90 | 0.91 |
| Precision (high error) | 0.26 | 0.25 |
| Recall (high error) | 0.09 | 0.07 |
| F1-score (high error) | 0.14 | 0.11 |
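The per-class rows of Table 12 correspond to computing precision, recall, and F1 twice, once with each class as the positive label. The sketch below illustrates this with scikit-learn on a tiny hand-made label set (the labels are invented for illustration; the class imbalance mimics why the “high error” scores in Table 12 are so low):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# 1 = "correct" prediction (|error| <= 2 °C), 0 = "high error".
# Imbalanced, as in the study: high-error cases are rare.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1]

# Metrics for the minority "high error" class (pos_label=0):
p_high = precision_score(y_true, y_pred, pos_label=0)
r_high = recall_score(y_true, y_pred, pos_label=0)
f_high = f1_score(y_true, y_pred, pos_label=0)
```

With rare positives, precision and recall on the minority class collapse even when overall accuracy stays high, which is exactly the pattern Table 12 shows for the high-error class.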
Table 13. Top 6 features with the highest Feature Importance of the GB and RF models predicting outliers modelled by the GB multioutput model.
| Feature | Importance GB | Importance RF |
|---|---|---|
| Temperature_6C_ECMWF | 0.12 | 0.279 |
| Best (4-layer) lifted index | 0.10 | 0.12 |
| Day_of_year_sin | 0.08 | 0.09 |
| RH at reference | 0.06 | 0.08 |
| Wind speed at the reference | 0.09 | 0.07 |
| RH change | 0.09 | 0.04 |
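Feature importances like those in Table 13 are exposed by scikit-learn tree ensembles through the fitted `feature_importances_` attribute. The following sketch uses synthetic data in which one feature deliberately dominates (everything here is illustrative, not the study’s model):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
# Binary label driven almost entirely by feature 0.
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
importances = clf.feature_importances_       # impurity-based, sums to 1.0
ranking = np.argsort(importances)[::-1]      # feature indices, most important first
```

Sorting the importances descending and keeping the first six entries yields exactly the layout of Table 13.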
Table 14. Proposed process of the data quality label prediction.
| Stage | Data | Name | Methods | Output |
|---|---|---|---|---|
| 1 | Measured data | Exploration | Statistical | Statistical overview |
| 2 | Measured + predictors | Baseline modelling | ML | Baseline models |
| 3 | Baseline models (BM) | Evaluation | Statistical | Final models (FM) + performance |
| 4 | Measured + predictors + BM performance | Tuning | ML | Hyperparameters + predictors |
| 5 | Model predictions | Evaluation 2 | Statistical | Choice of the labelling strategy and expectable accuracy |
| 6 | Statistics of model predictions | MOS | Statistical | Description of the labelling method |
| 7 | ML detection model | Outliers labelling | ML | Predicted labels for measurements |
Sládek, D.; Marková, L.; Talhofer, V. Towards Sustainable Urban Mobility: Leveraging Machine Learning Methods for QA of Meteorological Measurements in the Urban Area. Sustainability 2024, 16, 5713. https://doi.org/10.3390/su16135713
