1. Introduction
Nowadays, the model of a Smart City (SC) goes beyond an urban space where Information and Communication Technologies (ICT) are applied. The goal is to improve the quality and performance of urban services such as transportation, energy, and other infrastructures in order to reduce resource consumption, wastage, and overall costs. SC environments evolve with the application of strategies, resources, and available technologies to improve the quality of life of their citizens and also the operational efficiency of these complex urban systems.
Solid Waste Management (SWM) is one of the main challenges that the SC (and cities, in general) face, especially due to population growth and urbanization. SWM is also a major concern for municipal and national governments in order to protect human health, and to preserve the environment and natural resources. According to recent information from the World Bank [
1], world annual waste generation is expected to increase by 70%, from 2.01 billion tonnes in 2016 (0.74 kilograms per person per day) to 3.40 billion tonnes in 2050. Consequently, there is an urgent need for more efficient solid waste management in cities. This management involves different stages (mainly collection, transport, and treatment), which have a significant impact on the costs and logistics involved.
In this context, governments at different levels (e.g., municipal, regional, or national) need accurate forecasts of waste production in order to develop appropriate policies and provide the corresponding resources. Imprecise predictions may lead to increasing costs of waste infrastructures and the deterioration of services for citizens [
2]. However, this forecasting can be difficult and challenging due to rapidly changing demographic and socio-economic factors [
3].
Solid waste prediction can be conducted at different geographic (building, district, municipal, regional, or national) and temporal (e.g., week, month, or year) levels. Country-level studies use previously collected data on total annual waste quantities, waste types, and/or socioeconomic data, which they often make available to international associations [
4,
5]. The applicability of such predictions depends heavily on model assumptions and the quality of the collected data [
6].
Solid waste clustering enables one to discover similarities and differences in waste management among the analyzed districts, regions, or countries. Moreover, it also allows us to extract the relationships between clusters and socio-economic, demographic, and waste generation characteristics. These inherent structures are difficult to observe in the original datasets because of the multi-dimensional nature of the data [
7].
The main problem addressed in this research is to determine which machine learning models are most adequate for predicting different types of generated waste using socio-economic and demographic data at the country level. The management actions associated with these prediction results can contribute to improving the condition of the environment in the respective countries. Such environmental management policies are among the most efficient instruments for achieving sustainable development.
We investigate how open-access data can be used to analyze waste data. In particular, we apply several Machine Learning (ML) models for solid waste prediction at the country level using demographic and socio-economic data. The basis of our study is open data collected by the Organisation for Economic Co-operation and Development (OECD), an international organization that aims to achieve prosperity and well-being for the people of its member countries through new policies. Therefore, our study is limited to those OECD countries for which such data are available. We found that most related studies are performed at a municipal level, and there is a lack of works comparing related countries (in our case, OECD countries) with respect to predicting different types of solid waste over a long period of time.
Next, we summarize the stages in the proposed experimental method as well as the experiments performed in this work:
1. Data pre-processing: perform some transformations on the original dataset.
   (a) Missing data: replace missing values by linear interpolation of the given data.
   (b) Feature combination: relate the original data to the areas of countries or to the sizes of the respective populations.
2. Co-variance analysis: study dependencies and correlations between the features given in the dataset to select a reduced set of relevant socio-economic features.
3. Data analytics methods: select Machine Learning (ML) regression and clustering algorithms to, respectively, predict annual solid waste production by countries and group countries with similar waste production behavior.
4. Evaluation error metrics: define the metrics used to evaluate the performance of the compared algorithms.
5. Design of experiments: determine the experiments performed in the study.
   (a) Clustering of countries: define groupings according to their waste production and their socio-economic features.
   (b) Predict waste based on socio-economic data: investigate how selected features can be used to predict annual waste production by countries.
   (c) Predict waste using the other waste types: investigate how different waste types can be used to predict other types of waste by countries.
6. Analysis of results: determine the main conclusions of the study and propose future work.
The main contributions of our work are:
- A waste analysis at the OECD country level, which allows us to compare and cluster countries according to similar waste and socio-economic features.
- An in-depth analysis of a set of open demographic and socio-economic data to identify the key features that can be used to train the ML models. In particular, the quality and usefulness of the open data provided are discussed in detail.
- A comparison of different ML methods under the same metrics according to their prediction capabilities.
This paper is organized as follows: First, in
Section 2, we describe the related work.
Section 3 introduces the OECD dataset used, giving an overview of the features involved, the data pre-processing performed, and the evaluation of correlations between features. After that,
Section 4 summarizes the data analytics methods considered in our study—in particular, regression models (Support Vector, Gradient Boosting, and Random Forest) and clustering (
k-means). Additionally, we describe the statistical evaluation metrics used.
Section 5 describes the experimental work performed and analyzes the achieved results. Finally, in
Section 6, we present our conclusions and highlight future research lines.
2. Related Work
Due to the relevance of the considered solid waste prediction problem, there is currently a large body of publications about it [
6,
8,
9]. Many prediction models generally use geographic, demographic, and socio-economic data to estimate future waste production [
9,
10]. On the other hand, some authors also predict using waste data from previous time periods (i.e., by applying time-series analysis) [
11]. Improved prediction for solid waste management (in particular, urban household waste) in developing countries can build on the experience of more developed countries using “process flow diagrams” and “waste aware benchmark indicators” [
12].
According to Beigl and collaborators [
3], solid waste prediction methods can be grouped into the following categories: correlation analysis, group comparison, single regression, multiple regression, time-series analysis, input-output analysis, and system dynamics. Among them, regression analysis techniques are commonly used due to their simplicity. However, regression techniques suffer from lower precision when the data are inaccurate, as well as a lack of adaptability to new situations [
13] and failure to take into account other factors affecting waste generation [
14].
In recent years, there has been increasing interest in the use of ML and Artificial Intelligence (AI) techniques for waste forecasting, since these techniques adapt better and achieve higher prediction performance [
8]. Some of the employed techniques are: Artificial Neural Networks (ANN) [
2,
10,
15,
16], Support Vector Machines (SVM) [
2], Genetic Algorithms (GA) [
17], Expert Systems (ES) [
18], Fuzzy Logic (FL) [
19], and Multilevel Bayesian Framework [
6], among others. ANN and SVM are commonly used to train models for classification and regression tasks. GA and evolutionary algorithms adapt the process of natural selection to obtain optimum results by selecting the best fit data to handle unforeseen conditions. ES simulate expert knowledge and experience in a particular field using a knowledge base and inference rules to reason. FL is a computational approach based on “degrees of truth” that makes it possible to represent and reason with imprecise information. Very recently, motivated by state-of-the-art results in other research areas, some works have explored Deep Learning approaches for the considered problem. For example, Cubillos [
11] investigated the application of Long Short-Term Memory (LSTM) recurrent networks to forecast waste generation in a Danish municipality at a weekly periodicity between 2011 and 2018. Moreover, some survey studies have focused on identifying ML models to predict solid waste generation based on demographic and socioeconomic parameters [
9,
20].
Predicting solid waste generation using machine learning techniques is a challenging problem [
11]. First, the data used have high variability or data gaps. Second, waste data can be subject to high uncertainty due to unpredictable changes in environmental conditions (e.g., weather, economy, pandemic, and …). Third, the small amount of data, as well as its limited quality, makes it difficult to make accurate predictions for a more distant time horizon. Most solid waste prediction works have focused on the municipal scale [
5,
7,
10], but to the best of our knowledge, our work is the first attempt to predict waste production at the scale of country and organization of countries (i.e., OECD) using relatively few data features (including missing data) and few years to predict. Recently, a paper by [
21] examined the link between indicators such as electricity consumption, urbanization, and economic growth and environmental pollution for 25 OECD countries in the 1990–2017 period.
To conclude with respect to the application of Machine Learning (ML) models to solid waste prediction, a recent review by Guo and collaborators [
22] points out that due to the lack of comparative studies between different models, it is not possible to provide clear guidance for follow-up research or practical application. More comprehensive and detailed model evaluation work needs to be conducted.
In unsupervised learning, clustering is one of the most commonly used analysis techniques to gain insights into the structure of data. Clustering can be used in almost every domain, ranging from banking to recommendation engines, computer vision, or document clustering, among many others [
23,
24].
In general, data instances (here the countries) in the same cluster have structurally similar properties. However, data instances from different clusters differ greatly. The main task of clustering is identifying coherent subgroups in data.
Different works on solid waste clustering have been performed at different geographic scales. For example, Guleryuz [
25] compared the waste management performance of 39 districts in Istanbul (Turkey) for the year 2019 by considering the following features: domestic waste, medical waste, population, municipal budget, and mechanical sweeping area. Agovino et al. [
26] analyzed the waste management process by applying cluster analysis to 103 Italian provinces to make suggestions on how to improve waste management activities. Caruso and Gattone [
27] performed waste management analysis in developing countries using the Huang clustering algorithm on mixed data (i.e., both qualitative and quantitative). Many previous works on clustering techniques for solid waste analysis used
k-means clustering [
25,
28]. However, other clustering techniques such as Unsupervised
k-Prototypes Classification [
27], Hierarchical Clustering [
7], Capacitated clustering [
29], or Spatial Clusters [
26], were also applied to the problem.
To conclude with respect to the application of clustering models to waste data, there are very few works, most based on k-means, which have been applied to different geographical scales (i.e., districts, provinces, or countries).
Finally, it is worth mentioning that in most analyzed works the authors use their own datasets, which are not always publicly available. As pointed out previously, we have not found related works that address the problems considered in this paper at the level of OECD countries using the same dataset. Consequently, the results of other works cannot be used to benchmark the prediction methods proposed in this paper.
3. OECD Dataset
As already mentioned, our analysis of waste generation in various countries is based exclusively on the open data provided by the OECD, which can be accessed on their website
https://stats.oecd.org/ (accessed on 10 January 2021). The data are available in CSV format and can be easily further processed.
3.1. Overview
The data cover 28 years per country (1990–2017) and give manifold information about 43 OECD-related countries. Thus, in total, we consider 1204 different data instances (i.e., one instance per year between 1990 and 2017 for each country, or 43 × 28). The countries under consideration differ greatly in terms of their economic strength, infrastructure, and age structure. For instance, the dataset contains data on countries as diverse as India, USA, China, Iceland, Turkey, or Estonia.
In particular, for each country, the following socio-economic features are given, which can be structured into four categories:
Geographical data: the analysis of waste must take into account the geographical characteristics of countries, as there are large differences between OECD countries that could affect waste generation. Geographical features we consider are: the geographical area of the country (AREA) and the proportion of the built-up area (BUILT). These values may characterize the population density, industrialization, and urbanization, which might affect waste production.
Demographic data: these data characterize the population of a country, which might have a major impact on waste generation. The data include the total population (POPULATION), the percentage of the population without secondary education (BELOW_SCND), and the age distribution. In particular, the number of people in certain age ranges is considered (under 20 years of age; between 50 and 65; between 65 and 85; older than 85 years). Data such as education level and age distribution might correlate with the environmental awareness in a country, which strengthens waste prevention.
Economic data: economic indicators can be used to classify a country’s economic strength. It can be assumed that the economic development of a country might have an impact on its waste production. The OECD dataset contains the per capita income (INCOME) and the median of the income (MEDIAN_INCOME), which is the income that is exceeded by 50% and not reached by 50% of the population.
Table 1 shows the features of the OECD dataset for some of the countries in the years 2016 and 2017.
Furthermore, the OECD dataset offers rather detailed information about the waste quantities in each country. In our study, we have considered six different types of waste collected by the OECD: (i) MUNICIPAL: municipal waste in general, originating from commerce and trade, small businesses, office buildings, private households, and public institutions; (ii) HOUSEHOLD: waste from households such as bulky waste, yard waste, and content of litter containers; (iii) RECOVERED: waste that has been recovered; (iv) RECYCLED: waste that has been processed for reuse; (v) COMPOSTED: recycled organic waste; and (vi) DISPOSAL: non-recyclable waste that must be disposed of.
The waste data provided form the target data for our analysis. An example for some countries is given in
Table 2. Note that there are some gaps in the data, because the OECD dataset is not complete. We discuss how to deal with missing values in the following section.
3.2. Data Quality and Data Pre-Processing
The OECD data are spread over different files, which must first be combined into a single data table, e.g., a CSV file. Each table row then contains the socio-economic and waste data for a particular country in a specific year, i.e., the two columns COUNTRY and YEAR specify the context of the given data.
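As an illustration, this combination step can be sketched with pandas. This is a minimal sketch and not necessarily the procedure used in the study: the two small tables, their values, and the helper name `combine` are hypothetical, and we assume that every source table carries the COUNTRY and YEAR columns.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for tables loaded from separate OECD CSV files
# (in practice each would come from pd.read_csv); all values are made up.
socio = pd.DataFrame({
    "COUNTRY": ["Estonia", "Estonia", "Iceland"],
    "YEAR": [2016, 2017, 2017],
    "POPULATION": [1.30, 1.30, 0.30],
})
waste = pd.DataFrame({
    "COUNTRY": ["Estonia", "Iceland", "Iceland"],
    "YEAR": [2017, 2016, 2017],
    "MUNICIPAL": [500.0, 200.0, 210.0],
})


def combine(tables):
    """Outer-join all tables on (COUNTRY, YEAR) so that no country-year row is lost."""
    merged = reduce(
        lambda left, right: left.merge(right, on=["COUNTRY", "YEAR"], how="outer"),
        tables,
    )
    return merged.sort_values(["COUNTRY", "YEAR"]).reset_index(drop=True)


combined = combine([socio, waste])
print(combined)
```

An outer join is chosen here so that country-year pairs present in only one file are kept; the resulting gaps appear as missing values and are handled in the pre-processing described next.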
Note that the COUNTRY column plays a crucial role in further data analysis, as it allows data from different countries to be separated and distinguished. This is of crucial importance, as it can be assumed that waste production follows country-specific patterns. Since most analysis methods cannot deal with categorical values, we encoded the 43 different country names using one-hot encoding, which creates one binary attribute for each of the different countries. In this way, we take into account the knowledge of which data belong to the same country.
Technically, all data processing was performed with pandas, NumPy, and Scikit-learn. For the one-hot encoding, the Scikit-learn OneHotEncoder was used.
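The one-hot encoding of the COUNTRY column can be sketched as follows. This is a minimal example on a made-up four-row table; the column prefix `COUNTRY_` and the variable names are our own choices for illustration, not taken from the study.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up example rows; the real table has one row per country and year.
df = pd.DataFrame({
    "COUNTRY": ["Estonia", "Iceland", "Turkey", "Estonia"],
    "YEAR": [2016, 2016, 2016, 2017],
})

# OneHotEncoder returns a sparse matrix by default, so convert it to a dense array.
encoder = OneHotEncoder()
onehot = encoder.fit_transform(df[["COUNTRY"]]).toarray()

# One binary column per distinct country, named after the learned categories.
columns = [f"COUNTRY_{name}" for name in encoder.categories_[0]]
encoded = pd.concat(
    [df[["YEAR"]], pd.DataFrame(onehot, columns=columns, index=df.index)],
    axis=1,
)
print(encoded)
```

Each row has exactly one country indicator set to 1, so rows belonging to the same country share the same binary pattern across all years.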
3.2.1. Missing Data
A major issue with the OECD dataset is data quality, particularly the large amount of missing data. Because the total amount of the given OECD data is relatively small, data rows with missing features cannot simply be discarded. To solve this problem, we replaced missing values by linear interpolation of the given data. Several observations can be made:
For some features, OECD data are only available for a few years. For instance, for all countries, the portion of built-up area (BUILT) was collected only for three years (1990, 2004, 2014).
Fortunately, this feature changes only slowly, and missing gaps can be filled easily by linear interpolation.
Other features are only recorded from a certain year onward. For instance, the population count (POPULATION) and all data on the age structure of a country are only available from 2005 onward. Because we do not have starting values from the year 1990, it is more difficult to derive earlier data.
Here, we need to extrapolate the data to avoid the dataset becoming too small. Note that extrapolation involves greater uncertainty and carries higher risk of producing meaningless results.
Sometimes the data of a certain feature are completely missing for a particular country. Then, we have to decide whether to drop the feature entirely or to exclude this country from the analysis due to the missing data. An example is Costa Rica, for which no income data are available. The same problem arises when too few data of a feature are available; then, data interpolation cannot provide meaningful data.
Because this is the case with the feature ’median income’, we have omitted these data.
The OECD does not provide data of all specific waste types for each of the countries.
Table 3 shows for each specific waste type how many of the 43 countries have no data. For some types of waste, especially for household waste and compost waste, data are available only for a smaller number of countries. In particular, there is a lack of data for countries such as India, Indonesia, and New Zealand, but also for China, Canada, and Russia.
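The two gap-filling cases discussed above (interior gaps versus values missing before the first recorded year) can be sketched with pandas and NumPy. This is a minimal sketch under our own assumptions: the text does not state how the extrapolation was computed, so the least-squares line used here is only one plausible choice, and the helper names and example values are hypothetical.

```python
import numpy as np
import pandas as pd


def interpolate_inside(values: pd.Series) -> pd.Series:
    """Fill only gaps that lie between two known values, linearly."""
    return values.interpolate(method="linear", limit_area="inside")


def extrapolate_linear(years: pd.Series, values: pd.Series) -> pd.Series:
    """Fill remaining gaps from a straight line fitted to the known points.

    Assumption: a least-squares line over all known years; the source does
    not specify the extrapolation method.
    """
    known = values.notna()
    if known.sum() < 2:
        return values  # too few points to fit a line
    slope, intercept = np.polyfit(
        years[known].astype(float), values[known].astype(float), deg=1
    )
    fitted = slope * years + intercept
    return values.where(known, fitted)


# Made-up data for one country: BUILT has interior gaps,
# POPULATION is missing for the first years.
df = pd.DataFrame({
    "COUNTRY": ["A"] * 6,
    "YEAR": [2000, 2001, 2002, 2003, 2004, 2005],
    "BUILT": [1.0, np.nan, np.nan, 4.0, np.nan, 6.0],
    "POPULATION": [np.nan, np.nan, 10.0, 12.0, 14.0, 16.0],
})

# Interpolate per country so that values never leak across countries.
df["BUILT"] = df.groupby("COUNTRY")["BUILT"].transform(interpolate_inside)
for _, group in df.groupby("COUNTRY"):
    df.loc[group.index, "POPULATION"] = extrapolate_linear(
        group["YEAR"], group["POPULATION"]
    )
print(df)
```

Keeping interpolation and extrapolation in separate helpers makes it explicit which values rest on the riskier extrapolation step, so they can be flagged or excluded later if needed.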
3.2.2. Feature Combination
To make countries of different sizes comparable, the data must be related to the area of the country, or the size of the population. Therefore, we calculated for each waste type the waste production per capita. Similarly, the age structure should be related to the number of citizens in a country, i.e., we calculated the percentage of the respective age groups in the total population. Finally, we derived the absolute area in a country which contains any kind of buildings (BUILTAREA) by just multiplying the AREA and the BUILT features.
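These derived features can be sketched in a few lines of pandas. The example is illustrative only: the values and the derived column names (`*_PER_CAPITA`, `*_PCT`) are hypothetical, and we assume that BUILT is stored as a fraction of the total area; if it is stored as a percentage, the product must be divided by 100.

```python
import pandas as pd

# Made-up values for two fictitious countries.
df = pd.DataFrame({
    "COUNTRY": ["A", "B"],
    "AREA": [1000.0, 500.0],                # total area (unit assumed)
    "BUILT": [0.02, 0.10],                  # assumed fraction of built-up area
    "POPULATION": [2_000_000, 500_000],
    "UNDER_20": [500_000, 100_000],         # absolute head count
    "MUNICIPAL": [1_000_000.0, 300_000.0],  # annual waste (unit assumed)
})

WASTE_TYPES = ["MUNICIPAL"]  # extend with the other waste columns
AGE_GROUPS = ["UNDER_20"]    # extend with the other age columns


def combine_features(data: pd.DataFrame) -> pd.DataFrame:
    """Derive size-independent features so that countries become comparable."""
    out = data.copy()
    for waste in WASTE_TYPES:
        out[f"{waste}_PER_CAPITA"] = out[waste] / out["POPULATION"]
    for age in AGE_GROUPS:
        out[f"{age}_PCT"] = 100.0 * out[age] / out["POPULATION"]
    # Absolute built-up area = total area times built-up share.
    out["BUILTAREA"] = out["AREA"] * out["BUILT"]
    return out


features = combine_features(df)
print(features)
```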
3.3. Co-Variance Analysis
In this subsection, we take a closer look at the dependencies and correlations between the various features given in the dataset.
Figure 1 shows a heat map with the correlations between selected socio-economic features.
We can observe moderate correlations between the basic features: area, population, and built area. Not very surprisingly, there are also higher correlations between the features regarding the age structure.
It can also be observed that POPULATION and BUILTAREA correlate with the age structure. More precisely, POPULATION is strongly correlated with UNDER_20, OVER_50, and OVER_65, and BUILTAREA is correlated strongly with OVER_85. This holds because all these features correspond to the size of a country. If the population or the built-up area of a country is large, then the absolute number of people of a certain age is also large.
Income and secondary education have a moderate negative correlation and are not strongly correlated with any of the other values.
The correlations between different waste types are shown in
Figure 2.
Two different classes of waste types can be distinguished. On the one hand, waste production can be associated with municipal, household, and disposal waste. By contrast, recovered, recycled, and compost waste correspond to waste reuse. The heat map shows that the correlations within these two clusters are very high. In addition, municipal waste, which comprises total waste, is highly correlated with all other waste types.
Finally, we can calculate the correlations between socioeconomic features and waste types. As can be seen in
Figure 3, some of the socio-economic features correlate very strongly with waste production. In particular, built area seems to have a high impact on all waste types. In addition, population size has a rather strong impact on the waste production. On the other hand, income does not seem to have much influence on the waste numbers.
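Correlation matrices such as those underlying the heat maps can be computed directly with pandas. The sketch below uses synthetic data and assumes the Pearson coefficient, which the text does not state explicitly; the generated numbers are only meant to mimic a feature that drives waste production (BUILTAREA) and one that does not (INCOME).

```python
import numpy as np
import pandas as pd

# Synthetic data: waste depends on BUILTAREA plus noise, INCOME is independent.
rng = np.random.default_rng(0)
n = 50
builtarea = rng.uniform(10, 100, n)
df = pd.DataFrame({
    "BUILTAREA": builtarea,
    "INCOME": rng.uniform(10, 60, n),
    "MUNICIPAL": 3.0 * builtarea + rng.normal(0, 5, n),
    "RECYCLED": 1.0 * builtarea + rng.normal(0, 5, n),
})

features = ["BUILTAREA", "INCOME"]
wastes = ["MUNICIPAL", "RECYCLED"]

# Full correlation matrix (assumed Pearson), then the feature-versus-waste block.
corr = df.corr(method="pearson")
cross = corr.loc[features, wastes]
print(cross.round(2))
```

The `cross` block corresponds to the kind of feature-versus-waste view discussed above, and either matrix can be passed to any heat-map plotting routine.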
6. Conclusions
In this work, we investigated the potential of using open data to analyze waste. Multiple experiments were conducted to predict different types of solid waste for OECD countries from 1990 to 2017 using different ML models. Furthermore, cluster analysis was applied to determine similarities among groups of countries presenting similar waste production and recycling characteristics.
The main contributions of this work are: (1) a waste analysis at the OECD country-level, which allows one to compare and cluster countries according to similar waste and socio-economic features; (2) the in-depth analysis of a set of open demographic and socio-economic data to identify the key features that can be used to train ML models (in particular, the quality and usefulness of the open data provided is discussed in detail); and (3) a comparison of different ML regression methods under the same metrics according to their prediction capabilities.
In the performed experiments, we addressed the following goals: clustering of countries based on their socio-economic features as well as their waste production; predicting the waste data from socio-economic data applying ML methods; comparing the performance of the different ML methods and fine-tuning the prediction models by optimizing the model hyperparameters; and analyzing the importance of the particular features. The experiments we conducted have shown that ML methods such as Random Forest Regressor (RFR) can provide very accurate results for waste prediction even based on open data.
The main limitation of our approach is the availability and quality of the available data. As explained in detail in
Section 3 and
Section 6, the OECD dataset has several shortcomings. On the one hand, plenty of data was missing and was imputed to fill the gaps. Of course, more realistic results could be achieved with a complete set of data. In addition, for some of the OECD countries, there are irregular patterns with large discontinuities in the data as shown in
Figure 4 for the recycled waste in Lithuania. Such large jumps in the data do not correspond to the real situation and cannot be accurately predicted. Furthermore, waste analysis could provide more insights if more socio-economic features were available, which unfortunately are not included in the OECD dataset. For instance, it would be interesting to know about the size of households or the proportion of people living in rural or urban areas or climatic conditions. More detailed data from smaller and more homogeneous neighborhoods could also provide more insights into the key factors influencing waste generation.
Future research could test more advanced ML algorithms based on Deep Learning, in particular networks based on Long Short-Term Memory (LSTM) models, which have produced very good results in predictive problems such as handwritten text recognition or pedestrian trajectory prediction. It is also worth noting that the performance of the considered prediction algorithms depends on the parameter settings of the models. Consequently, it may be worthwhile to optimize these model parameters using heuristic techniques such as Genetic Algorithms or Particle Swarm Optimization.