Forecasting Population Migration in Small Settlements Using Generative Models under Conditions of Data Scarcity

Zakharov, Kirill; Aghajanyan, Albert; Kovantsev, Anton; Boukhanovsky, Alexander

doi:10.3390/smartcities7050097


Research Center “Strong Artificial Intelligence in Industry”, ITMO University, 199034 Saint Petersburg, Russia

* Author to whom correspondence should be addressed.

Smart Cities 2024, 7(5), 2495-2513; https://doi.org/10.3390/smartcities7050097

Submission received: 9 July 2024 / Revised: 22 August 2024 / Accepted: 28 August 2024 / Published: 3 September 2024


Highlights

What are the main findings?

We have proposed a method for creating synthetic data in the context of the task of predicting migration balances.
We investigated the impact of social and economic variables on migration patterns, which vary significantly between small villages and large cities and urban areas, and developed machine learning models to forecast migration balances based on these variables.

What are the implications of the main findings?

Our findings can be utilized in the optimal design of project compositions for the transformation of a territory, including determining the placement of infrastructure, altering migration dynamics, and fine-tuning the functional purpose of specific areas in order to attain desired, sustainable increases in the levels of certain social and economic development indicators.
Our findings also include a value-based approach, applied within the context of predicting social risks arising from conflicts between the goals and needs of a population and plans for the arrangement of their environment.

Abstract

Today, the problem of predicting population migration is essential in the concept of smart cities for the proper development planning of certain regions of the country, as well as their financing and landscaping. In dealing with population migration in small settlements whose population is below 100,000, data collection is challenging. In countries where data collection is not well developed, most of the available data in open access are presented as part of textual reports issued by authorities in municipal districts. Therefore, the creation of a more or less adequate dataset requires significant efforts, and despite these efforts, the outcome is far from ideal. However, for large cities, there are typically aggregated databases maintained by authorities. We used them to find out what factors had an impact on the number of people who arrived in or departed from the city. Then, we reviewed several dozen documents to mine the data of small settlements. These data were not sufficient to solve machine learning tasks, but they were used as the basis for creating a synthetic sample for model fitting. We found that a combination of two models, each trained on synthetic data, performed better than a single regressor. A binary classifier predicted the migration direction and a regressor estimated the number of migrants. Lastly, the model fitted with synthetic data was applied to the other set of real data, and we obtained good results, which are presented in this paper.

Keywords:

migration forecasting; small settlements; synthetic data; collecting data; machine learning for migration

1. Introduction

As a concept that integrates information and communication technology, a smart city brings together the latest technological advancements in engineering, urban planning, and economic theory, among others, which are further enhanced by modern artificial intelligence (AI) technologies. In this field, the concept of smart cities is often seen as a strategic approach that aims to improve overall efficiency and effectiveness. Governments and public agencies are adopting this approach in order to differentiate their policies and programs and achieve sustainable development, promote economic growth, and improve the quality of life of their citizens [1].

The main objectives of smart cities are to ensure optimal resource utilization, promote renewable energy sources, enhance safety and environmental quality, optimize the efficient use of human resources, and leverage AI to analyze large volumes of data and control the city infrastructure [2,3,4]. The main driver for development is the adoption of IoT technologies in the form of a range of sensors and transmitters, which collect big data on the environment around them. Then, these data are transmitted to central servers. Once there, the data are analyzed, and machine learning models are developed to solve specific problems [3,5].

The future development of smart cities will be focused on the complete digital transformation of the economy [5,6]. This will involve the use of generative planning models to create urban landscapes [7,8] as well as the optimization of logistics supply chains [9]. Additionally, computer vision technologies will enhance citizen integration with the city [4,10], while natural language models will facilitate communication between citizens and the city’s systems [11].

The problem of forecasting population migration is a crucial aspect of modern city government [12,13]. Understanding the movement of citizens between different locations or groups of locations is essential for the planning of budgets in specific regions. Budgeting can be employed to facilitate the advancement and enhancement of settlements in terms of transportation, landscaping, economic conditions, educational institutions, and other aspects [14].

In the field of migration flow modeling, collecting data on small towns can be challenging due to the unique geographical features and complex data administration systems in these communities. In dealing with forecasting population migration between different regions, we have faced a problem of data scarcity for settlements and territories with small populations [15].

The number of residents below which a settlement is considered small varies from country to country [16,17]. Many countries use a certain population threshold to define urban areas [18]. This threshold can vary significantly, ranging from 200 people (as in Denmark) to 100,000 people (China). Other countries use thresholds of 2000, 5000, and 50,000 [19]. In our framework, we considered settlements with a population below 100,000 people. In Russia, the population of a regional center is on average larger than this number, while the population of municipal district centers and other cities is smaller. As more search filters are applied to the data, the volume of data available for training unfortunately becomes insufficient.

At the same time, for local development planning, migration details down to the level of a small town or district are crucial. Most statistics regarding these settlements are not publicly available. Only large cities and regions publish their data, and very few small towns provide the necessary information.

This paper contributes to improving data-driven migration forecasting by means of generative AI to create a synthetic dataset when real-world data are scarce (e.g., for small towns and sparsely populated regions). For this purpose, we found that an ensemble of two models trained on two synthetic datasets predicts the migration balance with real-world data better than a single regressor. The regression in the ensemble is for the migration flow forecast, and the binary classification is to specify its direction. We trained the generative model to augment the initial small dataset for classification and regression problems separately. Then, we used an additional generator, which was modified using the regression loss function and thus helped to improve the synthetic dataset. Finally, we trained the forecasting models to predict the migration balance. Our research may be useful for direct migration forecasting as it facilitates the assessment of features that have an influence on the migration direction and extent.

The remainder of this paper is organized as follows: in the next section, we provide a literature review on the topic under consideration; in Section 3, we analyze the real-world data and identify key features for migration forecasting; in Section 4, we propose the methodology; in Section 5, we formulate the problem and the synthetic approach to augment the initial small cities data; the remaining sections are devoted to the experimental study and results.

2. Related Work

2.1. Migration Forecasting

A significant amount of attention in demographic research is focused on the problem of migration [14,20,21]. Such interest in migration is caused by the fact that this process has a high impact on various fields [14,21]. In addition to the typical labor force displacement caused by low-paid and unattractive jobs in host regions, migration can significantly impact a region’s development in terms of “brain gain” or “brain drain” [21] and natural population change (positive or negative) [14,20].

It is worth noting that the issue under consideration is significant not only in terms of international migration but also in terms of domestic migration [22]. Migration can be used as a reflection of the social and economic conditions of certain regions. Therefore, an accurate estimation of migration can also be beneficial for the development of new economic policies. Thus, the development of methods to manage migration flows is highly relevant. The success of any approach involving this issue will depend heavily on the management of key aspects of the problem.

The problem of identifying the main set of factors that motivate people to relocate is in the spotlight in the academic community [23]. Nevertheless, despite the widespread discussions on a certain set of factors, the classification of migration factors based on the push-and-pull approach (proposed by Everette Lee) is commonly used in various studies [20,21]. According to this approach, all relevant factors can be divided into two categories: those that act as push factors, motivating individuals to leave their current location, and those that serve as pull factors, attracting individuals to move to specific destinations. Push factors include economic (unemployment, low welfare), political (lack of freedom, persecution), environmental (harsh climate, pollution), social (discrimination, undesirable lifestyle model), etc. [20,21]. By contrast, pull factors should provide far greater opportunities in terms of education, career, and quality of life when compared to the origin of the individual.

It is challenging for demographic researchers to accurately determine the actual numbers of migration flows [24,25]. Many countries use their own criteria to define the migrant status, e.g., based on different time ranges [24]. Moreover, in most countries, the fact of migration is officially recognized only when a person changes their so-called “registration” (each resident is registered at their home address) [25]. Since most people do not change their registration, because they often rent an apartment, there is a huge difference between the official place of residence and the actual one. This makes research aimed at finding alternative sources for determining the fact of migration highly relevant, e.g., using Google trends analysis [24,25].

Migration balance, which is the target variable for our models, strongly depends on many social and economic factors. However, these factors have a high level of volatility, which makes it difficult to develop a robust forecasting model. When we treated the history of migration as a time series, the models could not identify any stable patterns or significant autocorrelations. That is why in our research, we investigate the influence of factors instead of considering the migration balance as a time series. Therefore, models such as gray models [26], BP [27], or autoregressive models [28] are not suitable for this problem.

Other approaches include the Bayesian approach to forecasting populations [29]. This approach incorporates Lee–Carter-type models [30] for predicting age patterns, along with the associated measures of uncertainty, for fertility, mortality, immigration, and emigration, within a cohort projection framework. Some studies utilize random forest [31,32], XGBoost [33], and regression methods [34]. Specifically, the authors of [34] considered the challenges and appropriate methods for small-area population forecasting. However, in our research, we constructed a model to predict the migration balance, which differs from population forecasting. Due to difficulties in data collection and analysis, there is a lack of research in this area.

2.2. The Use of Synthetic Data

One potential solution to the lack of historical data is the use of synthetic data [35,36]. Synthetic data allow for the augmentation of the original dataset by adding new instances that are similar in distribution to the original one. Real-world data may not be representative of all possible situations, which can lead to biased or limited generalizations. Synthetic data can provide a wider range of scenarios, allowing the model to learn how to handle a broader variety of situations and inputs.

There are three types of synthetic data: fully synthetic (where all original data are replaced with synthetic data), partially synthetic (where some attributes are replaced with synthetic features, particularly those that are sensitive), and hybrid synthetic (where some records from the original dataset are replaced with synthetic entries or additional synthetic instances are added). In our study, we used the hybrid type.

In addition, synthetic data can be used to hide confidential information. Researchers have shown that by combining features, the so-called “quasi-identifiers”, it is possible to reconstruct the values of certain sensitive attributes [37,38]. Thus, by substituting source data with synthetic data, the overall level of data privacy in machine learning models can be enhanced [39]. Another common use of synthetic data is filtering, such as removing noise or outliers from the data [40,41].

Several studies have also highlighted approaches for using synthetic data in forecasting problems. For instance, the authors of [42] used synthetic data to predict protein function. Specifically, the WGAN method and two-sample classifier test were employed. The researchers’ findings demonstrated a significant improvement in forecasting accuracy. Similar successes have been achieved in [43] for electric vehicles’ demand prediction and in [44] for COVID-19 epidemic forecasting. Some studies also use generation techniques in the problem of migration flows [45,46].

2.3. Generative Methods

There have been numerous studies conducted on the topic of data generation [47,48]. Some of the most prominent examples of generative models include generative adversarial networks (GANs) [49,50,51], variational autoencoders (VAEs) [52], normalizing flows [53], and diffusion models [54]. In this article, we used a state-of-the-art approach to tabular data generation called CTGAN [55]. It handles non-Gaussian and multimodal continuous features and imbalanced discrete columns. The authors proposed a methodology of learning by samples, where at each iteration, a categorical attribute and its corresponding value were selected. Subsequently, all possible combinations of indices and values of categorical features with this index were sorted out, ensuring uniform learning across all possible columns. It is also worth noting that the TVAE model [55] employs the same methodology as CTGAN but is based on the VAE [52] principle.

Several papers are dedicated to generating data for small datasets that use Mega-Trend-Diffusion [49,56]. This approach is based on determining the acceptable range of values for dataset attributes for sampling purposes and then generating synthetic records. The generation process involves selecting points from the acceptable range following a uniform distribution. The acceptance of a new record is determined based on a triangular membership function, which takes the selected point as input and returns a probability value [57]. For this, a random variable must be generated with a uniform distribution in the range $[0, 1]$, and the data point is accepted if the value of the membership function at that point is greater than the generated random variable. Such approaches show good results in terms of training time and different kinds of metrics, such as pairwise correlation distance, NNDR, and DCR [49], among others.
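The accept/reject step described above can be sketched in a few lines. This is only an illustration of the acceptance rule, not the full Mega-Trend-Diffusion procedure: the bounds `low` and `high` and the peak `center` are assumed to be given (in the cited works they are estimated from the small sample), and the function names are hypothetical.

```python
import numpy as np

def triangular_membership(x, low, center, high):
    """Triangular membership: 0 at the bounds, 1 at the center."""
    x = np.asarray(x, dtype=float)
    left = (x - low) / (center - low)
    right = (high - x) / (high - center)
    return np.clip(np.minimum(left, right), 0.0, 1.0)

def sample_attribute(n, low, center, high, rng):
    """Draw n values by uniform proposal plus membership-based acceptance."""
    accepted = []
    while len(accepted) < n:
        candidate = rng.uniform(low, high)   # point from the acceptable range
        u = rng.uniform(0.0, 1.0)            # uniform random variable in [0, 1]
        if triangular_membership(candidate, low, center, high) > u:
            accepted.append(candidate)
    return np.array(accepted)

rng = np.random.default_rng(0)
# Assumed bounds and peak for a single attribute (illustrative numbers only)
values = sample_attribute(500, low=10.0, center=25.0, high=60.0, rng=rng)
```

Accepted values concentrate around `center`, because points near the bounds have a membership value close to zero and are rarely accepted.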

Another popular approach for generating new instances or balancing the labels in a classification problem is the synthetic minority oversampling technique (SMOTE) [58]. This simple but effective technique generates new points along the direction from the original observation to one of its neighboring points, with the step scaled by a random uniform factor in the range $[0, 1]$. While SMOTE addresses the issue of information gaps in the minority class, it does not account for the distribution of the entire set of minority samples.
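A minimal sketch of the SMOTE interpolation step is given below, assuming Euclidean distance and a purely numeric minority sample. The function name `smote_like` and the toy points are illustrative and do not come from the reference implementation.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, rng=None):
    """Create n_new synthetic minority points by interpolating toward neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))         # original observation
        j = neighbors[i, rng.integers(k)]    # one of its k nearest neighbors
        lam = rng.uniform(0.0, 1.0)          # uniform factor in [0, 1]
        synthetic[s] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic

# Toy minority sample: the four corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like(X_min, n_new=10, k=2, rng=np.random.default_rng(1))
```

Every synthetic point lies on a segment between two existing minority points, which is why the technique fills local gaps but does not model the overall shape of the minority distribution.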

3. Initial Dataset and Feature Selection

The nature of the problem requires the collection and processing of data that contain important characteristics of cities. The most complete information in this area can be obtained from the official databases of the government statistics service. Alternative sources, e.g., websites, social media posts, or city administration services, can provide some of the necessary data, and collecting data from them should be a priority, as data from different sources make the results of the analysis more trustworthy. In most cases, however, the official source has almost no alternatives, since only the authorities of a given city know the real values of key statistical indicators and pass them on to federal services.

The real-world data collection method that we use in our research involves the collection of a large volume of data from each city in the following categories: demography, living standards and consumption, and production. Each of these categories has a certain number of features (attributes). For example, the category “living standards” includes characteristics such as retail turnover, living area per capita, the number of doctors per person, etc. The category “production” includes data on the volume of annual construction, the number of apartments built, etc. Since in many studies the push factors are of a socioeconomic nature [14,20,21,23], these data can be considered a basis for migration forecasting. In addition, the category “demography” includes an indicator of net migration, which greatly simplifies the task, because most of the necessary data for developing a predictive model are available in one collection. For a detailed description of the collected data, see Section 6.1.

Figure 1 presents the schema of the data that we generated for our experiments. Their sources are the published reports of the municipal authorities of several towns and districts in Russia. For other countries and regions, similar factors could be available in the statistical reports or even in some global data collection databases like the Atlas of Urban Expansion (AUE) and the Global Municipal Database (GMD) (https://globalmunicipaldatabase-guo-un-habitat.hub.arcgis.com/, accessed on 20 August 2024).

This list of indicators, of course, is not exhaustive. It is used for research purposes. In practice, it can certainly be expanded significantly, depending on the availability of real-world data and the peculiarities of certain countries and districts.

Table 1 shows the key statistics of all attributes from the dataset for small cities. We provided an explanation of these attributes and specified the units of measurement in parentheses. Statistics are included as minimum, maximum, and median values.

However, the collection process has some drawbacks. The main issue is related to the fact that this collection represents only cities with a population of over 100 thousand. On the one hand, this means that the dataset for future processing will be smaller than expected since the majority of cities do not fit this condition. On the other hand, this limitation prompted us to look for some solutions for working with small settlements to fully achieve the goal of the study, which was to forecast migration flows in any settlement, even small ones. It should be noted that data on net migration was only available from 2010 onwards. This also reduced the total number of examples in the dataset.

As a result of data processing, 13 socioeconomic features were selected, and the total number of examples was more than 2000. One of the critical steps after selecting features is to evaluate their relative importance. Decision trees can be used to perform an initial study to evaluate the predictive ability. This approach is known to be easy to understand and fast in terms of the training process, making it well suited for such an investigation. Since the collected dataset was not a large one for machine learning (little more than 2000 examples), it was sensible to perform a series of training cycles to obtain an average evaluation on a testing sample.

We used the feature importance analysis to identify the most significant attributes. We utilized the Shapley value-based feature importance algorithm. The Shapley value (SHAP value) for a specific feature is the difference between the expected model output and the partial dependence plot at the feature’s value [59]. The sum of the SHAP values for all input features will always be equal to the difference between the baseline output of the model and the actual output of the current model for the given prediction [60]. In Figure 2 the X-axis shows the average absolute SHAP values for all attributes in our dataset, and the Y-axis shows the attributes’ names; the greater the absolute value, the greater the impact this attribute has on the model’s output. We sorted all values in order of non-increasing significance of the features.
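The additivity property mentioned above can be checked on a toy model. The sketch below computes exact Shapley values by enumerating feature subsets. It is a simplified stand-in for the SHAP algorithm used in the study: "absent" features are replaced by a single baseline vector rather than averaged over the data, and the model `f` is hypothetical.

```python
from itertools import combinations
from math import factorial
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x; absent features take baseline values."""
    n = len(x)
    phi = np.zeros(n)

    def value(subset):
        z = baseline.copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]                      # features in the coalition keep their values
        return f(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

def f(z):
    # Hypothetical model with one interaction term
    return 2.0 * z[0] - z[1] + 0.5 * z[0] * z[2]

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
phi = shapley_values(f, x, baseline)         # -> [2.75, -2.0, 0.75]
```

Here `phi.sum()` equals `f(x) - f(baseline) = 1.5`, and the interaction term is split equally between the two features involved. Enumeration is exponential in the number of features, so it only serves to illustrate the definition.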

Figure 2 shows that the most significant features are the commissioning of residential buildings (consnewareas), the average salary (avgsalary), retail turnover, the longitude (lon), and the volume of annual construction (conscap). While the importance of the average salary is beyond any doubt, because the expected income is definitely a crucial factor in the decision to move [23], the significance of the construction features is not so clear at first glance.

The importance of the commissioning of residential buildings, as well as other features from the construction category, is directly related to the approach for defining the fact of migration. As we have already mentioned, the official change in the place of residence (the fact of migration) for any citizen occurs only when a person changes their registration. Thus, when a person buys a new house or apartment, his or her actual place of residence will be officially recorded (due to new registration), which does not mean that the person has moved recently, because of the possibility of renting an apartment. Thus, in such a system, the housing market is part of the identification of the actual place of residence. The other important factor could be related to the use of foreigners in construction. This means that settlements with a high volume of construction will experience an increase in international migration.

When constructing a model for small cities, the most significant features are the annual volume of investment (invests), the population size (popsize), and the annual industrial capacity (factoriescap). All signs point to the intuitive understanding that it is necessary to invest money in cities in order to attract more people.

Figure 3 shows the correlational dependencies between the features. We note that retail turnover has the most significant influence on the dependent variable, migration balance (saldo), for large cities, while investments have little impact, as in large cities, investments are always at a high level, and the focus is shifting toward improving the city and housing sector. For small cities, we note the negative effect of investment amount on migration balance. This effect can be explained by the fact that investments are measured over the past year. Therefore, in reality, with an increase in investment this year, there is a decrease in the number of citizens leaving or an increase if the balance is positive. For example, computing the correlation with the year change for several small cities gave a value of 0.77.

The analysis shows that the relative importance of features in the processed set is quite reasonable. This provides a strong basis for conducting a series of experiments to assess the ability to forecast migration based on the collected data and selected features using a machine learning approach.

It is important to note that the dataset contains only relatively large settlements (population size is over 100 thousand), while small settlements are of great interest from the point of view of the urban planning platform and of the overall goal of this research. Unfortunately, it is very difficult to collect such data, because the only source could be textual reports that are published on the authority websites of a particular settlement. In this regard, it was decided to try to forecast net migration in small settlements using the model that was trained on the large ones. To test this idea, several dozen small settlement examples were found and processed.

One of the critical reasons that limits the accuracy of the model is probably the small size of the training dataset. The solution to this problem can be achieved by generating synthetic data, which could improve the quality of model fitting and lead to more accurate forecasting of net migration not only for small settlements but also for settlements of any size.

4. Methodology

Our goal was to predict the migration balance based on socioeconomic indicators, which is naturally posed as a regression problem. As we mentioned before, we could not carry out proper data collection from small settlements. After the preprocessing procedure, we obtained the dataset $D_{r}$, where the lower index relates to the regression problem. This was the initial dataset used to predict the balance of population migration in small settlements. Specifically, the migration balance (saldo) attribute in this dataset is represented as a real number.

We conducted several experiments and found that regression models could not recognize the dependencies between regressors and the dependent variable “saldo”. Therefore, we decided to solve two tasks separately: (i) the identification of the sign of migration balance and (ii) the determination of the amount of migration balance. The experiment results are provided in Section 7.

The first task was to determine whether there was an outflow or an inflow. For this, we had to determine the sign of the migration balance (i.e., to determine if the migration is inbound or outbound) by solving the classification problem. This is a binary classification problem, which can be addressed using well-established techniques, such as logistic regression, decision trees, or ensembles of such models.

In the next step, we trained a regression model using the same set of features but with the dependent variable being the actual value from $\mathbb{R}$. Various options were explored in this work, including linear regression, random forest, and gradient boosting, among others.

Figure 4 shows the general pipeline of the proposed method. The blue section corresponds to the classification problem, and the red section represents the regression problem.

5. Problem Formulation and Synthetic Data

In a more formal way, we consider two datasets $D_{r}$ and $D_{c}$, which consist of $\{X_{i}, y_{i}\}_{i=1}^{N}$ and $\{X_{i}, l_{i}\}_{i=1}^{N}$, where $X_{i} \in \mathbb{R}^{h}$ is the feature vector, $y_{i} \in \mathbb{R}$ is the dependent variable, $l_{i} \in \{0, 1\}$ is the binary class, and $N$ is the number of records. The dimension of the feature space in our case is equal to $h = 13$. In the dataset $D_{c}$, the migration balance attribute takes the form of labels. Label 0 corresponds to an outflow of residents, while Label 1 corresponds to an inflow of residents.

The number of records in $D_{c}$ and $D_{r}$ is equal to $N$, and we assume that this number is not large enough to train a predictive model. In our case, $N < 50$. Here, we used the concept of synthetic data to augment the initial dataset.

In the first step, we trained the predefined generative model to cover the support of the joint distribution of $D_{c}$. After the generative model was trained, we obtained the generator, which took the noise vector as input and produced the synthetic sample $\tilde{D}_{j}$, where $j \in \{1, \dots, M_{c}\}$ and $M_{c}$ is the number of synthetic samples for the classification problem. After that, we trained the classifier, choosing $\tilde{X}_{j}$ from $\tilde{D}_{c}$ as the training attributes and the last column of $\tilde{D}_{c}$ as the labels $\tilde{l}_{j}$, where $j \in \{1, \dots, M_{c}\}$.

To train the classifier, we partitioned $\tilde{D}_{c}$ into train and validation samples with a ratio of 8:2. After several experiments, we chose the XGB classifier for our purposes. For hyperparameter tuning, we used the Bayesian optimization framework Optuna [61]. As for the objective function, we used the $F_{1}$ score, which was computed by cross-validation with five folds.
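The evaluation protocol above (an 8:2 partition and a five-fold cross-validated $F_{1}$ objective) can be sketched with NumPy alone. This is a hedged illustration, not the authors' setup: the XGB classifier and the Optuna tuner are replaced by a nearest-centroid stand-in, the data are random toy values, and all function names are hypothetical.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def centroid_fit_predict(X_tr, y_tr, X_te):
    """Stand-in classifier: assign each row to the nearer class centroid."""
    c0 = X_tr[y_tr == 0].mean(axis=0)
    c1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - c0, axis=1)
    d1 = np.linalg.norm(X_te - c1, axis=1)
    return (d1 < d0).astype(int)

def cross_val_f1(X, y, fit_predict, n_folds=5, rng=None):
    """Mean F1 over n_folds shuffled folds."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    scores = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[m] for m in range(n_folds) if m != k])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(f1_binary(y[test_idx], y_pred))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
# Toy stand-in for the synthetic classification sample (13 features, 2 classes)
X = np.vstack([rng.normal(-2.0, 1.0, (100, 13)), rng.normal(2.0, 1.0, (100, 13))])
y = np.repeat([0, 1], 100)
perm = rng.permutation(len(X))
split = int(0.8 * len(X))                    # 8:2 train/validation partition
train, valid = perm[:split], perm[split:]
cv_score = cross_val_f1(X[train], y[train], centroid_fit_predict, rng=rng)
valid_score = f1_binary(y[valid], centroid_fit_predict(X[train], y[train], X[valid]))
```

In a tuning loop, `cv_score` would be the value returned by the objective for one candidate set of hyperparameters, and the validation sample would be kept aside for the final check.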

After we trained the classifier and selected the optimal hyperparameters, we applied the classifier to a test sample, which was a subset of the small settlement dataset. The result is the predicted labels $\hat{l}_{i}$, where $i \in \{1, \dots, L\}$, and $L$ is the test sample size.

The second step involved the regression problem for $D_{r}$. We generated synthetic data $\breve{D}_{r}$ and trained the regression model. After we received the predicted labels $\hat{l}_{i}$ and predicted values of the balance $\hat{y}_{i}$, we combined the results by adjusting the regression model predictions, specifying the signs in accordance with those predicted by the classifier.

The issue we encountered is that, in contrast to the classification problem, in regression, training on synthetic data, despite showing improvement, did not yield the desired outcome. Therefore, we employed an additional approach here.

Specifically, after we generated the synthetic dataset $\breve{D}_{r}$ using the arbitrary generative model, we used another generative model, Wasserstein GAN (WGAN) [62], with an additional modification. The modification was to add an additional loss function to the generator, which would tell the generator how to transform synthetic samples to obtain a better score in the regression problem. For that, we first trained the neural network $\psi_{\zeta}$ to predict $y_{i}$, where $\zeta$ denotes the learnable parameters. The network $\psi_{\zeta}$ has several fully connected layers with a ReLU activation function between them, except after the last layer.
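The architecture of the auxiliary regressor can be illustrated by a NumPy forward pass. The hidden layer widths (32 and 16) and the random initialization are assumptions made for this sketch, since the text does not specify them, and training is omitted.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights and zero biases for consecutive fully connected layers."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, X):
    """Fully connected layers with ReLU between them and a linear last layer."""
    a = X
    for k, (W, b) in enumerate(params):
        a = a @ W + b
        if k < len(params) - 1:              # no activation after the last layer
            a = np.maximum(a, 0.0)
    return a[:, 0]                           # one real-valued prediction per row

rng = np.random.default_rng(0)
params = init_mlp([13, 32, 16, 1], rng)      # 13 input features, assumed widths
y_pred = mlp_forward(params, rng.normal(size=(5, 13)))
```

Leaving the last layer linear lets the network output negative as well as positive balances.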

The idea behind GAN is to find a measure with an absolutely continuous density function

q (x)

on

(R^{h}, B (R^{h}))

, where

x \in R^{h}

, that is close to the initial distribution of the data

\breve{D}_{r}

with the absolutely continuous density

p (x)

. This measure can be found as the push-forward of some simple distribution

γ

under the transformation

g_{θ}

(the generator), i.e., as

γ \circ g_{θ}^{- 1}

, where

γ

is a probability measure on

(R^{d}, B (R^{d}))

and

θ

denotes the parameters of the neural network. The second network, called the discriminator

d_{η}

, is used to assess the quality of the synthetic data, where

η

denotes its trainable parameters. In the basic GAN [49], the discriminator outputs a probability, whereas in WGAN [62], a modification of the basic GAN, it outputs a real number. Accordingly, the authors of WGAN proposed a new loss function based on the Kantorovich–Rubinstein duality [63]:

W (p, q) = \sup_{\| d_{η} \|_{L} \leq 1} \left\{ E_{p (x)} [d_{η} (x)] - E_{q (x)} [d_{η} (x)] \right\},

(1)

where

E_{p (x)} [\cdot]

denotes the mathematical expectation under the measure with the density function

p (x)

, and the supremum is taken over all functions with a Lipschitz constant of at most 1.
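For intuition about the quantity in Equation (1), the one-dimensional case is convenient: for two samples of equal size, the Wasserstein-1 distance between their empirical distributions equals the mean absolute difference between the sorted values. The sketch below uses this property on toy data and compares it with the value attained by one 1-Lipschitz function in the dual form. The data and function names are illustrative and not taken from the study.

```python
def wasserstein_1d(sample_p, sample_q):
    """Wasserstein-1 distance between two equal-sized 1-D samples."""
    if len(sample_p) != len(sample_q):
        raise ValueError("this shortcut needs samples of equal size")
    xs, ys = sorted(sample_p), sorted(sample_q)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


def dual_value(critic, sample_p, sample_q):
    """E_p[d(x)] - E_q[d(x)] for a fixed critic d, as in Equation (1)."""
    mean_p = sum(critic(x) for x in sample_p) / len(sample_p)
    mean_q = sum(critic(x) for x in sample_q) / len(sample_q)
    return mean_p - mean_q


real = [0.0, 1.0, 2.0, 3.0]
shifted = [x + 0.5 for x in real]

distance = wasserstein_1d(real, shifted)
# d(x) = -x is 1-Lipschitz and attains the supremum for a pure shift.
attained = dual_value(lambda x: -x, real, shifted)
```

Both values equal 0.5 here, the size of the shift. In higher dimensions no such shortcut exists, which is why WGAN approximates the supremum with a trained network.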

Finally, the discriminator has the following loss function:

L_{d} = \max_{η} \left\{ E_{x \sim p (x)} [d_{η} (x)] - E_{z \sim p_{g} (x)} [d_{η} (g_{θ} (z))] \right\},

(2)

where

p_{g} (x)

is the density function, which corresponds to the generator. Our modified loss function for WGAN has the following form:

\underbrace{\max_{θ} \left\{ - E_{z \sim p_{g} (x)} [d_{η} (g_{θ} (z))] \right\}}_{L_{g}} + \underbrace{\min_{θ} \left\{ MSE (ψ_{ζ} (g_{θ} (z)_{[0 : - 1]}), \tilde{y}) \right\}}_{L_{reg}},

(3)

where

g_{θ} {(z)}_{[0 : - 1]}

denotes all columns of the generated sample except the last one (a slice from column 0 up to, but not including, column

- 1

), and

MSE

is the mean squared error. The result of our modified WGAN is the synthetic dataset

\tilde{D}_{r} = \{ (\tilde{X}_{j}, \tilde{y}_{j}) \}_{j = 1}^{M_{r}}

, where

{\tilde{X}}_{j} \in R^{h}

and

{\tilde{y}}_{j} \in R

.
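The two objectives can be written down compactly once the discriminator and the regression network are treated as plain functions. The sketch below is an illustration under stated assumptions, not the training code of the study: it follows the common WGAN convention in which the generator minimizes the negated critic score, it sums the two generator terms without a weighting factor, and it assumes that the target is stored in the last column of each generated row. The toy critic and regressor are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def critic_objective(critic, real_rows, fake_rows):
    """Equation (2): the critic maximizes E_real[d(x)] - E_fake[d(g(z))]."""
    return mean(critic(r) for r in real_rows) - mean(critic(f) for f in fake_rows)


def generator_loss(critic, regressor, fake_rows):
    """Quantity minimized by the generator in this sketch: L_g + L_reg."""
    adversarial = -mean(critic(row) for row in fake_rows)
    # row[:-1] are the generated features, row[-1] is the generated target.
    regression = mean((regressor(row[:-1]) - row[-1]) ** 2 for row in fake_rows)
    return adversarial + regression


# Toy stand-ins for the trained networks (assumptions for the example).
toy_critic = lambda row: sum(row)
toy_regressor = lambda features: sum(features)

fake = [[1.0, 2.0, 3.0], [0.0, 1.0, 2.0]]
real = [[1.0, 1.0, 1.0]]

d_value = critic_objective(toy_critic, real, fake)
g_value = generator_loss(toy_critic, toy_regressor, fake)
```

The regression term is zero only when the frozen regressor reproduces the generated target from the generated features, which pushes the generator toward rows with a consistent feature-target relationship.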

Finally, we trained the regression model on

{\tilde{D}}_{r}

and evaluated it on the test sample, obtaining

\hat{y}_{i}, i \in \{1, \dots, L\}

. Then, we adjusted

{\hat{y}}_{i}

with

{\hat{l}}_{i}

and obtained the final target prediction.

6. Experimental Study

6.1. Data Description

We investigated migration forecasting in small settlements in the Russian Federation. We had a wide range of databases available for large cities with populations exceeding 100,000 people. The main socioeconomic indicators of cities were found in the database of Rosstat (https://rosstat.gov.ru, accessed on 20 August 2024), which has been published from 2004 up to now (it is published once every two years). In total, 13 socioeconomic features were selected, and the total number of examples was more than 2000. The dataset contains information for approximately 180 cities from 2010 to 2022. The migration balance target for each city was shifted forward by one year for each data sample for forecasting purposes. The analysis of this collection highlighted its high relevance to the objectives of the current study. Data on large cities provide information on various factors such as macroeconomic indicators, migration patterns, and the overall level of regional development. Based on these data, indicators that were significant for forecasting migration balances were identified. Based on the selected criteria, we manually collected data from the websites of relevant government agencies (http://radm.gtn.ru, https://new.mo-siverskoe.ru, accessed on 20 August 2024) and other sources that publish open access information about small regions and cities (https://old.citylifeindex.ru/database?pageType=CITIES, accessed on 20 August 2024).
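The one-year shift of the target can be sketched as below: features observed in year t are paired with the migration balance of year t + 1 for the same city. The record layout and the field names (`city`, `year`, `balance`, `salary`) are hypothetical and chosen only for the example.

```python
def shift_target(records):
    """Pair the features of year t with the balance of year t + 1."""
    by_key = {(r["city"], r["year"]): r for r in records}
    samples = []
    for (city, year), row in sorted(by_key.items()):
        following = by_key.get((city, year + 1))
        if following is None:
            continue  # no next-year balance available, so the row is dropped
        features = {k: v for k, v in row.items()
                    if k not in ("city", "year", "balance")}
        samples.append({"city": city, "year": year, **features,
                        "target": following["balance"]})
    return samples


records = [
    {"city": "A", "year": 2010, "salary": 10.0, "balance": 5},
    {"city": "A", "year": 2011, "salary": 11.0, "balance": -3},
    {"city": "A", "year": 2012, "salary": 12.0, "balance": 7},
    {"city": "B", "year": 2010, "salary": 20.0, "balance": 1},
]
samples = shift_target(records)
```

The last available year of each city has no target and is therefore lost, which slightly reduces the number of usable examples.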

We also prepared both datasets, for small and large cities, by removing outliers. Because data were scarce, we applied a permissive threshold: only observations lying outside the interquartile range extended by a factor of three on both sides were removed.
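One common reading of this rule is a Tukey-style filter with fences at Q1 - 3·IQR and Q3 + 3·IQR. The sketch below implements that reading for a single column. The exact fences and the quantile method are assumptions, since the text does not specify them.

```python
import statistics


def iqr_bounds(values, factor=3.0):
    """Lower and upper fences: Q1 - factor * IQR and Q3 + factor * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def remove_outliers(values, factor=3.0):
    low, high = iqr_bounds(values, factor)
    return [v for v in values if low <= v <= high]


column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # toy data with one extreme value
cleaned = remove_outliers(column)
```

A factor of three is more permissive than the usual 1.5, so fewer rows are discarded, which matters when the dataset is already small.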

6.2. Synthetic Data Quality Assessment

In this section, we present the results of the best generative model that can approximate the multivariate attributes’ distribution and augment the datasets with additional records for both the regression problem

D_{r}

and the classification problem

D_{c}

.

We used three predefined generative models on our data and the best one was chosen for the proposed pipeline. We trained the state-of-the-art framework for tabular data generation, CTGAN [55], and its generalization, TVAE [55], which has the same underlying idea but is applied in the context of a variational autoencoder. The third model was Copula GAN (https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/copulagansynthesizer, accessed on 20 August 2024). It is also a variation of the CTGAN model, which utilizes the CDF-based transformation applied by Gaussian copulas.

We estimated the synthetic data quality using several metrics, namely the correlation similarity between all attributes [64], Jensen–Shannon divergence [65], the total variation distance for the target attribute (saldo) for the classification dataset

D_{c}

, and classifier two-sample test (C2ST) [66], to identify the ability of machine learning models to find the difference between synthetic and real data. We trained all models for 5000 epochs and generated 100 synthetic datasets of the length N for statistical significance, and for each iteration, we computed the identified metrics. Finally, we calculated the average values, which are presented in Table 2.

The underlying principle of C2ST was to label real data with 0 and synthetic data with 1. After that, we had to train a classifier to find the difference between the two data distributions. The optimal value for the statistic in the two-sample classifier test was

0.5

, as this indicated that the synthetic and real data had a similar distribution. In other words, the classifier could not distinguish between the two datasets and generated a random variable following a Bernoulli distribution with a probability of

0.5

.
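The labeling scheme can be illustrated with a very small classifier. The sketch below uses a 1-nearest-neighbor rule and a single hold-out split to keep the example self-contained. The classifier, the split ratio, and the toy data are assumptions and do not reproduce the setup behind Table 2.

```python
import random


def nearest_label(point, train):
    # 1-nearest-neighbor prediction with squared Euclidean distance
    closest = min(train, key=lambda item: sum((a - b) ** 2
                                              for a, b in zip(point, item[0])))
    return closest[1]


def c2st_accuracy(real_rows, synthetic_rows, test_share=0.5, seed=0):
    """Hold-out accuracy of a classifier separating real (0) from synthetic (1)."""
    data = [(row, 0) for row in real_rows] + [(row, 1) for row in synthetic_rows]
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_share)
    test, train = data[:n_test], data[n_test:]
    hits = sum(nearest_label(x, train) == label for x, label in test)
    return hits / n_test


# Toy data: the "synthetic" sample is far away from the "real" one.
real_rows = [[i * 0.01] for i in range(20)]
far_rows = [[10.0 + i * 0.01] for i in range(20)]
accuracy_far = c2st_accuracy(real_rows, far_rows)
```

For clearly different samples the accuracy approaches 1, whereas for samples from the same distribution it fluctuates around 0.5.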

For the null distribution in C2ST, we used the normal distribution

N (\frac{1}{2}, \frac{1}{4 N})

[66]. The null hypothesis states that the two distributions are equal; the alternative states that they are distinct. The p-values are provided in brackets. We chose

0.05

as the significance level; the null hypothesis is rejected when the p-value is below this level and the statistic is greater than

0.5

. If the p-value is greater than the significance level, the null hypothesis cannot be rejected.
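Given the normal approximation above, the p-value of an observed C2ST statistic can be computed directly. The sketch assumes a one-sided test (only accuracies above 0.5 count as evidence against the null hypothesis) and treats `n` as the sample size N from the formula. Both the function name and the example numbers are illustrative.

```python
from statistics import NormalDist


def c2st_p_value(accuracy, n):
    """One-sided p-value of the C2ST accuracy under N(1/2, 1/(4n))."""
    null = NormalDist(mu=0.5, sigma=(1.0 / (4.0 * n)) ** 0.5)
    return 1.0 - null.cdf(accuracy)


p_chance = c2st_p_value(0.5, 100)  # accuracy exactly at chance level
p_high = c2st_p_value(0.6, 100)    # two standard deviations above chance
reject = p_high < 0.05
```

With n = 100 the standard deviation of the null distribution is 0.05, so an accuracy of 0.6 lies two standard deviations above 0.5 and leads to rejection at the 0.05 level.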

The results show that TVAE generates data on small cities better than the other models, achieving the smallest Jensen–Shannon divergence and the best correlation similarity. For the categorical attribute saldo, the total variation distance is also better in the case of TVAE. C2ST does not provide evidence to reject the null hypothesis of equal distributions, although this does not mean that the null hypothesis must be accepted. Nevertheless, the statistic value is closest to

0.5

in the TVAE case.

Figure 5 shows the p-values for the Kolmogorov–Smirnov two-sample test between real and synthetic data by each column. The null hypothesis states that the two distributions are equal. If the p-value takes a value less than the significance level, then we reject the null hypothesis and can state that one sample differs from another. In our case, samples are the columns. The black horizontal line indicates the significance level, and the bars indicate the p-values. We set the significance level at

0.05

.
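The column-wise comparison can be sketched with a plain implementation of the two-sample Kolmogorov–Smirnov statistic. The p-value below uses the asymptotic Kolmogorov series, which is an approximation for small samples. The column names and values are toy examples.

```python
import math


def ks_statistic(a, b):
    """Largest gap between the two empirical distribution functions."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_p_value(d, n, m, terms=100):
    """Asymptotic p-value based on the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return 1.0  # the survival function is numerically 1 for small lam
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))


real = {"lon": [float(i) for i in range(50)]}
synthetic = {"lon": [float(i) + 100.0 for i in range(50)]}
p_values = {
    col: ks_p_value(ks_statistic(real[col], synthetic[col]),
                    len(real[col]), len(synthetic[col]))
    for col in real
}
rejected = [col for col, p in p_values.items() if p < 0.05]
```

Columns that end up in `rejected` correspond to the bars that fall below the significance line.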

For the CTGAN model, we can reject the null hypothesis of equal distributions for several columns. This implies that several attributes in the synthetic dataset differ significantly in distribution from the corresponding attributes in the real dataset. By contrast, TVAE and Copula GAN showed much better results. TVAE differs from the real dataset in the volume of annual construction (conscap). For Copula GAN, the differing columns are longitude (lon) and the number of unemployed citizens (unemployed).

Based on the experimental study of training various generative models on our dataset, the TVAE was chosen for small cities’ data augmentation.

6.3. Migration Forecasting Assessment

Table 3 shows the results for the classification problem obtained with different training datasets and evaluated on the test data. The test data were not used in the training process of the generator and were set aside for evaluating the performance of the model. We chose the

F_{1}

score [67] and the ROC AUC score [68] to assess the quality of the classifier. The classifier trained on the synthetic data

{\tilde{D}}_{c}

demonstrated the best performance. Generating new samples increased the coverage of the data distribution relative to the initial small dataset, which improved the overall quality of the results.

Figure 6 shows the qualitative analysis of the classification problem. We plotted the receiver operating characteristic (ROC) curve, which illustrates the performance of a binary classifier model (Figure 6a). The X-axis corresponds to the false-positive rate (1 - specificity), and the Y-axis corresponds to the true-positive rate (recall). Classifiers that produce curves that are closer to the upper left corner of the graph indicate better performance. By contrast, a random classifier would be expected to produce points that lie along the diagonal.
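Both quality measures can be computed from first principles, which makes the axes of the ROC plot explicit. The sketch below sweeps the decision threshold to obtain (false-positive rate, true-positive rate) pairs and integrates them with the trapezoidal rule. The labels and scores are toy values, not results from Table 3.

```python
def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, scores))
        fp = sum(t == 0 and s >= threshold for t, s in zip(y_true, scores))
        points.append((fp / negatives, tp / positives))
    return points


def roc_auc(points):
    # Trapezoidal rule over consecutive ROC points
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(roc_points(y_true, scores))
f1 = f1_score(y_true, [0, 1, 0, 1])
```

A random classifier yields points near the diagonal and an area of about 0.5, while a perfect one reaches the upper left corner and an area of 1.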

Figure 6b shows the feature impact using the Shapley (SHAP) values. Positive Shapley values indicate a positive impact on the model’s output; in other words, the higher the value on the right side of zero, the greater the magnitude of the migration balance. Conversely, a lower Shapley value corresponds to a lower forecast value. The blue color represents low values of the attribute, while the red color represents high values. A high concentration of values in a particular area suggests that there are instances in the dataset that influence the model’s predictions in a similar manner. The Y-axis corresponds to the features that are arranged in accordance with feature importance by the average absolute SHAP values (mean

(| SHAP value |)

). As the model deals both with positive and negative migration balance, these values represent both the pull and push scenarios in the migration process.

For instance, we observe that low investment levels increase the predicted migration balance, whereas high levels decrease it. We observed the same effect in the correlation plot (Figure 3). Additionally, we note a clear clustering of data at three specific points. A lower retail turnover also appears to attract migrants: turnover is measured in monetary terms, so a higher value often means more expensive goods for customers. However, this is not always the case, as evidenced by the blue dots mixed with the red ones on the negative side of the impact plot.

Table 4 shows the results of the regression problem on the real test data. We trained four regression models, namely linear regression, XGBoost, random forest, and multilayer perceptron (MLP), on four datasets: real data on small cities, real data on large cities, synthetic data for small cities, and synthetic data for small cities with classifier correction. We calculated the MAE in terms of people, which directly indicates the number of individuals who departed from or arrived in a specific settlement. For statistical significance, we shuffled the dataset and split it into training and test samples at a ratio of 8:2, repeating this procedure 100 times. We also determined the standard deviation and the 95% confidence interval for the mean error. The MSE was calculated on normalized values. In Table 4, we show the average results weighted by the population size. Relative metrics such as the mean absolute percentage error (MAPE) did not suit our task because the migration balance, measured as a number of migrants, may be close to zero, which distorts the evaluation. The symmetric MAPE (sMAPE) only partially solves this problem. We did not use the coefficient of determination because the available real dataset was quite small, which would raise questions about the statistical significance of the obtained results.
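The repeated shuffle-and-split protocol can be sketched as follows. The model is replaced by a trivial mean predictor, and the 95% interval uses a normal approximation (1.96 standard errors). Both choices are assumptions made to keep the example short, and population weighting is omitted.

```python
import random
import statistics


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def repeated_holdout(rows, fit, n_repeats=100, test_share=0.2, seed=0):
    """rows: list of (features, target); fit returns a predict function."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_share))
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = fit(train)
        errors.append(mae([y for _, y in test], [predict(x) for x, _ in test]))
    mean_error = statistics.fmean(errors)
    std_error = statistics.stdev(errors)
    half_width = 1.96 * std_error / len(errors) ** 0.5  # normal approximation
    return mean_error, std_error, (mean_error - half_width, mean_error + half_width)


def fit_mean(train):
    """Hypothetical baseline: always predict the mean training target."""
    average = statistics.fmean(y for _, y in train)
    return lambda features: average


rows = [([float(i)], float(i)) for i in range(20)]  # toy dataset
mean_error, std_error, interval = repeated_holdout(rows, fit_mean)
```

Any regressor can be plugged in through `fit`, as long as it returns a function that maps a feature vector to a prediction.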

Figure 7 shows the qualitative analysis of the regression problem after specifying the signs. Figure 7a is a scatter plot in which the X-axis represents the true values of the migration balance and the Y-axis represents the predicted values, both measured in the number of migrants. In the ideal case, all points lie on the identity line (red), which corresponds to complete agreement between the forecast and the real migration balance. We used the ordinary least squares (OLS) method to fit a regression line, with the true values as the independent variable and the predicted values as the dependent variable. We also constructed the confidence interval (CI) for the regression values, which has the following form:

CI (y) = α_{0} + α_{1} \cdot y \pm t_{1 - \frac{α}{2}}^{(L - 2)} \cdot \hat{σ} \cdot \sqrt{\frac{1}{L} + \frac{{(y - μ_{y})}^{2}}{\sum_{i = 1}^{L} {(y_{i} - μ_{y})}^{2}}},

(4)

where

t_{1 - \frac{α}{2}}^{(L - 2)}

is the critical value of the Student distribution at level

α

and

L - 2

is the degrees of freedom,

\hat{σ} = \sqrt{\frac{\sum_{i = 1}^{L} {({\hat{y}}_{i} - y_{i})}^{2}}{L - 2}}

is the standard error of estimate,

μ_{y} = \frac{1}{L} \sum_{i = 1}^{L} y_{i}

, L is the test sample size, and

α_{0}, α_{1}

are the parameters of the OLS regressor.

In the ideal case, the confidence interval constructed around the OLS line covers the identity line.
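Equation (4) can be evaluated with a few lines of code. The sketch below fits the OLS line of predictions on true values and computes the interval at a given point, with the standard error taken exactly as written in Equation (4). The critical value of the Student distribution is passed in by the caller because the Python standard library has no Student quantile function. The data are toy values.

```python
import math


def ols_fit(y_true, y_pred):
    """Regress predictions on true values: y_pred ~ a0 + a1 * y_true."""
    n = len(y_true)
    mu_x, mu_y = sum(y_true) / n, sum(y_pred) / n
    sxx = sum((x - mu_x) ** 2 for x in y_true)
    a1 = sum((x - mu_x) * (y - mu_y) for x, y in zip(y_true, y_pred)) / sxx
    return mu_y - a1 * mu_x, a1


def confidence_band(y, y_true, y_pred, t_crit):
    """Equation (4) at a true value y; t_crit is the Student quantile, L - 2 df."""
    size = len(y_true)
    a0, a1 = ols_fit(y_true, y_pred)
    mu = sum(y_true) / size
    # Standard error of estimate as written in Equation (4)
    sigma = math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / (size - 2))
    spread = sum((t - mu) ** 2 for t in y_true)
    half = t_crit * sigma * math.sqrt(1 / size + (y - mu) ** 2 / spread)
    centre = a0 + a1 * y
    return centre - half, centre + half


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
# 4.303 is the 0.975 quantile of the Student distribution with 2 degrees of freedom.
low, high = confidence_band(2.5, y_true, y_pred, t_crit=4.303)
covers_identity = low <= 2.5 <= high
```

Checking `covers_identity` over a grid of true values is one way to test whether the band contains the identity line.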

Figure 7b shows the feature importance diagram, the same as for the classification problem. The investment attribute has the same structure as in Figure 6b but with one large cluster. A poor separation can be observed in the magnitude of the feature for the annual industrial capacity, the food service turnover, and the average salary.

7. Discussion

The results of the experimental study (Section 6) show that the proposed method of synthetic data generation allows for obtaining datasets on social and economic states for model training regarding both classification and regression problems. As can be seen in Table 2, the generated data produced by the TVAE method are similar to the real data with fairly good metrics of a correlation similarity of >0.95 and a Jensen–Shannon divergence of about

0.3

. We used CTGAN and TVAE because they are popular and widely used and more lightweight than diffusion models or other advanced approaches. However, to increase the diversity of synthetic data, it is possible to use more advanced models, which will most likely only increase the overall effectiveness of migration forecasting.

We used data on towns from completely different regions of Russia. These regions differ greatly in terms of socioeconomic development. This gives us reason to believe that the main ideas of the research could be successfully applied to forecasting migration flows in other places as well. The motivation for our study was the lack of data on migration balance in small settlements and the difficulties associated with obtaining them. Currently, the collected data are insufficient to correctly assess the generalizability of the developed models. As new data are collected, we plan to investigate this in future studies.

The use of an ensemble of a classifier and a regressor for migration balance forecasting can be explained by considering Figure 6b and Figure 7b. The Shapley value analysis of the features, including the number of unemployed citizens, the average salary, the food service turnover, and the volume of annual construction, indicates that a higher value for each of these features may lead to either a positive or negative impact on the regression results, depending on the specific circumstances. The separate classification model can correct this ambiguity.

The scatter plot in Figure 7a indicates that our adjusted regression model can accurately predict the migration balance for small settlements, with only minor deviations from the identity line in several instances in the test sample.

We can also observe a significant level of heteroscedasticity both in the initial data, where the standard deviation of the migration balance grows with the population size of the settlement, and in the forecasting models, where the prediction error grows with the migration balance. Experiments show that normalizing the data by the population size hardly helps with this problem, but training the models on synthetic data reduces this effect and thereby improves the quality of the forecast. We confirmed this by training two similar gradient boosting models. One uses the real-world dataset of large cities, while the other uses the synthetic data built on it. Both models are tested on the same set of real-world data. The MAE for the first model is 7391, while for the second, it is 6790 people; thus, the quality improvement is about

8 %

.
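The quoted improvement follows from the two MAE values as a relative reduction of the error:

```latex
\frac{7391 - 6790}{7391} = \frac{601}{7391} \approx 0.081 \approx 8\%
```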

A thoughtful analysis of the requirements leads to the idea of developing a universal migration forecasting model that would be useful for any type of settlement. Besides direct forecasting on real (current) data, the general idea is in the ability of such a tool to test different values for each location’s feature, which is in fact a kind of simulation that can be used with a certain degree of reliability to assess the impact of some policy on the migration flow in a certain settlement. As we have already revealed, there are two types of migration: one for large cities and agglomerations and the other for small settlements. So, the method used for building a universal model is a classification that can recognize the type of migration, followed by the application of the pretrained model or ensemble according to the predicted type.

Migration balance forecasting in small towns itself is essential for understanding the demographic and economic shifts. It helps in identifying areas of growth and helps inform policies and investments that can support the development of these communities. Moreover, to address social and economic factors, the models that we present in this research can provide tools for evaluating plans and projects for territory development, as they can show if some transformation will make the settlement attractive or repulsive for its inhabitants. Last but not least, the benefit of our method is its ability to analyze the migration balance at a small scale, which makes it useful for a wide range of local problems and challenges.

We applied an adaptive prediction model based on gradient boosting to determine the extent to which it is possible to obtain a high-quality forecast using our data. Another important aspect of this research is the hybrid approach to migration forecasting compared with a single regression model. The forecast error is already comparable to the accuracy of the source data, which is sufficient for the given task. Therefore, improved models have the potential to outperform the baseline model. Furthermore, our estimate can serve as the basis for all other models.

The advantage of the hybrid model over the single model approach is due to the fact that we are dealing with processes of a slightly different nature. We are now moving toward a hybrid approach that combines the three models. In further work, we plan to show that the same features have completely different levels of importance when it comes to predicting inflow, outflow, or even direction (classifier). The classifier determines the direction that minimizes the heterogeneity, and the regressor identifies the dependencies. This is the main reason why a hybrid model could be a better solution.

8. Conclusions

In our research, we investigated the problem of training predictive models when the amount and quality of training data on the social and economic states of small settlements are insufficient. The proposed method of synthetic data generation, applied to the problem of migration balance forecasting, showed that models trained on the artificial dataset forecast with higher accuracy than similar models trained on a small and incomplete set of real-world data.

The other finding of our work concerns migration balance forecasting itself. Predicting the sign of the balance (i.e., the direction of migration) and the number of migrants separately helps to increase the forecast quality. The artificial training datasets should be different for the classification problem of migration direction and the regression problem of the number of migrants, whereas the real data used for making a forecast may be the same.

Moreover, during our research, we investigated the influence of social and economic factors on migration tendencies, which are significantly different for small settlements compared to large cities and agglomerations.

The obtained research results and developed models will be useful for value-based modeling of community development programs. They are intended for the optimal formation of a series of projects (events) for the transformation of the territory, including the placement of infrastructure facilities, changing the factors of migration dynamics, and the functional purpose of individual territories in order to achieve the target values of social and economic development indicators. The value of the task is ensured within the framework of forecasting the social risks caused by conflicts of goals and the needs of the population and measures to transform the territory. When implementing the models, it is assumed that these cognitive technologies will be used to predict social risks.

Author Contributions

Conceptualization, A.A., K.Z., A.K. and A.B.; methodology, K.Z., A.A. and A.K.; software, K.Z., A.A. and A.K.; validation, K.Z. and A.A.; data curation, A.A.; writing—original draft preparation, K.Z.; writing—review and editing, A.K. and A.B.; supervision, A.K. and A.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Analytical Center for the Government of the Russian Federation (IGK 000000D730324P540002), agreement No. 70-2021-00141.

Data Availability Statement

The original data presented in the study are openly available at https://rosstat.gov.ru and https://old.citylifeindex.ru/database?pageType=CITIES. The processed data are stored at https://github.com/AlgoMathITMO/Migration-forecasting/tree/main/Data (all accessed on 20 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Albino, V.; Berardi, U.; Dangelico, R.M. Smart cities: Definitions, dimensions, performance, and initiatives. J. Urban Technol. 2015, 22, 3–21.
Hammoumi, L.; Maanan, M.; Rhinane, H. Characterizing Smart Cities Based on Artificial Intelligence. Smart Cities 2024, 7, 1330–1345.
Lombardi, P.; Giordano, S.; Farouh, H.; Yousef, W. Modelling the smart city performance. Innov. Eur. J. Soc. Sci. Res. 2012, 25, 137–149.
Ho, G.T.S.; Tsang, Y.P.; Wu, C.H.; Wong, W.H.; Choy, K.L. A computer vision-based roadside occupation surveillance system for intelligent transport in smart cities. Sensors 2019, 19, 1796.
Neirotti, P.; De Marco, A.; Cagliano, A.C.; Mangano, G.; Scorrano, F. Current trends in Smart City initiatives: Some stylised facts. Cities 2014, 38, 25–36.
Kirimtat, A.; Krejcar, O.; Kertesz, A.; Tasgetiren, M.F. Future trends and current state of smart city concepts: A survey. IEEE Access 2020, 8, 86448–86467.
Mehaffy, M.W. Generative methods in urban design: A progress assessment. J. Urban. 2008, 1, 57–75.
Geiger, A.; Lauer, M.; Urtasun, R. A generative model for 3d urban scene understanding from movable platforms. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1945–1952.
Korczak, J.; Kijewska, K. Smart Logistics in the development of Smart Cities. Transp. Res. Procedia 2019, 39, 201–211.
García, C.G.; Meana-Llorián, D.; G-Bustelo, B.C.P.; Lovelle, J.M.C.; Garcia-Fernandez, N. Midgar: Detection of people through computer vision in the Internet of Things scenarios to improve the security in Smart Cities, Smart Towns, and Smart Homes. Future Gener. Comput. Syst. 2017, 76, 301–313.
Hodorog, A.; Petri, I.; Rezgui, Y. Machine learning and Natural Language Processing of social media data for event detection in smart cities. Sustain. Cities Soc. 2022, 85, 104026.
Bijak, J. Forecasting International Migration: Selected Theories, Models, and Methods; Central European Forum for Migration Research: Warsaw, Poland, 2006.
Vanella, P.; Deschermeier, P. A Stochastic Forecasting Model of International Migration in Germany; Verlag Barbara Budrich: Leverkusen, Germany, 2018; pp. 261–280.
Fuchs, J.; Söhnlein, D.; Vanella, P. Migration forecasting—Significance and approaches. Encyclopedia 2021, 1, 689–709.
Smailes, P.; Hugo, G. Rural communities and small area forecasting: Some examples from South Australia. Aust. Geogr. Stud. 1982, 20, 159–182.
European Commission. Applying the Degree of Urbanisation: A Methodological Manual to Define Cities, Towns and Rural Areas for International Comparisons; OECD Regional Development Studies; OECD Publishing: Paris, France, 2021.
Cromartie, J.; Bucholtz, S. Defining the “Rural” in Rural America. Amber Waves: The Economics of Food, Farming, Natural Resources, and Rural America 2008, pp. 1–8. Available online: https://newprairiepress.org/cgi/viewcontent.cgi?article=1430&context=jiaee (accessed on 20 August 2024).
Pateman, T. Rural and urban areas: Comparing lives using rural/urban classifications. Reg. Trends 2011, 43, 11–86.
Dijkstra, L.; Hamilton, E.; Lall, S.; Wahba, S. How Do We Define Cities, Towns, and Rural Areas; World Bank Blogs: Washington, DC, USA, 2020; Volume 10.
Urbanski, M. Comparing push and pull factors affecting migration. Economies 2022, 10, 21.
Sudakova, A.E.; Tarasyev, A.A.; Sandler, D.G. A dynamic forecasting model for scientific migration in the region. Econ. Reg. 2021, 17, 1196–1209.
Fantazzini, D.; Pushchelenko, J.; Mironenkov, A.; Kurbatskii, A. Forecasting internal migration in Russia using Google Trends: Evidence from Moscow and Saint Petersburg. Forecasting 2021, 3, 774–804.
Wahba, J. Return Migration and Economic Development; Edward Elgar Publishing: Cheltenham, UK, 2014; pp. 327–349.
Bronitsky, G.; Vakulenko, E. Using Google Trends for external migration prediction. Demogr. Rev. 2022, 9, 75–92.
Golenvaux, N.; Alvarez, P.G.; Kiossou, H.S.; Schaus, P. An LSTM approach to Forecast Migration using Google Trends. arXiv 2020, arXiv:2005.09902.
Wu, W.Y.; Chen, S.P. A prediction method using the grey model GMC (1, n) combined with the grey relational analysis: A case study on Internet access population forecast. Appl. Math. Comput. 2005, 169, 198–217.
Tang, X.; Cai, X.; Zhang, R.; Jia, Y. Research and Simulation of Population Forecast Based on BP Neural Network. In Proceedings of the 2022 2nd International Conference on Electronic Information Engineering and Computer Technology (EIECT), Yan’an, China, 28–30 October 2022; pp. 302–305.
Zakria, M.; Muhammad, F. Forecasting the population of Pakistan using ARIMA models. Pak. J. Agric. Sci. 2009, 46, 214–223.
Wiśniowski, A.; Smith, P.W.; Bijak, J.; Raymer, J.; Forster, J.J. Bayesian population forecasting: Extending the Lee-Carter method. Demography 2015, 52, 1035–1059.
Wang, C.W.; Huang, H.C.; Liu, I.C. A quantitative comparison of the Lee-Carter model under different types of non-Gaussian innovations. Geneva Pap. Risk-Insur.-Issues Pract. 2011, 36, 675–696.
Wang, W. Forecasting The Population of China From 2020 To 2025 Based on Random Forest and Linear Regression. Highlights Sci. Eng. Technol. 2024, 85, 511–518.
Galasso, J.; Cao, D.M.; Hochberg, R. A random forest model for forecasting regional COVID-19 cases utilizing reproduction number estimates and demographic data. Chaos Solitons Fractals 2022, 156, 111779.
Wang, C.Y.; Lee, S.J. Regional population forecast and analysis based on machine learning strategy. Entropy 2021, 23, 656.
Wilson, T.; Grossman, I.; Alexander, M.; Rees, P.; Temple, J. Methods for small area population forecasts: State-of-the-art and research needs. Popul. Res. Policy Rev. 2022, 41, 865–898.
Frid-Adar, M.; Klang, E.; Amitai, M.; Goldberger, J.; Greenspan, H. Synthetic data augmentation using GAN for improved liver lesion classification. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018; pp. 289–293.
Fawaz, H.I.; Forestier, G.; Weber, J.; Idoumghar, L.; Muller, P.A. Data augmentation using synthetic data for time series classification with deep residual networks. arXiv 2018, arXiv:1808.02455.
Sweeney, L. Simple demographics often identify people uniquely. Health 2000, 671, 1–34.
Narayanan, A.; Shmatikov, V. How To Break Anonymity of the Netflix Prize Dataset. arXiv 2006, arXiv:cs/0610105.
Dwork, C.; Roth, A. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407.
Hao, S.; Han, W.; Jiang, T.; Li, Y.; Wu, H.; Zhong, C.; Zhou, Z.; Tang, H. Synthetic Data in AI: Challenges, Applications, and Ethical Implications. arXiv 2024, arXiv:2401.01629.
Brasseur, P.; Verron, J. The SEEK filter method for data assimilation in oceanography: A synthesis. Ocean. Dyn. 2006, 56, 650–661.
Wan, C.; Jones, D.T. Protein function prediction is improved by creating synthetic feature samples with generative adversarial networks. Nat. Mach. Intell. 2020, 2, 540–550.
Chatterjee, S.; Byun, Y.C. A synthetic data generation technique for enhancement of prediction accuracy of electric vehicles demand. Sensors 2023, 23, 594.
Bannur, N.; Shah, V.; Raval, A.; White, J. Synthetic Data Generation for Improved COVID-19 Epidemic Forecasting. medRxiv 2020.
Raymer, J.; Guan, Q.; Shen, T.; Wiśniowski, A.; Pietsch, J. Estimating international migration flows for the Asia-Pacific region: Application of a generation–distribution model. Migr. Stud. 2022, 10, 631–669.
Wang, Y.; Yao, X.; Liu, Y.; Li, X. Generating population migration flow data from inter-regional relations using graph convolutional network. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103238.
Lu, Y.; Shen, M.; Wang, H.; Wang, X.; van Rechem, C.; Wei, W. Machine learning for synthetic data generation: A review. arXiv 2023, arXiv:2302.04062.
Sivakumar, J.; Ramamurthy, K.; Radhakrishnan, M.; Won, D. GenerativeMTD: A deep synthetic data generation framework for small datasets. Knowl.-Based Syst. 2023, 280, 110956.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
Wiese, M.; Knobloch, R.; Korn, R.; Kretschmer, P. Quant GANs: Deep generation of financial time series. Quant. Financ. 2020, 20, 1419–1440.
Zakharov, K.; Stavinova, E.; Lysenko, A. TRGAN: A Time-Dependent Generative Adversarial Network for Synthetic Transactional Data Generation. In Proceedings of the 2023 7th International Conference on Software and E-Business, ICSeB ’23, Osaka, Japan, 21–23 December 2023; pp. 1–8.
Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114.
Kobyzev, I.; Prince, S.J.; Brubaker, M.A. Normalizing Flows: An Introduction and Review of Current Methods. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3964–3979.
Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M.H. Diffusion Models: A Comprehensive Survey of Methods and Applications. arXiv 2024, arXiv:2209.00796.
Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional GAN. Adv. Neural Inf. Process. Syst. 2019, 32, 7335–7345.
Yu, X.; He, Y.; Xu, Y.; Zhu, Q. A Mega-Trend-Diffusion and Monte Carlo based virtual sample generation method for small sample size problem. In Proceedings of the Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2019; Volume 1325.
Sivakumar, J.; Ramamurthy, K.; Radhakrishnan, M.; Won, D. Synthetic sampling from small datasets: A modified mega-trend diffusion approach using k-nearest neighbors. Knowl.-Based Syst. 2022, 236, 107687.
Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4765–4774.
Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67.
Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631.
Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875.
Edwards, D.A. On the Kantorovich–Rubinstein theorem. Expo. Math. 2011, 29, 387–398.
Endres, M.; Mannarapotta Venugopal, A.; Tran, T.S. Synthetic data generation: A comparative study. In Proceedings of the 26th International Database Engineered Applications Symposium, Budapest, Hungary, 22–24 August 2022; pp. 94–102.
Apellániz, P.A.; Jiménez, A.; Galende, B.A.; Parras, J.; Zazo, S. Synthetic Tabular Data Validation: A Divergence-Based Approach. arXiv 2024, arXiv:2405.07822.
Lopez-Paz, D.; Oquab, M. Revisiting classifier two-sample tests. arXiv 2016, arXiv:1610.06545.
Yacouby, R.; Axman, D. Probabilistic extension of precision, recall, and F1 score for more thorough evaluation of classification models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, 20 November 2020; pp. 79–91.
Flach, P.A. ROC analysis. In Encyclopedia of Machine Learning and Data Mining; Springer: Berlin/Heidelberg, Germany, 2016; pp. 1–8.

Figure 1. The structure of the data, obtained from the annual statistical reports of municipal authorities in Russia.

Figure 2. Feature importance measured as the average absolute Shapley value (mean(|SHAP value|)) for each attribute. Features are arranged in non-increasing order by mean(|SHAP value|), and their names are located along the Y-axis.
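
The ranking described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming the SHAP values are already available as one row per sample and one column per feature; the function name `mean_abs_shap` and the numbers are made up for the example and are not taken from the paper.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by the mean absolute SHAP value over all samples."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[k]) for row in shap_rows) / n
        for k, name in enumerate(feature_names)
    }
    # Non-increasing order, as in the figure.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values for two samples and three of the Table 1 features.
names = ["popsize", "avgsalary", "invest"]
rows = [
    [0.2, -0.6, 0.0],
    [-0.4, 0.2, 0.1],
]
ranking = mean_abs_shap(rows, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging keeps positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions still ranks high.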

Figure 3. Correlations between the features, shown to compare how the attributes depend on each other in large and small cities.

Figure 4. The framework of migration balance forecasting. Blue blocks represent the data generation and prediction models for the classification problem, and red blocks represent the same for the regression problem; $g_{\theta}$ is the generator, $d_{\eta}$ is the discriminator, and $L$ corresponds to the loss functions.

Figure 5. The p-values of the two-sample Kolmogorov–Smirnov test, applied column by column to test the null hypothesis that an attribute has the same distribution in the real and synthetic datasets.
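
A per-column check of this kind can be sketched in plain Python. The code below is an illustration under stated assumptions, not the authors' implementation: it uses the asymptotic Kolmogorov distribution for the p-value, which is only a rough approximation for samples as small as the one shown, and the column values are hypothetical. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used.

```python
import math


def ks_statistic(real, synth):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both empirical CDFs past every observation equal to x.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series is numerically 1 in this range
        return 1.0
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical values for one column of the real and synthetic tables.
real = {"popsize": [20.7, 35.2, 52.6, 70.1, 95.1]}
synth = {"popsize": [22.0, 33.8, 50.9, 73.4, 90.2]}

p_values = {
    col: ks_pvalue(ks_statistic(real[col], synth[col]), len(real[col]), len(synth[col]))
    for col in real
}
print(p_values)
```

A p-value above the chosen significance level (commonly 0.05) means the test does not reject equality of the two distributions for that column; it does not prove that they are identical.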

Figure 6. Qualitative analysis of the classification problem using the ROC curve and feature importance: (a) ROC curve. The X-axis shows the false-positive rate, and the Y-axis shows the true-positive rate; (b) feature impact on the training data. The names of the features are located along the Y-axis. On the X-axis, values above zero indicate a positive impact on the model’s prediction, and values below zero indicate a negative impact. The color coding corresponds to the magnitude of the feature’s values.

Figure 7. Qualitative analysis of the regression problem using a scatter plot and feature importance: (a) the scatter plot for the test data. OLS denotes the ordinary least squares fit to our results, and CI is the confidence interval for the OLS line; (b) feature impact on the training data. The names of the features are located along the Y-axis. On the X-axis, values above zero indicate a positive impact on the model’s prediction, and values below zero indicate a negative impact. The color coding corresponds to the magnitude of the feature’s values.

Table 1. Summary statistics of the small-city attributes. In total, 13 features were selected.

Name	Meaning	Min	Max	Median
popsize	Number of people (thousand people)	20.7	95.1	52.6
avgemployers	Average number of employees in organizations (thousand people)	5.3	29.3	11.1
unemployed	Number of unemployed persons (people)	106.0	2112.0	320.5
avgsalary	Average salary per month (rub.)	15,091.6	63,522.0	24,898.1
livarea	Living area per capita (sq.m.)	20.0	36.4	24.8
invest	Investments per capita (mln. rub.)	8.02	172.03	27.5
factoriescap	Amount of self-produced goods (mln. rub.)	258.5	49,354.7	22,271.1
conscap	Volume of annual construction (mln. rub.)	3.9	5651.7	323.5
consnewareas	Commissioning of residential buildings (thousand sq.m.)	0.0	66.5	10.2
retailturnover	Retail turnover (mln. rub.)	352.1	24,179.7	4853.1
foodservturnover	Food services turnover (mln. rub.)	14.4	886.9	183.8
lat	Settlement latitude (decimal degrees)	47.06	60.4	48.2
lon	Settlement longitude (decimal degrees)	28.05	40.1	39.2

Table 2. Synthetic data quality. The blue color corresponds to the best result.

Dataset	Model	Total Variation	Jensen–Shannon Divergence	Correlation Similarity	C2ST
$D_{c}$	Real	$1.000$	$0.000$	$1.000$	$0.500 (1.000)$
	CTGAN	$0.846$	$0.316$	$0.945$	$0.434 (0.452)$
	TVAE	$0.934$	$0.301$	$0.951$	$0.498 (0.688)$
	Copula GAN	$0.778$	$0.314$	$0.861$	$0.441 (0.737)$
$D_{r}$	Real	–	$0.000$	$1.000$	$0.500 (1.000)$
	CTGAN	–	$0.298$	$0.911$	$0.427 (0.711)$
	TVAE	–	$0.286$	$0.952$	$0.448 (0.773)$
	Copula GAN	–	$0.304$	$0.935$	$0.443 (0.706)$
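
Two of the column metrics in Table 2 can be illustrated on binned data. The sketch below makes two assumptions that the table does not state explicitly: the Total Variation column is treated as the complement 1 − TV (which would explain the value 1.000 in the Real row), and the Jensen–Shannon divergence uses base-2 logarithms so that it lies in [0, 1]. The function names and histogram counts are hypothetical.

```python
import math


def normalize(counts):
    """Turn raw bin counts into a probability vector."""
    total = sum(counts)
    return [c / total for c in counts]


def tv_complement(p, q):
    """1 minus the total variation distance; 1.0 means identical histograms."""
    return 1.0 - 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs; 0.0 means identical."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        # Bins with zero mass in x contribute nothing to the sum.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical bin counts of one attribute in the real and synthetic data.
real_hist = normalize([12, 30, 40, 18])
synth_hist = normalize([15, 28, 35, 22])

print(round(tv_complement(real_hist, synth_hist), 3))
print(round(js_divergence(real_hist, synth_hist), 3))
```

Under these assumptions, a synthetic sample is closer to the real one when the total variation complement is near 1 and the Jensen–Shannon divergence is near 0.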

Table 3. Classifier quality assessment. Each value is the average over 100 evaluations on randomized test samples, which makes the estimates more stable. The blue color corresponds to the best result.

Model	$F_{1}$	ROC AUC
${\tilde{D}}_{c}$	$0.91$	$0.96$
$D_{c}$	$0.36$	$0.56$
Large cities	$0.75$	$0.74$
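
For reference, the two metrics reported in Table 3 can be computed as in the sketch below. It is a plain-Python illustration with made-up labels and scores, not the evaluation code used in the paper; ROC AUC is computed through its rank interpretation, the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels: 1 = positive migration balance, 0 = negative.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]               # predicted probability of class 1
y_pred = [1 if s >= 0.6 else 0 for s in scores]

print(f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

An AUC near 0.5 corresponds to ranking no better than chance, while a value near 1 means positives are almost always scored above negatives.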

Table 4. Regression models’ quality assessment. Each value is the average over 100 evaluations on randomized test samples. We report the standard deviation (STD) of the errors and the 95% confidence interval (CI) for the mean error. The blue color represents the best result for each model.

Model	Data	$MSE \cdot 10^{-4} \pm STD \cdot 10^{-4}$ (normalized values)	$MAE \pm STD$ (95% CI) (people)
Linear Regression	$D_{r}$	$4.06 \pm 2.18$	$410 \pm 105 (\pm 21)$
	${\tilde{D}}_{r}$	$1.94 \pm 0.01$	$268 \pm 1 (\pm 1)$
	Large cities	$6.21 \pm 0.52$	$516 \pm 25 (\pm 5)$
	${\tilde{D}}_{r}$ + ${\tilde{D}}_{c}$	$1.73 \pm 0.04$	$239 \pm 4 (\pm 1)$
XGBoost	$D_{r}$	$4.08 \pm 2.48$	$359 \pm 108 (\pm 21)$
	${\tilde{D}}_{r}$	$0.48 \pm 0.05$	$127 \pm 6 (\pm 1)$
	Large cities	$18.9 \pm 8.75$	$873 \pm 170 (\pm 33)$
	${\tilde{D}}_{r}$ + ${\tilde{D}}_{c}$	$0.46 \pm 0.11$	$123 \pm 8 (\pm 2)$
Random Forest	$D_{r}$	$2.91 \pm 1.42$	$307 \pm 79 (\pm 15)$
	${\tilde{D}}_{r}$	$0.37 \pm 0.06$	$105 \pm 5 (\pm 1)$
	Large cities	$7.53 \pm 1.09$	$581 \pm 55 (\pm 11)$
	${\tilde{D}}_{r}$ + ${\tilde{D}}_{c}$	$0.36 \pm 0.04$	$104 \pm 5 (\pm 1)$
MLP	$D_{r}$	$23.16 \pm 10.89$	$1144 \pm 328 (\pm 91)$
	${\tilde{D}}_{r}$	$2.39 \pm 0.05$	$292 \pm 5 (\pm 2)$
	Large cities	$6.46 \pm 1.66$	$521 \pm 72 (\pm 34)$
	${\tilde{D}}_{r}$ + ${\tilde{D}}_{c}$	$2.33 \pm 0.05$	$280 \pm 4 (\pm 1)$
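
The way a mean error, its STD, and a 95% CI can be obtained from repeated evaluations is sketched below. The normal-approximation formula (half-width = 1.96 · STD / √n) is an assumption rather than something the paper states; it is, however, consistent with several rows of Table 4. For example, an STD of 105 over 100 runs gives a half-width of about 20.6, close to the reported ±21 for linear regression on $D_{r}$. The helper names and the sample errors are hypothetical.

```python
import math
import statistics


def ci_half_width(std, n, z=1.96):
    """Half-width of a normal-approximation confidence interval for a mean."""
    return z * std / math.sqrt(n)


def summarize(errors, z=1.96):
    """Return (mean, sample STD, CI half-width) for a list of per-run errors."""
    mean = statistics.fmean(errors)
    std = statistics.stdev(errors)
    return mean, std, ci_half_width(std, len(errors), z)


# Check against one row of Table 4: STD = 105 people over 100 runs.
print(round(ci_half_width(105, 100), 2))

# Hypothetical MAE values (people) from five repeated evaluations.
mean, std, half = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"{mean:.1f} +/- {std:.2f} (CI half-width {half:.2f})")
```

The STD describes the spread of the error across runs, while the CI half-width shrinks with the number of runs and describes the uncertainty of the averaged value itself.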

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
