1. Introduction
As a concept, a smart city integrates information and communication technology with the latest advancements in engineering, urban planning, and economic theory, among others, which are further enhanced by modern artificial intelligence (AI) technologies. In this field, the concept of smart cities is often seen as a strategic approach that aims to improve overall efficiency and effectiveness. Governments and public agencies are adopting this approach in order to differentiate their policies and programs and achieve sustainable development, promote economic growth, and improve the quality of life of their citizens [1].
The main objectives of smart cities are to ensure optimal resource utilization, promote renewable energy sources, enhance safety and environmental quality, optimize the efficient use of human resources, and leverage AI to analyze large volumes of data and control the city infrastructure [2,3,4]. The main driver for development is the adoption of IoT technologies in the form of a range of sensors and transmitters, which collect big data on the environment around them. These data are then transmitted to central servers, where they are analyzed and machine learning models are developed to solve specific problems [3,5].
The future development of smart cities will be focused on the complete digital transformation of the economy [5,6]. This will involve the use of generative planning models to create urban landscapes [7,8] as well as the optimization of logistics supply chains [9]. Additionally, computer vision technologies will enhance citizen integration with the city [4,10], while natural language models will facilitate communication between citizens and the city’s systems [11].
The problem of forecasting population migration is a crucial aspect of modern city government [12,13]. Understanding the movement of citizens between different locations or groups of locations is essential for the planning of budgets in specific regions. Budgeting can be employed to facilitate the advancement and enhancement of settlements in terms of transportation, landscaping, economic conditions, educational institutions, and other aspects [14].
In the field of migration flow modeling, collecting data on small towns can be challenging due to the unique geographical features and complex data administration systems in these communities. In dealing with forecasting population migration between different regions, we have faced a problem of data scarcity for settlements and territories with small populations [15].
Countries around the world differ in the number of residents required for a settlement to be considered a small one [16,17]. Many countries use a certain population threshold to define urban areas [18]. This threshold can vary significantly, ranging from 200 people (as in Denmark) to 100,000 people (as in China); other countries use thresholds of 2000, 5000, or 50,000 [19]. In our framework, we considered settlements with a population below 100,000 people. In Russia, the population of a regional center is on average larger than this number, while the population of municipal district centers and other cities is smaller. Unfortunately, as the number of search filters applied to the data grows, the volume of data available for training becomes insufficient.
At the same time, for local development planning, migration details down to the level of a small town or district are crucial. Most statistics regarding these settlements are not publicly available: only large cities and regions publish their data, and very few small towns provide the necessary information.
This paper contributes to improving data-driven migration forecasting by means of generative AI, creating a synthetic dataset when real-world data are scarce (e.g., for small towns and sparsely populated regions). For this purpose, we found that an ensemble of two models trained on two synthetic datasets predicts the migration balance on real-world data better than a single regressor: the regression model in the ensemble forecasts the migration flow, and the binary classifier specifies its direction. We trained the generative model to augment the initial small dataset for the classification and regression problems separately. Then, we used an additional generator, modified with a regression loss function, which helped to improve the synthetic dataset. Finally, we trained the forecasting models to predict the migration balance. Our research may also be useful for direct migration forecasting, as it facilitates the assessment of the features that influence the migration direction and extent.
The remainder of this paper is organized as follows: in the next section, we provide a literature review on the topic under consideration; in Section 3, we analyze the real-world data and identify key features for migration forecasting; in Section 4, we propose the methodology; in Section 5, we formulate the problem and the synthetic approach to augment the initial small cities’ data; the remaining sections are devoted to the experimental study, results, and discussion.
2. Related Work
2.1. Migration Forecasting
A significant amount of attention in demographic research is focused on the problem of migration [14,20,21]. Such interest in migration is caused by the fact that this process has a high impact on various fields [14,21]. In addition to the typical labor force displacement caused by low-paid and unattractive jobs in host regions, migration can significantly impact a region’s development in terms of “brain gain” or “brain drain” [21] and natural population change (positive or negative) [14,20].
It is worth noting that the issue under consideration is significant not only in terms of international migration but also in terms of domestic migration [22]. Migration can be used as a reflection of the social and economic conditions of certain regions. Therefore, an accurate estimation of migration can also be beneficial for the development of new economic policies. Thus, the development of methods to manage migration flows is highly relevant. The success of any approach involving this issue will depend heavily on the management of key aspects of the problem.
The problem of identifying the main set of factors that motivate people to relocate is in the spotlight in the academic community [23]. Nevertheless, despite the widespread discussions on a certain set of factors, the classification of migration factors based on the push-and-pull approach (proposed by Everett Lee) is commonly used in various studies [20,21]. According to this approach, all relevant factors can be divided into two categories: those that act as push factors, motivating individuals to leave their current location, and those that serve as pull factors, attracting individuals to move to specific destinations. Push factors include economic (unemployment, low welfare), political (lack of freedom, persecution), environmental (harsh climate, pollution), social (discrimination, undesirable lifestyle model), and other factors [20,21]. By contrast, pull factors should provide far greater opportunities in terms of education, career, and quality of life when compared to the individual’s place of origin.
It is challenging for demographic researchers to accurately determine the actual numbers of migration flows [24,25]. Many countries use their own criteria to define migrant status, e.g., based on different time ranges [24]. Moreover, in most countries, the fact of migration is officially recognized only when a person changes their so-called “registration” (each resident is registered by their home address) [25]. Since most people do not change their registration, often because they rent an apartment, there is a huge difference between the official place of residence and the actual one. This makes research aimed at finding alternative sources for determining the fact of migration highly relevant, e.g., using Google Trends analysis [24,25].
Migration balance, which is the target variable for our models, strongly depends on many social and economic factors. However, these factors have a high level of volatility, which makes it difficult to develop a robust forecasting model. If we considered the history of migration as a time series, the model could not identify any stable patterns or significant autocorrelations. That is why, in our research, we investigate the influence of factors instead of considering the migration balance as a time series. Therefore, models such as gray models [26], BP neural networks [27], or autoregressive models [28] are not suitable for this problem.
Other approaches include the Bayesian approach to forecasting populations [29]. This approach incorporates Lee–Carter-type models [30] for predicting age patterns, along with the associated measures of uncertainty, for fertility, mortality, immigration, and emigration, within a cohort projection framework. Some studies utilize random forest [31,32], XGBoost [33], and regression methods [34]. Specifically, the authors of [34] considered the challenges and appropriate methods for small-area population forecasting. However, in our research, we constructed a model to predict the migration balance, which differs from population forecasting. Due to difficulties in data collection and analysis, there is a lack of research in this area.
2.2. The Use of Synthetic Data
One potential solution to the lack of historical data is the use of synthetic data [35,36]. Synthetic data allow for the augmentation of the original dataset by adding new instances that are similar in distribution to the original one. Real-world data may not be representative of all possible situations, which can lead to biased or limited generalizations. Synthetic data can provide a wider range of scenarios, allowing the model to learn how to handle a broader variety of situations and inputs.
There are three types of synthetic data: fully synthetic (where all original data are replaced with synthetic data), partially synthetic (where some attributes are replaced with synthetic features, particularly those that are sensitive), and hybrid synthetic (where some records from the original dataset are replaced with synthetic entries or additional synthetic instances are added). In our study, we used the hybrid type.
In addition, synthetic data can be used to hide confidential information. Researchers have shown that by combining features, the so-called “quasi-identifiers”, it is possible to reconstruct the values of certain sensitive attributes [37,38]. Thus, by substituting source data with synthetic data, the overall level of data privacy in machine learning models can be enhanced [39]. Another common use of synthetic data is filtering, such as removing noise or outliers from the data [40,41].
Several studies have also highlighted approaches for using synthetic data in forecasting problems. For instance, the authors of [42] used synthetic data to predict protein function. Specifically, the WGAN method and a two-sample classifier test were employed. The researchers’ findings demonstrated a significant improvement in forecasting accuracy. Similar successes have been achieved in [43] for electric vehicles’ demand prediction and in [44] for COVID-19 epidemic forecasting. Some studies also use generation techniques in the problem of migration flows [45,46].
2.3. Generative Methods
There have been numerous studies conducted on the topic of data generation [47,48]. Some of the most prominent examples of generative models include generative adversarial networks (GANs) [49,50,51], variational autoencoders (VAEs) [52], normalizing flows [53], and diffusion models [54]. In this article, we used CTGAN [55], a state-of-the-art approach to tabular data generation. It handles non-Gaussian and multimodal continuous features as well as imbalanced discrete columns. The authors proposed a training-by-sampling methodology, where at each iteration, a categorical attribute and its corresponding value are selected; all possible combinations of indices and values of categorical features with this index are then iterated over, ensuring uniform learning across all possible columns. It is also worth noting the TVAE model [55], which employs the same methodology as CTGAN but is based on the VAE [52] principle.
Several papers dedicated to generating data for small datasets use Mega-Trend-Diffusion [49,56]. This approach is based on determining the acceptable range of values of dataset attributes for sampling purposes and then generating synthetic records. The generation process involves selecting points from the acceptable range following a uniform distribution. The acceptance of a new record is determined based on a triangular membership function, which takes the selected point as input and returns a probability value [57]. For this, a random variable must be generated with a uniform distribution in the range $[0, 1]$, and the data point is accepted if the value of the membership function at that point is greater than the generated random variable. Such approaches show good results in terms of training time and different kinds of metrics, such as pairwise correlation distance, NNDR, and DCR [49], among others.
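For illustration, the acceptance step described above can be sketched in a few lines of Python; the function names and the attribute range are our own illustrative choices, not taken from [56,57]:

```python
import numpy as np

def triangular_membership(x, low, high):
    """Symmetric triangular membership: 1 at the range center, 0 at the bounds."""
    center = (low + high) / 2.0
    half = (high - low) / 2.0
    return np.maximum(0.0, 1.0 - np.abs(x - center) / half)

def mtd_style_sample(low, high, n_samples, rng=None):
    """Accept uniformly drawn candidates with probability given by the membership."""
    rng = rng or np.random.default_rng(0)
    accepted = []
    while len(accepted) < n_samples:
        candidate = rng.uniform(low, high)
        # Accept if the membership value exceeds a fresh uniform random variable.
        if triangular_membership(candidate, low, high) > rng.uniform(0.0, 1.0):
            accepted.append(candidate)
    return np.array(accepted)

# Example: synthesize values for one attribute with an acceptable range of [10, 50].
print(mtd_style_sample(10.0, 50.0, 5))
```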
Another popular approach for generating new instances or balancing the labels in a classification problem is the synthetic minority oversampling technique (SMOTE) [58]. This simple but effective technique generates data along the segment between a neighboring point and the original observation, using a random uniform factor in the range $[0, 1]$. While SMOTE addresses the issue of information gaps in the minority class, it does not account for the distribution of the entire set of minority samples.
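The core SMOTE step can be summarized by the rule $x_{\mathrm{new}} = x + u\,(x_{\mathrm{nn}} - x)$ with $u \sim U(0,1)$; a minimal sketch follows (in practice, a ready implementation such as imblearn’s SMOTE would be used):

```python
import numpy as np

def smote_interpolate(x, x_neighbor, rng=None):
    """Generate one synthetic point on the segment between a minority sample
    and one of its nearest minority-class neighbors (the core SMOTE step)."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(0.0, 1.0)  # random gap in [0, 1]
    return x + u * (x_neighbor - x)

x = np.array([1.0, 2.0])       # original minority observation
x_nn = np.array([2.0, 3.0])    # one of its nearest minority neighbors
print(smote_interpolate(x, x_nn))
```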
3. Initial Dataset and Feature Selection
The nature of the problem requires the collection and processing of data that contain important characteristics of cities. The most complete information in this area can be obtained from the official databases of the government statistics service. It is clear that alternative sources, e.g., websites, social media posts, or city administration services, can provide some of the necessary data. In fact, collecting data from these sources should be a priority, as data from different sources make the results of the analysis more trustworthy. In most cases, however, such data have almost no alternative source, since only the authorities of a given city know the real values of key statistical indicators and pass them on to federal services.
The real-world data collection method that we use in our research involves the collection of a large volume of data from each city in the following categories: demography, living standards and consumption, and production. Each of these categories has a certain number of features (attributes). For example, the category “living standards” includes characteristics such as retail turnover, living area per capita, the number of doctors per person, etc. The category “production” includes data on the volume of annual construction, the number of apartments built, etc. Because in many studies the push factors are of a socioeconomic nature [14,20,21,23], these data can be considered a basis for migration forecasting. Moreover, the category “demography” includes an indicator of net migration, which greatly simplifies the task, because most of the data necessary for developing a predictive model are available in one collection. For a detailed description of the collected data, see Section 6.1.
Figure 1 presents the schema of the data that we generated for our experiments. Their sources are the published reports of the municipal authorities of several towns and districts in Russia. For other countries and regions, similar factors could be available in statistical reports or even in some global data collection databases like the Atlas of Urban Expansion (AUE) and the Global Municipal Database (GMD) (https://globalmunicipaldatabase-guo-un-habitat.hub.arcgis.com/, accessed on 20 August 2024).
This list of indicators, of course, is not exhaustive. It is used for research purposes. In practice, it can certainly be expanded significantly, depending on the availability of real-world data and the peculiarities of certain countries and districts.
Table 1 shows the key statistics of all attributes from the dataset for small cities. We provided an explanation of these attributes and specified the units of measurement in parentheses. Statistics are included as minimum, maximum, and median values.
However, the collection process has some drawbacks. The main issue is related to the fact that this collection represents only cities with a population of over 100 thousand. On the one hand, this means that the dataset for future processing will be smaller than expected since the majority of cities do not fit this condition. On the other hand, this limitation prompted us to look for some solutions for working with small settlements to fully achieve the goal of the study, which was to forecast migration flows in any settlement, even small ones. It should be noted that data on net migration were only available from 2010 onwards. This also reduced the total number of examples in the dataset.
As a result of data processing, 13 socioeconomic features were selected, and the total number of examples was more than 2000. One of the critical steps after selecting features is to evaluate their relative importance. Decision trees can be used to perform an initial study to evaluate the predictive ability. This approach is known to be easy to understand and fast in terms of the training process, making it well suited for such an investigation. Since the collected dataset was not a large one for machine learning (a little more than 2000 examples), it was sensible to perform a series of training cycles to obtain an average evaluation on a testing sample.
We used feature importance analysis to identify the most significant attributes, utilizing the Shapley value-based feature importance algorithm. The Shapley value (SHAP value) for a specific feature is the difference between the expected model output and the partial dependence plot at the feature’s value [59]. The sum of the SHAP values for all input features is always equal to the difference between the baseline output of the model and the actual output for the given prediction [60]. In Figure 2, the X-axis shows the average absolute SHAP values for all attributes in our dataset, and the Y-axis shows the attributes’ names; the greater the absolute value, the greater the impact this attribute has on the model’s output. We sorted all values in order of non-increasing significance of the features.
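As a minimal sketch of how such mean-|SHAP| importances can be computed with the shap library (the toy data and feature subset below are illustrative placeholders, not our dataset):

```python
import numpy as np
import shap
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
feature_names = ["avgsalary", "conscap", "invests"]      # illustrative subset
X = rng.normal(size=(200, len(feature_names)))
y = X[:, 0] * 3.0 + rng.normal(size=200)                 # toy target standing in for saldo

model = XGBRegressor(n_estimators=100).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)   # shape: (n_samples, n_features)
mean_abs = np.abs(shap_values).mean(axis=0)              # mean |SHAP| per feature
for i in np.argsort(mean_abs)[::-1]:                     # non-increasing significance
    print(feature_names[i], round(float(mean_abs[i]), 3))
```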
Figure 2 shows that the most significant features are the commissioning of residential buildings (consnewareas), the average salary (avgsalary), retail turnover, the longitude (lon), and the volume of annual construction (conscap). While the importance of the average salary is beyond any doubt, because the expected income is definitely a crucial factor in the decision to move [23], the significance of the construction features is not so clear at first glance.
The importance of the commissioning of residential buildings, as well as other features from the construction category, is directly related to the approach to defining the fact of migration. As we have already mentioned, the official change in the place of residence (the fact of migration) for any citizen occurs only when a person changes their registration. Thus, when a person buys a new house or apartment, their actual place of residence is officially recorded (due to the new registration), even though the person may have moved long before while renting an apartment. In such a system, the housing market is therefore part of the identification of the actual place of residence. The other important factor could be related to the employment of foreign workers in construction, which means that settlements with a high volume of construction will experience an increase in international migration.
When constructing a model for small cities, the most significant features are the annual volume of investment (invests), the population size (popsize), and the annual industrial capacity (factoriescap). All signs point to the intuitive understanding that it is necessary to invest money in cities in order to attract more people.
Figure 3 shows the correlational dependencies between the features. We note that retail turnover has the most significant influence on the dependent variable, migration balance (saldo), for large cities, while investments have little impact: in large cities, investments are always at a high level, and the focus shifts toward improving the city and the housing sector. For small cities, we note the negative effect of the investment amount on the migration balance. This effect can be explained by the fact that investments are measured over the past year; therefore, in reality, an increase in investment this year is followed by a decrease in the number of citizens leaving (or an increase if the balance is positive). For example, computing the correlation with the year-over-year change for several small cities confirmed this negative relationship.
The analysis shows that the relative importance of features in the processed set is quite reasonable. This provides a strong basis for conducting a series of experiments to assess the ability to forecast migration based on the collected data and selected features using a machine learning approach.
It is important to note that the dataset contains only relatively large settlements (with a population size of over 100 thousand), while small settlements are of great interest from the point of view of the urban planning platform and the overall goal of this research as well. Unfortunately, it is very difficult to collect such data, because the only sources could be the textual reports published on the official websites of the authorities of a particular settlement. In this regard, it was decided to try to forecast net migration in small settlements using the model trained on the large ones. To test this idea, several dozen small settlement examples were found and processed.
One of the critical reasons that limits the accuracy of the model is probably the small size of the training dataset. The solution to this problem can be achieved by generating synthetic data, which could improve the quality of model fitting and lead to more accurate forecasting of net migration not only for small settlements but also for settlements of any size.
4. Methodology
Our goal was to predict the migration balance based on socioeconomic indicators, which naturally leads to a regression problem. As we mentioned before, we could not carry out proper data collection from small settlements. After the preprocessing procedure, we obtained the dataset $D_r$, where the lower index relates to the regression problem. This was the initial dataset used to predict the balance of population migration in small settlements. Specifically, the migration balance (saldo) attribute in this dataset is represented as a real number.
We conducted several experiments and found that regression models could not recognize the dependencies between the regressors and the dependent variable “saldo”. Therefore, we decided to solve two tasks separately: (i) the identification of the sign of the migration balance and (ii) the determination of the amount of the migration balance. The experiment results are provided in Section 7.
The first task was to determine whether there was an outflow or an inflow. For this, we had to determine the sign of the migration balance (i.e., whether migration is inbound or outbound) by solving a classification problem. In this instance, the issue of binary classification arises, which can be addressed using well-established techniques, such as logistic regression, decision trees, or ensembles of such models.
In the next step, we trained a regression model using the same set of features but with the dependent variable being the actual balance value from $D_r$. Various options were explored in this work, including linear regression, random forest, and gradient boosting, among others.
Figure 4 shows the general pipeline of the proposed method. The blue section corresponds to the classification problem, and the red section represents the regression problem.
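A minimal sketch of the combination step, assuming pretrained clf and reg models (the label encoding 1 = inflow, 0 = outflow follows Section 5; the helper name and exact combination rule are our own illustration):

```python
import numpy as np

def combine_predictions(clf, reg, X):
    """Ensemble step: the regressor forecasts the magnitude of the migration
    balance, and the classifier fixes its sign (1 = inflow, 0 = outflow)."""
    direction = clf.predict(X)           # binary direction labels
    magnitude = np.abs(reg.predict(X))   # size of the migration flow
    sign = np.where(direction == 1, 1.0, -1.0)
    return sign * magnitude
```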
5. Problem Formulation and Synthetic Data
In a more formal way, we consider two datasets $D_r$ and $D_c$, which consist of $D_r = \{(x_i, y_i)\}_{i=1}^{N}$ and $D_c = \{(x_i, c_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is the feature vector, $y_i \in \mathbb{R}$ is the dependent variable, $c_i \in \{0, 1\}$ is the binary class, and $N$ is the number of records. The dimension of the feature space in our case is equal to $d = 13$. The dataset $D_c$ corresponds to the sign of the migration balance, which takes the form of labels in the migration balance attribute. Label 0 corresponds to an outflow of residents, while Label 1 corresponds to an inflow of residents.
The number of records in $D_r$ and $D_c$ is equal to $N$, and we assume that this number is not large enough to train a predictive model; in our case, $N$ amounts to only a few dozen records (the small settlement examples from Section 3). Here, we used the concept of synthetic data to augment the initial datasets.
In the first step, we trained the predefined generative model to cover the support of the joint distribution of $(x, c)$. After the generative model was trained, we obtained the generator, which took the noise vector $z$ as input and produced the synthetic sample $\hat{D}_c = \{(\hat{x}_j, \hat{c}_j)\}_{j=1}^{M_c}$, where $M_c \gg N$ is the number of synthetic samples for the classification problem. After that, we trained the classifier, choosing $\hat{x}_j$ from $\hat{D}_c$ for the training attributes and the last column of $\hat{D}_c$ for the labels $\hat{c}_j$, where $j = 1, \dots, M_c$.
To train the classifier, we took the partition of $\hat{D}_c$ into train and validation samples with a ratio of 8:2. After several experiments, we chose the XGB classifier for our purposes. For hyperparameter tuning, we used the Bayesian optimization framework Optuna [61]. As the objective function, we used the $F_1$ score, which was computed by cross-validation with five folds.
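A condensed sketch of this tuning loop is given below; X_train and y_train stand for the synthetic training sample, and the hyperparameter ranges are illustrative assumptions rather than the values we actually searched:

```python
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }
    model = XGBClassifier(**params)
    # F1 computed by five-fold cross-validation, as described above
    return cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()

study = optuna.create_study(direction="maximize")  # TPE-based Bayesian search by default
study.optimize(objective, n_trials=100)
best_clf = XGBClassifier(**study.best_params).fit(X_train, y_train)
```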
After we trained the classifier and selected the optimal hyperparameters, we applied the classifier to a test sample, which was a subset of the small settlement dataset. The result is the predicted labels $\tilde{c}_k$, where $k = 1, \dots, L$, and $L$ is the test sample size.
The second step involved the regression problem for $D_r$. We generated synthetic data and trained the regression model. After we received the predicted labels $\tilde{c}_k$ and the predicted values of the balance $\tilde{y}_k$, we combined the results by adjusting the regression model predictions, specifying the signs in accordance with those predicted by the classifier.
The issue we encountered is that, in contrast to the classification problem, in regression, training on synthetic data, despite showing improvement, did not yield the desired outcome. Therefore, we employed an additional approach here.
Specifically, after we generated the synthetic dataset $\hat{D}_r$ using the arbitrary generative model, we used another generative model, Wasserstein GAN (WGAN) [62], with an additional modification. The modification was to add an additional loss function to the generator, which would tell the generator how to transform synthetic samples to obtain a better score in the regression problem. For that, we first trained the neural network $f_\psi$ to predict $y$ from $x$, where $\psi$ stores the learnable parameters. The network $f_\psi$ has several fully connected layers with a ReLU activation function between them, except for the last layer.
The idea behind the GAN was to find the desired measure with the absolutely continuous density function $p_g$ in $\mathbb{R}^{D}$, where $D = d + 1$ (the feature vector together with the target), which would be close to the initial distribution of the data with the absolutely continuous density $p_{data}$. We can find this measure as a push-forward measure of a simple distribution under the transformation $G_\theta$ (the generator), i.e., as $\mu_g = G_\theta \sharp \mu_z$, where $\mu_z$ is a probability measure on the latent space and $\theta$ refers to the parameters of the neural network. The second network, called the discriminator $D_\omega$, was used to determine the quality of the synthetic data, where $\omega$ also refers to trainable parameters. Originally, the discriminator produces a probability, but WGAN [62] is a modification of the basic GAN [49] in which the discriminator produces a real number. Therefore, the authors proposed a new loss function based on the Kantorovich–Rubinstein duality [63] as follows:
$$W(p_{data}, p_g) = \sup_{\|D_\omega\|_{L} \le 1} \Big( \mathbb{E}_{u \sim p_{data}}\, D_\omega(u) - \mathbb{E}_{u \sim p_g}\, D_\omega(u) \Big),$$
where $\mathbb{E}_{u \sim p}$ means the mathematical expectation under the measure that corresponds to the density function $p$.
Finally, the discriminator has the following loss function:
$$L_D(\omega) = \mathbb{E}_{u \sim p_g}\, D_\omega(u) - \mathbb{E}_{u \sim p_{data}}\, D_\omega(u),$$
where $p_g$ is the density function that corresponds to the generator. Our modified loss function for the generator in WGAN has the following form:
$$L_G(\theta) = -\,\mathbb{E}_{z \sim \mu_z}\, D_\omega(G_\theta(z)) + \mathrm{MSE}\big( f_\psi\big(G_\theta(z)_{:d}\big),\; G_\theta(z)_{d} \big),$$
where $u_{:d}$ means the columns ranging from 0 up to column $d$ (the feature part of a generated record $u$), $u_d$ is its last column (the target), and MSE is the mean squared error. The result of our modified WGAN is the synthetic dataset $\tilde{D}_r = \{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^{M_r}$, where $\tilde{x}_j \in \mathbb{R}^d$ and $\tilde{y}_j \in \mathbb{R}$.
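A minimal PyTorch sketch of this modified generator objective is shown below, assuming a pretrained critic D, the regression network f_psi, and a generator G that outputs records of d features followed by the target; the weighting coefficient lam is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def modified_generator_loss(G, D, f_psi, z, d, lam=1.0):
    """WGAN generator loss plus a regression-consistency term: the regression
    net f_psi, trained on real data, should reproduce the generated target
    from the generated features. `lam` (term weighting) is an assumption."""
    fake = G(z)                       # shape: (batch, d + 1) = features + target
    x_fake, y_fake = fake[:, :d], fake[:, d]
    wgan_term = -D(fake).mean()       # standard WGAN generator term
    mse_term = F.mse_loss(f_psi(x_fake).squeeze(-1), y_fake)
    return wgan_term + lam * mse_term
```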
Finally, we trained the regression model on $\tilde{D}_r$ and evaluated it on the test sample, obtaining the predicted balance values $\tilde{y}_k$, $k = 1, \dots, L$. Then, we adjusted $\tilde{y}_k$ with the predicted labels $\tilde{c}_k$ and obtained the final target prediction.
6. Experimental Study
6.1. Data Description
We investigated migration forecasting in small settlements in the Russian Federation. We had a wide range of databases available for large cities with populations exceeding 100,000 people. The main socioeconomic indicators of cities were found in the database of Rosstat (https://rosstat.gov.ru, accessed on 20 August 2024), which has been published from 2004 up to now (it is published once every two years). In total, 13 socioeconomic features were selected, and the total number of examples was more than 2000. The dataset contains information for approximately 180 cities from 2010 to 2022. The migration balance target for each city was shifted forward by one year for each data sample for forecasting purposes. The analysis of this collection highlighted its high relevance to the objectives of the current study. Data on large cities provide information on various factors such as macroeconomic indicators, migration patterns, and the overall level of regional development. Based on these data, indicators that were significant for forecasting migration balances were identified. Based on the selected criteria, we manually collected data from the websites of relevant government agencies (http://radm.gtn.ru, https://new.mo-siverskoe.ru, accessed on 20 August 2024) and other sources that publish open-access information about small regions and cities (https://old.citylifeindex.ru/database?pageType=CITIES, accessed on 20 August 2024).
We also prepared both datasets, for small and for large cities, by removing outliers. Due to the lack of sufficient data, we used a mild criterion, removing only the values lying outside the interquartile range extended by a factor of three on both sides.
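A minimal pandas sketch of this filtering rule (the helper name is ours):

```python
import pandas as pd

def drop_outliers_iqr(df: pd.DataFrame, factor: float = 3.0) -> pd.DataFrame:
    """Keep rows whose numeric values lie within [Q1 - factor*IQR, Q3 + factor*IQR]."""
    numeric = df.select_dtypes("number")
    q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
    iqr = q3 - q1
    mask = ((numeric >= q1 - factor * iqr) & (numeric <= q3 + factor * iqr)).all(axis=1)
    return df[mask]
```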
6.2. Synthetic Data Quality Assessment
In this section, we present the results of the best generative model that can approximate the multivariate attribute distribution and augment the datasets with additional records for both the regression problem ($D_r$) and the classification problem ($D_c$).
We used three predefined generative models on our data, and the best one was chosen for the proposed pipeline. We trained the state-of-the-art framework for tabular data generation, CTGAN [55], and its counterpart, TVAE [55], which has the same underlying idea but is applied in the context of a variational autoencoder. The third model was Copula GAN (https://docs.sdv.dev/sdv/single-table-data/modeling/synthesizers/copulagansynthesizer, accessed on 20 August 2024). It is also a variation of the CTGAN model, which utilizes the CDF-based transformation applied by Gaussian copulas.
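As an illustration, training TVAE on a table df of real records takes only a few lines with the SDV library (a sketch based on the SDV 1.x API; df and the sample size are placeholders):

```python
from sdv.metadata import SingleTableMetadata
from sdv.single_table import TVAESynthesizer

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)               # df: real small-cities table

synthesizer = TVAESynthesizer(metadata, epochs=5000)  # 5000 epochs, as in our experiments
synthesizer.fit(df)
synthetic_df = synthesizer.sample(num_rows=10 * len(df))
```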
We estimated the synthetic data quality using several metrics, namely the correlation similarity between all attributes [64], the Jensen–Shannon divergence [65], the total variation distance for the target attribute (saldo) of the classification dataset $D_c$, and the classifier two-sample test (C2ST) [66], to identify the ability of machine learning models to find the difference between synthetic and real data. We trained all models for 5000 epochs and generated 100 synthetic datasets of length $N$ for statistical significance, and for each iteration, we computed the identified metrics. Finally, we calculated the average values, which are presented in Table 2.
The underlying principle of C2ST was to label real data with 0 and synthetic data with 1. After that, we had to train a classifier to find the difference between the two data distributions. The optimal value for the statistic in the two-sample classifier test was $1/2$, as this indicated that the synthetic and real data had a similar distribution. In other words, the classifier could not distinguish between the two datasets and generated a random variable following a Bernoulli distribution with a probability of $1/2$.
For the null distribution in C2ST, we used the normal distribution $\mathcal{N}\big(1/2,\ 1/(4n)\big)$, where $n$ is the number of test samples [66]. The null hypothesis states that the two distributions are equal, and the alternative is that they are distinct. In brackets, we provide the p-value. We chose $\alpha = 0.05$ as the significance level at which the null hypothesis is rejected, in which case the statistic value should be significantly greater than $1/2$. If the p-value is greater than the significance level, the null hypothesis cannot be rejected.
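A minimal sketch of this test, with a random forest as the distinguishing classifier (our illustrative choice; the C2ST framework of [66] allows any classifier):

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def c2st(real: np.ndarray, synthetic: np.ndarray, seed: int = 0):
    """Label real rows 0 and synthetic rows 1, train a classifier, and compare
    its held-out accuracy against the chance-level null N(1/2, 1/(4*n_test))."""
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=seed, stratify=y)
    acc = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te)
    # One-sided p-value: is the accuracy significantly above chance?
    p_value = 1.0 - norm.cdf(acc, loc=0.5, scale=np.sqrt(1.0 / (4.0 * len(y_te))))
    return acc, p_value
```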
The results show that the TVAE model is more appropriate for generating data on small cities than the others, having the smallest Jensen–Shannon divergence and the closest correlation similarity. For the categorical attribute saldo, we also note that the total variation distance is better in the case of TVAE. C2ST does not provide evidence to reject the null hypothesis of equal distributions, but it also does not allow us to accept it; nevertheless, the statistic value is closer to the optimal $1/2$ in the TVAE case.
Figure 5 shows the p-values of the Kolmogorov–Smirnov two-sample test between real and synthetic data for each column. The null hypothesis states that the two distributions are equal. If the p-value is less than the significance level, then we reject the null hypothesis and can state that one sample differs from the other; in our case, the samples are the columns. The black horizontal line indicates the significance level, and the bars indicate the p-values. We set the significance level at $\alpha = 0.05$.
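A per-column check of this kind can be sketched with SciPy, assuming real_df and synthetic_df are DataFrames with the same columns:

```python
from scipy.stats import ks_2samp

alpha = 0.05  # significance level, as in Figure 5
for col in real_df.columns:
    stat, p = ks_2samp(real_df[col], synthetic_df[col])
    verdict = "reject H0 (distributions differ)" if p < alpha else "cannot reject H0"
    print(f"{col}: p={p:.3f} -> {verdict}")
```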
In the CTGAN model, we can reject the null hypothesis about the equal distributions for several columns. This implies that several attributes in the synthetic dataset differ significantly in distribution from the attributes in the real dataset. By contrast, TVAE and Copula GAN showed much better results: TVAE differs from the real dataset only in the volume of annual construction (conscap), while for the Copula GAN, the differing columns are the longitude (lon) and the number of unemployed citizens (unemployed).
Based on the experimental study of training various generative models on our dataset, the TVAE was chosen for small cities’ data augmentation.
6.3. Migration Forecasting Assessment
Table 3 shows the results of the classification problem on different datasets of the test data. The test data were not used in the training process of the generator and were set aside for evaluating the performance of the model. We chose the $F_1$ score [67] and the ROC AUC score [68] to assess the quality of the classifier. The classifier trained on the synthetic data $\hat{D}_c$ demonstrated the best performance. Due to the increased coverage of the data distribution relative to the initial small dataset, it was possible to enhance the data and generate new samples, which improved the overall quality of the results.
Figure 6 shows the qualitative analysis of the classification problem. We plotted the receiver operating characteristic (ROC) curve, which illustrates the performance of a binary classifier model (Figure 6a). The X-axis corresponds to the false-positive rate (1 - specificity), and the Y-axis corresponds to the true-positive rate (recall). Classifiers that produce curves that are closer to the upper left corner of the graph indicate better performance. By contrast, a random classifier would be expected to produce points that lie along the diagonal.
Figure 6b shows the feature impact using the Shapley (SHAP) values. Positive Shapley values indicate a positive impact on the model’s output; in other words, the higher the value to the right of zero, the greater the magnitude of the migration balance. Conversely, a lower Shapley value corresponds to a lower forecast value. The blue color represents low values of the attribute, while the red color represents high values. A high concentration of values in a particular area suggests that there are instances in the dataset that influence the model’s predictions in a similar manner. The Y-axis corresponds to the features, arranged in accordance with feature importance by the average absolute SHAP values (mean $|\mathrm{SHAP}|$). As the model deals with both positive and negative migration balances, these values represent both the pull and push scenarios in the migration process.
For instance, we observe that low investment levels increase the predicted migration balance, whereas high levels decrease it. We observed this effect in the correlation plot (Figure 3). Additionally, we note a clear clustering of data at three specific points. We may also notice that a lower retail turnover seems attractive to migrants: since it is measured in monetary values, a higher retail turnover often means more expensive goods for the customers. However, this is not necessarily the case, as evidenced by the blue dots mixed with the red ones on the negative side of the impact plot.
Table 4 shows the results of the regression problem on the real test data. We trained four regression models, namely linear regression, XGBoost, random forest, and multilayer perceptron (MLP), on four datasets: real small cities’ data, real large cities’ data, synthetic data for small cities, and synthetic data for small cities with classifier correction. We calculated the MAE metric in terms of people, which directly indicated the number of individuals who departed from or arrived in a specific settlement. For statistical significance, we shuffled the dataset and divided it into training and test samples using a ratio of 8:2. We repeated this procedure 100 times. We also determined the standard deviation and the 95% confidence interval for the mean error. The MSE was calculated in terms of normalized values. In Table 4, we show the average results weighted by the population size. Relative metrics like the mean absolute percentage error (MAPE) did not suit our task because the value of the migration balance, determined using the number of migrants, may be close to zero, which usually distorts the evaluation. The symmetric MAPE (sMAPE) only partially solves this problem. We did not use the statistical coefficient of determination because the available real dataset was quite small, and questions may arise about the statistical significance of the obtained results.
Figure 7 shows the qualitative analysis of the regression problem after specifying the signs.
Figure 7a corresponds to a scatter plot, where the X-axis represents the true values of the migration balance, and the Y-axis represents the predicted migration balance, both measured in the number of migrants. In the ideal situation, we would observe the identity line (red color), which corresponds to the complete coincidence of the forecast with the real migration balance. We employed the ordinary least squares (OLS) method to construct a regression model, where we used the true values as the independent variable and the predicted values as the dependent variable. We also utilized the confidence interval (CI) estimation for the regression values, which has the following form:
$$\hat{y}(x) \pm t_{1-\alpha/2,\, L-2}\; s_e \sqrt{\frac{1}{L} + \frac{(x - \bar{x})^2}{\sum_{k=1}^{L} (x_k - \bar{x})^2}},$$
where $t_{1-\alpha/2,\, L-2}$ is the critical value of the Student distribution at level $\alpha$ with $L-2$ degrees of freedom, $s_e$ is the standard error of estimate, $s_e = \sqrt{\sum_{k=1}^{L} \big(y_k - \hat{y}(x_k)\big)^2 / (L-2)}$, $L$ is the test sample size, and $\hat{y}(x) = \beta_0 + \beta_1 x$ with $(\beta_0, \beta_1)$ being the parameters of the OLS regressor. In the ideal case, the identity line must be covered by the confidence interval constructed around the OLS line.
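In practice, this OLS line and its confidence band can be obtained directly with statsmodels; a minimal sketch, where y_true and y_pred stand for the arrays behind Figure 7a:

```python
import numpy as np
import statsmodels.api as sm

# y_true: real migration balance on the test sample; y_pred: model forecasts
X = sm.add_constant(y_true)            # independent variable: true values
res = sm.OLS(y_pred, X).fit()          # dependent variable: predicted values

grid = np.linspace(y_true.min(), y_true.max(), 100)
pred = res.get_prediction(sm.add_constant(grid))
ci_low, ci_high = pred.conf_int(alpha=0.05).T   # 95% CI band around the OLS line
# In the ideal case, the identity line y = x lies inside [ci_low, ci_high].
```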
Figure 7b shows the feature importance diagram, in the same form as for the classification problem. The investment attribute has the same structure as in Figure 6b but with one large cluster. A poor separation can be observed in the magnitude of the features for the annual industrial capacity, the food service turnover, and the average salary.
7. Discussion
The results of the experimental study (Section 6) show that the proposed method of synthetic data generation allows for obtaining datasets on social and economic states for model training regarding both the classification and regression problems. As can be seen in Table 2, the data generated by the TVAE method are similar to the real data, with fairly good metrics: a correlation similarity of >0.95 and a low Jensen–Shannon divergence. We used CTGAN and TVAE because they are popular, widely used, and more lightweight than diffusion models or other advanced approaches. However, to increase the diversity of synthetic data, it is possible to use more advanced models, which will most likely only increase the overall effectiveness of migration forecasting.
We used data on towns from completely different regions of Russia. These regions differ greatly in terms of socioeconomic development. This gives us reason to believe that the main ideas of the research could be successfully applied to forecasting migration flows in other places as well. The motivation for our study was the lack of data on migration balance in small settlements and the difficulties associated with obtaining them. Currently, the collected data are insufficient to correctly assess the generalizability of the developed models. As new data are collected, we plan to investigate this in future studies.
The use of an ensemble of a classifier and a regressor for migration balance forecasting can be explained by considering Figure 6b and Figure 7b. The Shapley value analysis of the features, including the number of unemployed citizens, the average salary, the food service turnover, and the volume of annual construction, indicates that a higher value of each of these features may lead to either a positive or a negative impact on the regression results, depending on the specific circumstances. The separate classification model can correct this ambiguity.
The scatter plot in Figure 7a indicates that our adjusted regression model can accurately predict the migration balance for small settlements, with only minor deviations from the identity line in several instances in the test sample.
We can also notice a significant level of heteroscedasticity both in the initial data, where a greater standard deviation of the migration balance accompanies a greater population size of the settlement, and in the forecasting models, where the prediction error grows with the migration balance. Experiments show that data normalization by the population size hardly helps with this problem, but the use of synthetic data for model training can reduce this effect, which improves the quality of the forecast. We confirmed this by training two similar gradient boosting models: one uses the real-world dataset of large cities, while the other uses the synthetic data built on it. The test data are the same set of real-world factors. The MAE for the first model is 7391 people, while for the second, it is 6790 people; thus, the quality improvement is about 8%.
A thoughtful analysis of the requirements leads to the idea of developing a universal migration forecasting model that would be useful for any type of settlement. Besides direct forecasting on real (current) data, the general idea is in the ability of such a tool to test different values for each location’s feature, which is in fact a kind of simulation that can be used with a certain degree of reliability to assess the impact of some policy on the migration flow in a certain settlement. As we have already revealed, there are two types of migration: one for large cities and agglomerations and the other for small settlements. So, the method used for building a universal model is a classification that can recognize the type of migration, followed by the application of the pretrained model or ensemble according to the predicted type.
Migration balance forecasting in small towns itself is essential for understanding the demographic and economic shifts. It helps in identifying areas of growth and helps inform policies and investments that can support the development of these communities. Moreover, to address social and economic factors, the models that we present in this research can provide tools for evaluating plans and projects for territory development, as they can show if some transformation will make the settlement attractive or repulsive for its inhabitants. Last but not least, the benefit of our method is its ability to analyze the migration balance at a small scale, which makes it useful for a wide range of local problems and challenges.
We applied an adaptive prediction model based on gradient boosting to determine the extent to which it is possible to obtain a high-quality forecast using our data. Another important aspect of this research is the hybrid approach to migration forecasting compared with a single regression model. The forecast error is already comparable to the accuracy of the source data, which is sufficient for the given task. Therefore, improved models have the potential to outperform the baseline model. Furthermore, our estimate can serve as the basis for all other models.
The advantage of the hybrid model over the single model approach is due to the fact that we are dealing with processes of a slightly different nature. We are now moving toward a hybrid approach that combines the three models. In further work, we plan to show that the same features have completely different levels of importance when it comes to predicting inflow, outflow, or even direction (classifier). The classifier determines the direction that minimizes the heterogeneity, and the regressor identifies the dependencies. This is the main reason why a hybrid model could be a better solution.
8. Conclusions
In our research, we investigated the problem of training predictive models when the amount and quality of training data on social and economic states in small settlements are insufficient. The proposed method of synthetic data generation applied to the problem of migration balance forecasting showed that the models trained on this artificial dataset performed the forecasting with higher accuracy than similar models trained on a small and incomplete set of real-world data.
The other finding of our work concerned the problem of migration balance forecasting itself. We found that predicting the sign of the balance (i.e., the direction of migration) and the number of migrants separately helps to increase the prognosis quality. Moreover, the artificial training datasets should be different for the classification problem of migration direction and the regression problem of the number of migrants, whereas the real data used for making a prognosis may be the same.
Moreover, during our research, we investigated the influence of social and economic factors on migration tendencies, which are significantly different for small settlements compared to large cities and agglomerations.
The obtained research results and developed models will be useful for value-based modeling of community development programs. They are intended for the optimal formation of a series of projects (events) for the transformation of a territory, including the placement of infrastructure facilities, changes in the factors of migration dynamics, and the functional purpose of individual territories, in order to achieve the target values of social and economic development indicators. The value of the task lies in forecasting the social risks caused by conflicts between the goals and needs of the population and the measures taken to transform the territory. When implementing the models, it is assumed that these cognitive technologies will be used to predict such social risks.