Towards Synthetic Augmentation of Training Datasets Generated by Mobility-on-Demand Service Using Deep Variational Autoencoders
Abstract
1. Introduction
2. Related Work
3. Methodology
3.1. Use Case and Data Preparation
3.2. The Proposed VAE Framework
3.3. Generating Input Variables for VAEs Decoder from Learned Latent Space
4. Results and Discussion
4.1. Evaluation of All Investigated Dimensionality Reduction Methods for Clustering Based on MoD Data
4.2. Analysis of Clustering Based on a Combined Dataset with Synthetic and Real-World MoD-Based Data Samples
4.3. Evaluation of Learning Performance Using a Combined Dataset with Synthetic and Real-World MoD-Based Data Samples
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ADAM | Adaptive Moment Estimation |
| AE | Autoencoders |
| ANN | Artificial Neural Networks |
| Con-GAE | Context Augmented Graph Autoencoder |
| Conv2D AE | Autoencoder with 2-Dimensional Convolution Layers |
| DRT | Demand Responsive Transit |
| FC AE | Autoencoder with Fully Connected Layers |
| FCD | Floating Car Data |
| GAN | Generative Adversarial Networks |
| KL | Kullback-Leibler |
| LSTM | Long Short-Term Memory |
| MoD | Mobility-on-Demand |
| MSE | Mean Squared Error |
| OD | Origin–Destination |
| PCA | Principal Component Analysis |
| t-SNE | t-distributed Stochastic Neighbour Embedding |
| tSVD | truncated Singular Value Decomposition |
| VAE | Variational Autoencoders |
| VMR-GAE | Variational Multi-modal Recurrent Graph Auto-Encoder |
References
- Zardini, G.; Lanzetti, N.; Pavone, M.; Frazzoli, E. Analysis and Control of Autonomous Mobility-on-Demand Systems. Annu. Rev. Control Robot. Auton. Syst. 2022, 5, 633–658.
- Atasoy, B.; Ikeda, T.; Song, X.; Ben-Akiva, M.E. The concept and impact analysis of a flexible mobility on demand system. Transp. Res. Part C Emerg. Technol. 2015, 56, 373–392.
- He, Y.; Csiszár, C. Quality Assessment Method for Mobility as a Service. PROMET-Traffic Transp. 2020, 32, 611–624.
- Iglesias, R.; Rossi, F.; Wang, K.; Hallac, D.; Leskovec, J.; Pavone, M. Data-Driven Model Predictive Control of Autonomous Mobility-on-Demand Systems. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 6019–6025.
- Prado-Rujas, I.I.; Serrano, E.; García-Dopico, A.; Córdoba, M.L.; Pérez, M.S. Combining heterogeneous data sources for spatio-temporal mobility demand forecasting. Inf. Fusion 2023, 91, 1–12.
- Qian, X.; Ukkusuri, S.V.; Yang, C.; Yan, F. Short-Term Demand Forecasting for on-Demand Mobility Service. IEEE Trans. Intell. Transp. Syst. 2022, 23, 1019–1029.
- Boukerche, A.; Wang, J. Machine Learning-based traffic prediction models for Intelligent Transportation Systems. Comput. Netw. 2020, 181, 107530.
- Gregurić, M.; Vrbanić, F.; Ivanjko, E. Towards the spatial analysis of motorway safety in the connected environment by using explainable deep learning. Knowl.-Based Syst. 2023, 269, 110523.
- Sáez Trigueros, D.; Meng, L.; Hartnett, M. Generating photo-realistic training data to improve face recognition accuracy. Neural Netw. 2021, 134, 86–94.
- Besnier, V.; Jain, H.; Bursuc, A.; Cord, M.; Pérez, P. This Dataset Does Not Exist: Training Models from Generated Images. In Proceedings of the ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1–5.
- Huang, D.; Song, X.; Fan, Z.; Jiang, R.; Shibasaki, R.; Zhang, Y.; Wang, H.; Kato, Y. A Variational Autoencoder Based Generative Model of Urban Human Mobility. In Proceedings of the 2019 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 28–30 March 2019; pp. 425–430.
- Boquet, G.; Morell, A.; Serrano, J.; Vicario, J.L. A variational autoencoder solution for road traffic forecasting systems: Missing data imputation, dimension reduction, model selection and anomaly detection. Transp. Res. Part C Emerg. Technol. 2020, 115, 102622.
- Chiesa, S.; Taraglio, S. Traffic Request Generation through a Variational Auto Encoder Approach. Computers 2022, 11, 71.
- Zhou, Q.; Lu, X.; Gu, J.; Zheng, Z.; Jin, B.; Zhou, J. Explainable Origin-Destination Crowd Flow Interpolation via Variational Multi-Modal Recurrent Graph Auto-Encoder. Proc. AAAI Conf. Artif. Intell. 2024, 38, 9422–9430.
- Hu, Y.; Qu, A.; Work, D. Detecting Extreme Traffic Events Via a Context Augmented Graph Autoencoder. ACM Trans. Intell. Syst. Technol. 2022, 13, 1–23.
- Islam, Z.; Abdel-Aty, M.; Cai, Q.; Yuan, J. Crash data augmentation using variational autoencoder. Accid. Anal. Prev. 2021, 151, 105950.
- Yu, H.; Chen, X.; Li, Z.; Zhang, G.; Liu, P.; Yang, J.; Yang, Y. Taxi-Based Mobility Demand Formulation and Prediction Using Conditional Generative Adversarial Network-Driven Learning Approaches. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3888–3899.
- Naji, H.A.H.; Xue, Q.; Zhu, H.; Li, T. Forecasting Taxi Demands Using Generative Adversarial Networks with Multi-Source Data. Appl. Sci. 2021, 11, 9675.
- Razghandi, M.; Zhou, H.; Erol-Kantarci, M.; Turgut, D. Variational Autoencoder Generative Adversarial Network for Synthetic Data Generation in Smart Home. In Proceedings of the ICC 2022 - IEEE International Conference on Communications, Seoul, Republic of Korea, 16–20 May 2022; pp. 4781–4786.
- Hochmair, H.H. Spatiotemporal Pattern Analysis of Taxi Trips in New York City. Transp. Res. Rec. 2016, 2542, 45–56.
- Singh, D.; Singh, B. Feature wise normalization: An effective way of normalizing data. Pattern Recognit. 2022, 122, 108307.
- Wei, R.; Garcia, C.; El-Sayed, A.; Peterson, V.; Mahmood, A. Variations in Variational Autoencoders—A Comparative Evaluation. IEEE Access 2020, 8, 153651–153670.
- Tong, Q.; Liang, G.; Bi, J. Calibrating the adaptive learning rate to improve convergence of ADAM. Neurocomputing 2022, 481, 333–356.
- Tasoulis, S.; Pavlidis, N.G.; Roos, T. Nonlinear dimensionality reduction for clustering. Pattern Recognit. 2020, 107, 107508.
- Wang, X.; Xu, Y. An improved index for clustering validation based on Silhouette index and Calinski-Harabasz index. IOP Conf. Ser. Mater. Sci. Eng. 2019, 569, 052024.
- Mughnyanti, M.; Efendi, S.; Zarlis, M. Analysis of determining centroid clustering x-means algorithm with Davies-Bouldin index evaluation. IOP Conf. Ser. Mater. Sci. Eng. 2020, 725, 012128.
| Hyperparameter | Batch Size | Learning Rate | Regularization | Learning Algorithm | Epochs | Learning/Validation Dataset Split |
|---|---|---|---|---|---|---|
| Value | 128 | 0.001 | Batch Normalization | ADAM | 50 | 80:20 |
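The optimizer row above can be made concrete with a minimal NumPy sketch of a single ADAM update step, using the learning rate from the table; the quadratic toy objective is purely illustrative and is not the paper's VAE reconstruction/KL loss.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update: bias-corrected first and second gradient moments."""
    m = b1 * m + (1 - b1) * grad         # first moment (running mean of grads)
    v = b2 * v + (1 - b2) * grad ** 2    # second moment (running mean of squared grads)
    m_hat = m / (1 - b1 ** t)            # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective: minimise ||theta||^2, so the gradient is 2 * theta
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

With the table's learning rate of 0.001, the effective step size stays near `lr` while the gradient sign is consistent, so the toy parameters drift to the minimum over a few thousand steps.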
| Model | RMSE | MAE | MAPE |
|---|---|---|---|
| FC AE | 0.00138 | 0.0208 | 27.200 |
| Conv2D AE | 0.00134 | 0.0203 | 26.283 |
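The reconstruction metrics reported above follow their standard definitions; a small NumPy sketch (the `eps` guard against zero-valued targets in the MAPE term is our assumption, not something stated in the paper):

```python
import numpy as np

def reconstruction_metrics(y_true, y_pred, eps=1e-8):
    """RMSE, MAE and MAPE between original and reconstructed samples."""
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    # MAPE in percent; eps avoids division by zero for zero-demand entries
    mape = 100.0 * np.mean(np.abs(err) / (np.abs(y_true) + eps))
    return rmse, mae, mape
```

Note that RMSE and MAE share the data's units while MAPE is scale-free, which is why the three columns are not directly comparable in magnitude.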
**Silhouette**

| Day in Week | t-SNE | tSVD | PCA | FC AE | Conv2D AE | FC VAE | Conv2D VAE |
|---|---|---|---|---|---|---|---|
| Monday | −0.027 | 0.032 | 0.023 | 0.044 | −0.073 | 0.013 | 0.037 |
| Tuesday | 0.017 | 0.052 | 0.053 | 0.094 | −0.044 | 0.051 | 0.059 |
| Wednesday | 0.017 | 0.022 | 0.031 | 0.071 | −0.046 | 0.041 | 0.053 |
| Thursday | −0.029 | −0.022 | −0.026 | 0.038 | −0.108 | 0.005 | 0.009 |
| Friday | −0.029 | 0.022 | −0.008 | 0.051 | −0.066 | 0.043 | 0.030 |
| Saturday | −0.028 | 0.036 | 0.015 | 0.019 | −0.062 | 0.056 | 0.020 |
| Sunday | −0.037 | −0.021 | −0.049 | −0.005 | −0.089 | −0.005 | −0.012 |

**Calinski-Harabasz**

| Day in Week | t-SNE | tSVD | PCA | FC AE | Conv2D AE | FC VAE | Conv2D VAE |
|---|---|---|---|---|---|---|---|
| Monday | 164.311 | 79.789 | 98.919 | 166.538 | 50.631 | 100.631 | 116.385 |
| Tuesday | 227.420 | 126.133 | 154.008 | 226.726 | 135.298 | 169.694 | 189.848 |
| Wednesday | 214.581 | 108.218 | 126.760 | 187.213 | 167.230 | 144.026 | 148.707 |
| Thursday | 175.834 | 89.941 | 97.654 | 157.528 | 103.168 | 116.444 | 105.123 |
| Friday | 196.488 | 107.494 | 120.546 | 168.360 | 122.636 | 135.600 | 167.400 |
| Saturday | 92.709 | 122.497 | 118.645 | 163.305 | 57.210 | 151.381 | 118.362 |
| Sunday | 55.707 | 56.387 | 65.554 | 101.866 | 25.980 | 77.698 | 85.096 |

**Davies-Bouldin**

| Day in Week | t-SNE | tSVD | PCA | FC AE | Conv2D AE | FC VAE | Conv2D VAE |
|---|---|---|---|---|---|---|---|
| Monday | 5.404 | 2.713 | 3.151 | 2.378 | 3.300 | 3.138 | 4.100 |
| Tuesday | 7.961 | 2.815 | 2.437 | 2.499 | 17.417 | 2.863 | 4.011 |
| Wednesday | 5.699 | 2.932 | 2.851 | 2.822 | 10.432 | 3.375 | 4.177 |
| Thursday | 9.839 | 4.014 | 3.604 | 3.417 | 8.149 | 3.662 | 4.151 |
| Friday | 7.101 | 2.734 | 2.713 | 2.389 | 11.209 | 2.280 | 2.892 |
| Saturday | 7.352 | 3.296 | 4.554 | 4.333 | 7.324 | 3.942 | 11.073 |
| Sunday | 12.706 | 4.682 | 5.133 | 3.821 | 10.933 | 5.448 | 5.931 |
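The three cluster-validity indices used in these tables are available in scikit-learn; a minimal sketch on synthetic two-dimensional embeddings (the blob data below is illustrative only, not the paper's MoD latent space):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
# Stand-in for a 2-D latent embedding of daily demand profiles: three blobs
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (0, 4, 8)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # in [-1, 1], higher is better
ch = calinski_harabasz_score(X, labels)  # > 0, higher is better
db = davies_bouldin_score(X, labels)     # >= 0, lower is better
```

The opposing directions of the indices explain the tables: a method can score best on Silhouette and Calinski-Harabasz (higher is better) while the same ranking appears inverted under Davies-Bouldin (lower is better).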
**Shallow FC Layers Based ANN Model**

| Layer Number | Type of Layer | Activation Function | Nodes | Dropout Rate |
|---|---|---|---|---|
| 1 | Dense (FC) | ReLU | 256 | |
| 2 | Dropout | | | 0.3 |
| 3 | Dense (FC) | ReLU | 64 | |
| 4 | Dropout | | | 0.3 |
| 5 | Dense (FC) | ReLU | 32 | |
| 6 | Dropout | | | 0.3 |
| 7 | Dense (FC) | Softmax | 3 | |
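At inference time the architecture above reduces to three ReLU dense layers followed by a softmax output, with the dropout layers acting as identity. A NumPy sketch with a hypothetical 100-dimensional input (the paper's actual feature width is not listed here) and random, untrained weights:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # numerically stabilised
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Layer widths from the table: input -> 256 -> 64 -> 32 -> 3
sizes = [100, 256, 64, 32, 3]  # 100 is an assumed input width
weights = [rng.normal(0.0, 0.05, (i, o)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]

def forward(x):
    """Inference pass; the 0.3-rate dropout layers are identity at test time."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ w + b)
    return softmax(x @ weights[-1] + biases[-1])

probs = forward(rng.normal(size=(4, 100)))  # class probabilities per sample
```

The 3-node softmax output matches the three demand clusters (morning, afternoon, evening) used throughout the evaluation tables.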
**2-D Convolution Based ANN Model**

| Layer Number | Type of Layer | Filter Number | Kernel Size | Strides | Activation Function | Nodes |
|---|---|---|---|---|---|---|
| 1 | Conv2D | 128 | (12,12) | (2,2) | ReLU | |
| 2 | Batch Normalization | | | | | |
| 3 | MaxPooling | | (5,5) | (1,1) | | |
| 4 | Conv2D | 64 | (8,8) | (1,1) | ReLU | |
| 5 | Batch Normalization | | | | | |
| 6 | MaxPooling | | (3,3) | (1,1) | | |
| 7 | Conv2D | 32 | (3,3) | (2,2) | ReLU | |
| 8 | Batch Normalization | | | | | |
| 9 | MaxPooling | | (3,3) | (1,1) | | |
| 10 | Flatten | | | | | |
| 11 | Dense (FC) | | | | ReLU | 32 |
| 12 | Dense (FC) | | | | Softmax | 3 |
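Under a 'valid'-padding assumption (the padding mode is not stated in the table), the spatial size after each convolution or pooling layer follows floor((n − k)/s) + 1. A short sketch tracing these shapes for a hypothetical 64×64 input grid (the paper's actual OD-matrix resolution is not shown here):

```python
import math

def conv_out(size, kernel, stride):
    """Spatial output size for 'valid' padding (an assumption here)."""
    return math.floor((size - kernel) / stride) + 1

# (kernel, stride) for each spatial layer in the table above
layers = [
    (12, 2),  # Conv2D, 128 filters
    (5, 1),   # MaxPooling
    (8, 1),   # Conv2D, 64 filters
    (3, 1),   # MaxPooling
    (3, 2),   # Conv2D, 32 filters
    (3, 1),   # MaxPooling
]

size = 64  # assumed square input; not taken from the paper
trace = [size]
for k, s in layers:
    size = conv_out(size, k, s)
    trace.append(size)

flattened = size * size * 32  # channel count after the last Conv2D is 32
```

For this assumed input the spatial trace is 64 → 27 → 23 → 16 → 14 → 6 → 4, so the Flatten layer would feed 4 × 4 × 32 = 512 values into the final dense layers.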
| Training Dataset Configuration | RMSE | MAE | Recall | Precision | F1 Score |
|---|---|---|---|---|---|
| Initial training dataset | 0.142 | 0.027 | 0.962 | 0.963 | 0.963 |
| Augmented by VAE decoder: +12% synthetic samples for each cluster | 0.151 | 0.026 | 0.962 | 0.962 | 0.962 |
| Augmented by VAE decoder: +11% synthetic samples for each cluster | 0.145 | 0.023 | 0.964 | 0.964 | 0.964 |
| Augmented by VAE decoder: +10% synthetic samples for each cluster | 0.136 | 0.022 | 0.968 | 0.968 | 0.967 |
| Augmented by VAE decoder: +9% synthetic samples for each cluster | 0.135 | 0.024 | 0.968 | 0.968 | 0.967 |
| Augmented by VAE decoder: +8% synthetic samples for each cluster | 0.132 | 0.020 | 0.970 | 0.970 | 0.970 |
| Augmented by VAE decoder: +7% synthetic samples for each cluster | 0.137 | 0.021 | 0.968 | 0.961 | 0.968 |
| Augmented by VAE decoder: +5% synthetic samples for each cluster | 0.133 | 0.023 | 0.967 | 0.967 | 0.967 |
| Augmented by VAE decoder: +3% synthetic samples for each cluster | 0.135 | 0.022 | 0.970 | 0.970 | 0.970 |
| Augmented by VAE decoder: +10% synthetic samples only for morning cluster | 0.139 | 0.024 | 0.965 | 0.966 | 0.966 |
| Augmented by VAE decoder: +10% synthetic samples only for afternoon cluster | 0.134 | 0.022 | 0.970 | 0.970 | 0.970 |
| Augmented by VAE decoder: +10% synthetic samples only for evening cluster | 0.148 | 0.031 | 0.962 | 0.963 | 0.963 |
| Training Dataset Configuration | RMSE | MAE | Recall | Precision | F1 Score |
|---|---|---|---|---|---|
| Initial training dataset | 0.179 | 0.042 | 0.939 | 0.941 | 0.943 |
| Augmented by VAE decoder: +12% synthetic samples for each cluster | 0.339 | 0.168 | 0.789 | 0.776 | 0.783 |
| Augmented by VAE decoder: +11% synthetic samples for each cluster | 0.309 | 0.105 | 0.845 | 0.845 | 0.845 |
| Augmented by VAE decoder: +10% synthetic samples for each cluster | 0.162 | 0.028 | 0.957 | 0.608 | 0.958 |
| Augmented by VAE decoder: +9% synthetic samples for each cluster | 0.147 | 0.026 | 0.962 | 0.962 | 0.962 |
| Augmented by VAE decoder: +8% synthetic samples for each cluster | 0.148 | 0.024 | 0.964 | 0.964 | 0.964 |
| Augmented by VAE decoder: +7% synthetic samples for each cluster | 0.156 | 0.026 | 0.960 | 0.960 | 0.960 |
| Augmented by VAE decoder: +5% synthetic samples for each cluster | 0.152 | 0.025 | 0.961 | 0.961 | 0.961 |
| Augmented by VAE decoder: +3% synthetic samples for each cluster | 0.149 | 0.024 | 0.964 | 0.964 | 0.964 |
| Augmented by VAE decoder: +10% synthetic samples only for morning cluster | 0.161 | 0.030 | 0.958 | 0.958 | 0.958 |
| Augmented by VAE decoder: +10% synthetic samples only for afternoon cluster | 0.163 | 0.029 | 0.957 | 0.957 | 0.957 |
| Augmented by VAE decoder: +10% synthetic samples only for evening cluster | 0.202 | 0.054 | 0.919 | 0.926 | 0.922 |
| Training Dataset Configuration | RMSE | MAE | Recall | Precision | F1 Score |
|---|---|---|---|---|---|
| Initial training dataset | 0.179 | 0.042 | 0.939 | 0.941 | 0.943 |
| Augmented by VAE decoder: +12% synthetic samples for each cluster | 0.327 | 0.116 | 0.826 | 0.827 | 0.826 |
| Augmented by VAE decoder: +11% synthetic samples for each cluster | 0.288 | 0.060 | 0.908 | 0.913 | 0.910 |
| Augmented by VAE decoder: +10% synthetic samples for each cluster | 0.242 | 0.070 | 0.901 | 0.904 | 0.902 |
| Augmented by VAE decoder: +9% synthetic samples for each cluster | 0.431 | 0.201 | 0.918 | 0.925 | 0.922 |
| Augmented by VAE decoder: +8% synthetic samples for each cluster | 0.328 | 0.130 | 0.796 | 0.799 | 0.797 |
| Augmented by VAE decoder: +7% synthetic samples for each cluster | 0.271 | 0.127 | 0.948 | 0.947 | 0.947 |
| Augmented by VAE decoder: +5% synthetic samples for each cluster | 0.157 | 0.027 | 0.957 | 0.951 | 0.959 |
| Augmented by VAE decoder: +3% synthetic samples for each cluster | 0.157 | 0.024 | 0.961 | 0.965 | 0.963 |
| Augmented by VAE decoder: +10% synthetic samples only for morning cluster | 0.196 | 0.049 | 0.931 | 0.932 | 0.932 |
| Augmented by VAE decoder: +10% synthetic samples only for afternoon cluster | 0.234 | 0.064 | 0.905 | 0.909 | 0.907 |
| Augmented by VAE decoder: +10% synthetic samples only for evening cluster | 0.441 | 0.208 | 0.686 | 0.688 | 0.687 |
**Prediction Accuracy (%)**

| Training Dataset Configuration | Shallow ANN Model with FC Layers (synthetic samples from 2D-Convolution VAE) | 2D Convolutional ANN Model (synthetic samples from 2D-Convolution VAE) | 2D Convolutional ANN Model (synthetic samples from 2D-Convolution GAN) |
|---|---|---|---|
| Initial training dataset | 81 | 83 | 83 |
| +12% synthetic samples for each cluster | 78 | 98 | 85 |
| +11% synthetic samples for each cluster | 78 | 98 | 84 |
| +10% synthetic samples for each cluster | 77 | 82 | 84 |
| +9% synthetic samples for each cluster | 78 | 91 | 83 |
| +8% synthetic samples for each cluster | 78 | 95 | 80 |
| +7% synthetic samples for each cluster | 79 | 94 | 84 |
| +5% synthetic samples for each cluster | 79 | 93 | 85 |
| +3% synthetic samples for each cluster | 78 | 97 | 85 |
| +10% synthetic samples only for morning cluster | 79 | 91 | 83 |
| +10% synthetic samples only for afternoon cluster | 79 | 97 | 84 |
| +10% synthetic samples only for evening cluster | 80 | 97 | 84 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Gregurić, M.; Vrbanić, F.; Ivanjko, E. Towards Synthetic Augmentation of Training Datasets Generated by Mobility-on-Demand Service Using Deep Variational Autoencoders. Appl. Sci. 2025, 15, 4708. https://doi.org/10.3390/app15094708