Principal Component Random Forest for Passenger Demand Forecasting in Cooperative, Connected, and Automated Mobility

Spanos, Georgios; Lalas, Antonios; Votis, Konstantinos; Tzovaras, Dimitrios

doi:10.3390/su17062632

by Georgios Spanos *, Antonios Lalas, Konstantinos Votis and Dimitrios Tzovaras

Information Technologies Institute, Centre for Research and Technology Hellas, 57001 Thessaloniki, Greece

* Author to whom correspondence should be addressed.

Sustainability 2025, 17(6), 2632; https://doi.org/10.3390/su17062632

Submission received: 13 February 2025 / Revised: 9 March 2025 / Accepted: 13 March 2025 / Published: 17 March 2025

(This article belongs to the Special Issue Cooperative, Connected, and Automated Mobility (CCAM) toward Sustainability)

Abstract

Cooperative, Connected, and Automated Mobility (CCAM) is set to play a key role in the future of transportation and to contribute to the achievement of sustainable development goals. Artificial Intelligence (AI), a transformative technology with applications across various industries, can significantly enhance CCAM operations. Passenger demand forecasting, a critical aspect of mobility research, will become even more essential as CCAM adoption grows in the coming years. To address passenger demand forecasting in CCAM, the present study proposes the Principal Component Random Forest (PCRF) methodology, which combines a well-established statistical technique, Principal Component Analysis, with a flagship traditional machine learning technique, Random Forest. The application of PCRF to four European pilot sites within the European Union-funded SHOW project demonstrated its accuracy and effectiveness, as reflected by an average normalized error of approximately 15%.

Keywords: machine learning; principal component analysis; random forest; passenger demand forecasting; urban mobility; automated mobility

1. Introduction

Cooperative, Connected, and Automated Mobility (CCAM) is a groundbreaking technology [1] in contemporary society and a promising solution for advancing the Sustainable Development Goals (SDGs) [2]. In particular, SDG 11 (Sustainable Cities and Communities) and SDG 13 (Climate Action) could greatly benefit from this transformative mode of transportation [3] for people and goods, as Automated Vehicles (AVs) primarily rely on electricity [4]. Even more significantly, integrating automated mobility with public urban transportation could contribute to a greener environment by reducing reliance on private cars, thereby decreasing traffic congestion and its associated environmental impact [5].

Artificial Intelligence (AI) and Big Data (BD) are revolutionizing the modern business environment [6], demonstrating their value across various sectors, including healthcare [7], agriculture [8], energy [9], tourism [10] and Industry 5.0 [11]. More specifically, a recent literature review [12] highlights the potential of AI in the CCAM sector. Research efforts leveraging AI and BD in CCAM [13] are accordingly increasing, enhancing the benefits for commuters by providing valuable solutions to common urban transportation challenges, such as Estimated Time of Arrival prediction [14] and Accident Detection [15].

Passenger demand forecasting is one of the primary concerns in transportation research [16], both for scheduled and unscheduled transportation. Indeed, an accurate prediction of passenger demand can facilitate optimal resource planning and allocation [17] across various transportation modes, including airplanes [18], ships [19], buses [20], and taxis [21]. This, in turn, can lead to economic benefits for transport operators [22], whether public or private. Passenger demand forecasting, like all demand forecasting problems, is a typical time series problem, requiring the use of appropriate algorithms designed for time series analysis and prediction [23].

Given the above, a viable solution is needed for accurately forecasting passenger demand in CCAM, one that leverages advanced technological tools such as machine learning (ML) and robust statistical analysis. The research objective of the current work is therefore to propose a methodology called Principal Component Random Forest (PCRF), which combines a statistical and an ML technique, namely Principal Component Analysis (PCA) and Random Forest (RF), to predict daily passenger demand for automated vehicles operating in both scheduled and non-scheduled transportation services, using real data from four different pilot sites across Europe of the SHOW project (https://show-project.eu/, accessed on 12 February 2025).

The rest of the article is organized as follows. Section 2 describes the related work of the research field, while Section 3 presents the proposed methodology, PCRF, for passenger demand forecasting. Section 4 discusses the experimental results of applying PCRF to five real-life datasets and, finally, Section 5 summarizes the key conclusions of this research.

2. Related Work

As mentioned in the Introduction, passenger demand forecasting in urban transportation plays a prominent role for transport operators. Indeed, there are various research studies in the literature that have attempted to address this challenge efficiently. A recent literature review emphasizes the importance of incorporating spatial and temporal characteristics to enhance the accuracy of passenger demand prediction [24]. According to the same systematic review, related works in this field primarily fall into two main categories: statistical-based methodologies and modern AI-based approaches. The following subsections present some representative studies from both categories.

2.1. Statistical-Based Methodologies

Starting with the statistical-based category, Xue et al. [25] proposed a methodology utilizing time-series models and the interactive multiple model (IMM) algorithm to predict short-term passenger demand for a specific bus route in Shenzhen, China. They constructed time series at three different time levels (15-minute, daily, and weekly) and analyzed the heteroscedasticity of the data to enhance model performance. Additionally, they utilized the IMM algorithm to combine individual forecasting models with dynamic passenger demand forecasting for the next time step. Their results demonstrated the accuracy of the proposed approach when applied to a four-month dataset, achieving a Mean Absolute Percentage Error (MAPE) below 10%.

Tang et al. [26] suggested a statistical approach based on seasonal decomposition to predict short-term subway ridership of the Chongqing Rail Transit in China. The main contributions of the authors were treating a combined forecasting problem as an optimization problem and proposing the In-Sample and Out-of-Sample algorithms. For their individual forecasting models, they conducted experiments using the Auto Regressive Integrated Moving Average (ARIMA) model. Their approach achieved high accuracy, with an MAPE below 8%.

Tao et al. [27] examined the impact of weather on passenger demand by incorporating weather-related variables into a statistical model, namely Seasonal ARIMA with exogenous variables. They used multiple time series models to capture both the lagged and concurrent effects of weather on bus ridership. Using a dataset of hourly bus rides in Brisbane, Australia, their study demonstrated a strong correlation between real-time weather variables, such as temperature and wind speed, and passenger demand.

2.2. AI-Based Methodologies

Continuing with the more advanced and sophisticated AI-based approaches, Hao et al. [28] employed a deep learning model that combines sequence-to-sequence learning with an attention mechanism for short-term passenger demand prediction in metro stations. Their study used data from two major Singapore metro stations—Raffles Place Mass Rapid Transit (MRT) and Clarke Quay MRT. The proposed methodology proved highly effective, achieving significantly lower prediction errors compared to traditional ML benchmarks.

To predict the hourly passenger load in the metro of Taipei, Taiwan, Liu et al. [29] used a deep Long Short-Term Memory (LSTM) neural network (NN). To enhance the accuracy of their deep LSTM NN model, they incorporated weather variables such as temperature and wind speed. The results indicated that the proposed methodology is very accurate, achieving an MAPE of approximately 5%, although the inclusion of weather variables did not lead to a significant improvement in prediction accuracy.

Liu et al. [30] integrated decision trees for feature engineering with a deep learning network to forecast passenger demand for public transport buses in Nanjing, China. In particular, they modeled and predicted passenger demand using decision trees, and the results of this analysis were fed into the next step, which involved their convolutional NN. Their results demonstrated superior performance compared to other NN-based approaches.

2.3. Research Gap and Novelty of the Proposed Work

The preceding discussion reveals a research gap regarding passenger demand forecasting in CCAM. This study aims to fill this gap and differs from related works in that it represents the first attempt to predict daily passenger demand in CCAM, a unique mode of transportation with distinct characteristics compared to conventional transport systems, using five real-life datasets. Moreover, to the best of our knowledge, the proposed PCRF methodology, an iterative approach combining PCA with RF, is applied here for the first time to a demand forecasting problem, and more specifically to passenger demand forecasting in CCAM for urban transportation.

3. Methodology

3.1. Overview

As mentioned in the Introduction, a new methodology is proposed in the present study to predict the daily passenger demand in CCAM. The proposed PCRF methodology adopts an iterative approach and is based on statistical and ML methodologies. This combination is adopted because, according to the literature, it very frequently leads to better forecasting accuracy [31,32,33,34]. In particular, the proposed methodology leverages (i) the statistical methodology of PCA, which reduces the dimensionality of the variables [35,36] and thus the variance [37], together with (ii) the flagship ML methodology of Random Forest, which has proved highly accurate in many classification and regression problems across various disciplines [38,39].

3.2. Algorithmic Procedure

PCA [40,41] transforms high-dimensional data into a smaller set of uncorrelated variables called principal components. It works by identifying directions of maximum variance in the data and projecting it onto these new axes, preserving as much information as possible while reducing complexity. Random Forest [42,43] is a powerful ensemble learning algorithm that builds multiple decision trees and combines their outputs to improve accuracy and reduce overfitting. It works by training each tree on a random subset of the data and averaging their predictions (for regression) or using majority voting (for classification). The overall methodology, which is based on the two aforementioned methodologies, is presented in detail in the following paragraphs.

The first step of PCRF is the input of historical data on daily passenger demand. A very simple approach is followed in this study regarding the input data: the proposed algorithm uses as input only the previous historical values for a specific route or a specific region. This choice is very common in the literature for similar time series problems, as it keeps the methodology simple while yielding very accurate results [44,45,46].

The next step of PCRF is data preprocessing: the historical demand from the first step, which constitutes a time series, is transformed [47] into a form suitable for the chosen ML algorithm, Random Forest. Specifically, the historical demand is rearranged so that each of the previous values (daily passenger demand in this case) becomes a feature for the algorithm. In this way, training respects the chronological order (i.e., future values are never used to predict past values) and avoids the bias that would otherwise result [39,47].
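The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `make_lag_features` and the sample values are hypothetical, and the sketch assumes a plain list of daily counts with no missing days.

```python
from typing import List, Tuple


def make_lag_features(demand: List[float], n_lags: int) -> Tuple[List[List[float]], List[float]]:
    """Turn a daily demand series into (features, target) pairs.

    Row i holds the n_lags consecutive values demand[i .. i+n_lags-1];
    its target is the value of the following day, demand[i+n_lags].
    No value later than the target day ever appears among its features.
    """
    if n_lags < 1 or n_lags >= len(demand):
        raise ValueError("n_lags must be in [1, len(demand) - 1]")
    X, y = [], []
    for i in range(len(demand) - n_lags):
        X.append(list(demand[i:i + n_lags]))
        y.append(demand[i + n_lags])
    return X, y


# Hypothetical daily passenger counts (illustrative, not from the paper).
series = [12, 15, 11, 18, 20, 17, 22]
X, y = make_lag_features(series, n_lags=3)
```

With seven observations and three lags, this yields four rows; the first row is `[12, 15, 11]` with target `18`.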

After the data preprocessing described above, the third step of the methodology is the training of the Random Forest algorithm on the corresponding training set. It is worth noting that each dataset is split into three parts: the training set, on which the algorithm is trained iteratively; the validation set, on which the algorithm is fine-tuned; and the test set, which constitutes the last part. This three-way split follows the best practices of the literature [48]. The following paragraphs describe the fine-tuning of the methodology on the validation set.

The fourth step of the proposed methodology is the selection of the optimal number of historical passenger demand values. As mentioned before, this selection is made on the validation set of each dataset. The maximum number of historical passenger demand values depends on the dataset size. Therefore, because the five datasets differ in size, as presented in the Results section, a different range of historical values is considered for each dataset in this step. The optimal number of previous days is selected using the Root Mean Square Error (RMSE) evaluation metric, in line with the respective literature [49].

After the selection of the optimal number of previous days, the fifth step of the methodology applies PCA in order to reduce the number of features and create new uncorrelated variables [40]. The dataset on which PCA is performed has the historical values defined in the previous step as columns, and each row corresponds to a specific date. As before, the number of components is selected on the validation set using the RMSE evaluation criterion [49].
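As an illustration of this step, the sketch below applies PCA to a lag matrix using NumPy's singular value decomposition. It is not the authors' implementation: the helper names `pca_fit` and `pca_transform` and the synthetic data are hypothetical, and fitting the projection on the training rows only is an assumption made here to keep validation data out of the fit.

```python
import numpy as np


def pca_fit(X: np.ndarray, n_components: int):
    """Return the column means and the top principal directions of X."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def pca_transform(X: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project the rows of X onto the fitted principal directions."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
# Hypothetical lag matrix: 40 dates (rows) x 5 lagged demand values (columns).
lags = rng.poisson(lam=20, size=(40, 5)).astype(float)
mean, components = pca_fit(lags[:30], n_components=2)  # fit on training rows only (assumption)
scores_train = pca_transform(lags[:30], mean, components)
scores_valid = pca_transform(lags[30:], mean, components)
```

The resulting training scores are uncorrelated by construction, which is the property the text refers to; the number of retained components (two here) would in practice be chosen by validation RMSE.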

Next, the proposed algorithm checks whether the values transformed by PCA achieve better performance than the original dataset, and the procedure continues with either the original dataset or the PCA-transformed (reduced) one accordingly. Because this decision is essential, it is made, like all the fine-tuning steps, on the validation set using the RMSE evaluation criterion.

The final fine-tuning step is the tuning of the Random Forest hyperparameters (number of trees and number of features per split), considering, among others, the values suggested by Breiman [42] for these hyperparameters. Once again, the tuning is performed on the validation set, with RMSE as the criterion.

Finally, considering the small size of the five available datasets (more details regarding the dataset size are provided in the Results section), the simple and well-established approach of next-step forecasting for time series problems [44,50] is applied. Mathematically, the derived model predicts the passenger demand for the next day, $\mathrm{demand}_{t+1}$, solely based on the historical values selected during the optimal historical days selection process. This can be expressed as:

$$\mathrm{demand}_{t+1} = f(\mathrm{demand}_{t-n+1}, \dots, \mathrm{demand}_{t})$$

where n represents the number of optimal historical days determined by the algorithm. The overall procedure of the proposed PCRF methodology is depicted in Figure 1.
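The fine-tuning sequence described in this subsection (number of historical days, number of principal components, PCA versus original features, Random Forest hyperparameters, each chosen by validation RMSE) can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes scikit-learn's `RandomForestRegressor` and `PCA`, the function names `lag_matrix`, `validation_rmse` and `tune_pcrf` are hypothetical, the hyperparameter grid is an arbitrary small example, and the series is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor


def lag_matrix(series, n_lags):
    """Rows are n_lags consecutive values; the target is the next value."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = np.array(series[n_lags:])
    return X, y


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))


def validation_rmse(series, n_train, n_valid, n_lags, n_components=None, **rf_params):
    """Train on the first n_train days and score on the next n_valid days."""
    X, y = lag_matrix(series[:n_train + n_valid], n_lags)
    split = n_train - n_lags  # rows whose target day lies in the training period
    X_tr, y_tr, X_va, y_va = X[:split], y[:split], X[split:], y[split:]
    if n_components is not None:
        pca = PCA(n_components=n_components).fit(X_tr)  # fitted on training rows only
        X_tr, X_va = pca.transform(X_tr), pca.transform(X_va)
    model = RandomForestRegressor(random_state=0, **rf_params).fit(X_tr, y_tr)
    return rmse(y_va, model.predict(X_va))


def tune_pcrf(series, n_train, n_valid, max_lags):
    # Number of historical days with the lowest validation RMSE.
    best_lags = min(range(1, max_lags + 1),
                    key=lambda n: validation_rmse(series, n_train, n_valid, n))
    base = validation_rmse(series, n_train, n_valid, best_lags)
    # Number of principal components with the lowest validation RMSE.
    best_k = min(range(1, best_lags + 1),
                 key=lambda k: validation_rmse(series, n_train, n_valid, best_lags, k))
    with_pca = validation_rmse(series, n_train, n_valid, best_lags, best_k)
    # Keep the PCA-transformed features only if they improve on the originals.
    n_components = best_k if with_pca < base else None
    # Illustrative grid over number of trees and fraction of features per split.
    grid = [{"n_estimators": t, "max_features": m}
            for t in (100, 300) for m in (1.0, 1 / 3)]
    best_rf = min(grid, key=lambda p: validation_rmse(
        series, n_train, n_valid, best_lags, n_components, **p))
    return best_lags, n_components, best_rf


rng = np.random.default_rng(1)
# Synthetic 50-day series with a weekly pattern (illustrative only).
series = [20 + 5 * np.sin(2 * np.pi * d / 7) + rng.normal(0, 1) for d in range(50)]
best_lags, n_components, best_rf = tune_pcrf(series, n_train=30, n_valid=10, max_lags=10)
```

The sketch stops at the tuning stage; a final model refitted with the selected settings would then be evaluated once on the held-out test days. It also assumes that the training period is long enough to leave at least as many training rows as lagged features.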

3.3. Dataset Collection and Splitting

As mentioned in the Introduction, the validation of the proposed methodology is conducted using five datasets from four different pilot sites of the SHOW project. It is important to note that, although CCAM has been adopted and demonstrated for several months across all four pilots, there are substantial differences in the maturity of the technology. As a result, significant variations exist in the dataset sizes, as shown in the Results section.

Regarding data collection, differences exist between the four pilot sites. In some cases, passenger counting is not automated, requiring the safety driver to manually record this information. In other cases, where an unscheduled transportation service operates, a booking system is in place, from which passenger demand data is extracted. Regardless of whether the recording is automated or manual and whether the transportation is scheduled or on-demand, the aggregated daily passenger demand is used for all pilot sites.

The differences in dataset sizes also affect how the data are split into training, validation, and test sets, as well as the maximum range of optimal historical days that the algorithm can select. However, the same fundamental principles are applied across all five datasets. This can be illustrated by the following example: for a dataset containing 50 records, the optimal historical days are limited to 10 days, the training set is set to 30 days (from which the number of optimal historical days must be deducted since they serve as input variables for the algorithm), and the validation and test sets each contain 10 records. Consequently, in larger datasets, both the size of the training/validation/test sets and the maximum range of optimal historical days increase, while in smaller datasets, the opposite occurs.
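The 50-record example above can be reproduced with a short sketch. The function name `chronological_split` is hypothetical, and placing the validation block immediately before the test block is an assumption consistent with the test set being the last part of each series.

```python
from typing import List, Tuple


def chronological_split(series: List[float], n_valid: int, n_test: int
                        ) -> Tuple[List[float], List[float], List[float]]:
    """Split a series into train/validation/test blocks without shuffling."""
    n_train = len(series) - n_valid - n_test
    if n_train <= 0:
        raise ValueError("validation and test sets leave no training data")
    return (series[:n_train],
            series[n_train:n_train + n_valid],
            series[n_train + n_valid:])


records = list(range(50))  # stand-in for 50 daily demand values
train, valid, test = chronological_split(records, n_valid=10, n_test=10)
max_lags = 10
# With n lagged inputs, the first n training days have no complete history,
# so only len(train) - n training days can serve as prediction targets.
usable_targets = len(train) - max_lags
```

Here the training block has 30 days, of which 20 remain as targets when the maximum of 10 historical days is used as input.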

3.4. Evaluation Metrics

This subsection presents the evaluation metrics used in this study to validate the proposed PCRF methodology. In line with the literature [51], well-established evaluation metrics for regression problems are employed, including Mean Absolute Error (MAE), Median Absolute Error (MdAE), and RMSE, along with their normalized versions (NMAE, NMdAE, and NRMSE) [44]. The following formulas illustrate how these metrics are computed:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |e_{i}| \qquad (1)$$

where $e_{i}$ is the error between the actual and predicted values and $n$ is the number of predicted demand values.

$$\mathrm{MdAE} = \mathrm{median}(|e_{i}|) \qquad (2)$$

a metric that is more robust against outliers than MAE.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_{i}^{2}} \qquad (3)$$

a stricter metric than MAE, as it penalizes large errors more heavily.

$$\mathrm{NMAE} = \frac{\mathrm{MAE}}{\max(y) - \min(y)} \qquad (4)$$

where $y$ is the vector of actual passenger demand values.

$$\mathrm{NMdAE} = \frac{\mathrm{MdAE}}{\max(y) - \min(y)} \qquad (5)$$

$$\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\max(y) - \min(y)} \qquad (6)$$

The last three normalized metrics are very useful for ML regression problems, as they provide insights regarding the error magnitude and the general performance of the methodologies regardless of the research field and the value level.
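The six metrics can be computed with the Python standard library alone, as in the sketch below. The function name `evaluate` and the sample values are hypothetical; the median is taken over absolute errors, as the name of the metric implies.

```python
import math
import statistics
from typing import Dict, Sequence


def evaluate(actual: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, MdAE, RMSE and their range-normalized versions."""
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    mae = sum(abs_errors) / len(errors)
    mdae = statistics.median(abs_errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    value_range = max(actual) - min(actual)  # normalizer of Equations (4)-(6)
    if value_range == 0:
        raise ValueError("actual values are constant; normalized metrics are undefined")
    return {
        "MAE": mae, "MdAE": mdae, "RMSE": rmse,
        "NMAE": mae / value_range,
        "NMdAE": mdae / value_range,
        "NRMSE": rmse / value_range,
    }


# Hypothetical values, chosen so the arithmetic is easy to follow.
actual = [10, 20, 30, 40, 50]
predicted = [12, 18, 33, 40, 45]
metrics = evaluate(actual, predicted)
```

For these values the absolute errors are 2, 2, 3, 0 and 5, so MAE is 2.4 and MdAE is 2; with an actual range of 40, NMAE is 0.06 (6%) and NMdAE is 0.05 (5%). The single error of 5 pulls RMSE (about 2.9) above MAE, which shows the heavier penalty on large errors.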

4. Results

This section presents the results of the proposed PCRF methodology for daily passenger demand forecasting across the following pilot sites:

Tampere, Finland (two different running phases).
Frankfurt, Germany.
Carinthia, Austria.
Trikala, Greece.

4.1. Tampere 1st Phase

The first period of the Tampere pilot site, which includes passenger demand data, spans from 5 January 2022 to 10 March 2022. During this period, there were 44 operational days. Of these, 34 days were used for the training set, while five days each were allocated to the validation and test sets. The forecasting results are illustrated in Figure 2, where the ability of the proposed methodology to capture passenger demand magnitude can be observed.

4.2. Tampere 2nd Phase

The second period used for demand forecasting at the Tampere pilot site is significantly longer than the first, spanning approximately six months from 9 January 2023 to 30 June 2023. Consequently, the number of operational days in this period is nearly three times higher than in the first, increasing from 44 to 117 days. As expected, the training, validation, and testing sets are also larger, consisting of 77, 20, and 20 days, respectively. Finally, as shown in Figure 3, the forecasting results indicate that, in addition to accurately capturing passenger demand magnitude, the proposed methodology also effectively captures demand patterns due to the larger dataset.

4.3. Frankfurt

In Frankfurt, passenger data are available for 159 operational days from 1 December 2022 to 30 June 2023. For this pilot site, the dataset was divided into 99 days for training, 30 days for validation, and 30 days for testing. Finally, as shown in Figure 4, the forecasting results demonstrate that the proposed methodology effectively captures the passenger demand pattern.

4.4. Carinthia

In Carinthia, passenger data are available for only 29 operational days from 21 September 2021 to 12 November 2021. As expected, the dataset for training, validation, and testing is relatively small, consisting of 19, 5, and 5 days, respectively. Figure 5 presents the actual and forecasted passenger demand values, demonstrating that the proposed methodology successfully captures the passenger demand levels.

4.5. Trikala

In Trikala, passenger data are available for 38 operational days from 1 February 2024 to 29 March 2024. Given the relatively small dataset, the training, validation, and test sets consist of 28, 5, and 5 days, respectively. As shown in Figure 6, the PCRF methodology successfully captured the passenger demand level.

4.6. Evaluation Results

Table 1 presents a summary of the results across all pilot sites based on the evaluation metrics analyzed in Section 3.4. As shown, the proposed forecasting methodology achieved consistently strong performance, with error rates ranging from below 6% (NMdAE for Carinthia) to below 30% (NRMSE for Trikala) and an average normalized error of approximately 15%. More specifically, except for the Trikala pilot site, where all the results of the evaluation metrics indicate a moderate performance of the proposed algorithm (25–30% error and an absolute passenger error in the range of 25–30 passengers), the results in all other pilot sites show high accuracy, with percentage errors either around 10% (for Carinthia, Frankfurt, and the first period of Tampere) or between 15 and 20% for the second running phase of Tampere. The aforementioned outcome is particularly promising, given the limited dataset size for most pilot sites, further demonstrating the effectiveness of the proposed approach.

4.7. Comparative Results

As a final step to validate the proposed PCRF methodology and in accordance with Hyndman’s guidelines [52] for methodology benchmarking, Table 2 presents a comparison of the proposed approach with baseline forecasting methods such as Naïve, Average, and Drift. Since the primary focus of this table is comparison, only the normalized versions of MAE, MdAE, and RMSE are displayed. This is because, within a given dataset, the error of every method is divided by the same range of actual values, so a method that achieves the best results in the normalized metric (e.g., NMAE) will also yield the best results in the base metric (e.g., MAE), making the inclusion of non-normalized metrics redundant.
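For reference, the three baselines can be sketched in their standard one-step-ahead textbook form. The article does not spell out its exact implementation, so the definitions below are an assumption, and the function names and sample history are hypothetical.

```python
from typing import Sequence


def naive_forecast(history: Sequence[float]) -> float:
    """Naive method: the next value equals the last observed value."""
    return history[-1]


def average_forecast(history: Sequence[float]) -> float:
    """Average method: the next value equals the mean of all past values."""
    return sum(history) / len(history)


def drift_forecast(history: Sequence[float]) -> float:
    """Drift method: extend the line through the first and last observations by one step."""
    if len(history) < 2:
        raise ValueError("drift needs at least two observations")
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return history[-1] + slope


history = [10, 14, 12, 16, 18]  # hypothetical daily demand
forecasts = {
    "naive": naive_forecast(history),
    "average": average_forecast(history),
    "drift": drift_forecast(history),
}
```

For this history the three baselines predict 18, 14 and 20 passengers, respectively, for the following day.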

Table 2 shows that the proposed methodology performed better overall than the baseline methodologies of Naïve, Average, and Drift. In particular, the suggested PCRF methodology was the best forecasting methodology in 8 out of 15 cases (three evaluation metrics for each of the five datasets). Its superiority over the baseline methodologies is further supported by the fact that, on the three largest datasets (Frankfurt, Tampere second phase, and Tampere first phase, with 159, 117, and 44 records, respectively), it has the best results 7 out of 9 times. Finally, another point in favor of the proposed PCRF methodology is that, in the three largest datasets, PCRF achieved the best NRMSE, which is the normalized version of the most rigorous evaluation metric, as described in the evaluation metrics subsection.

5. Conclusions

The present article introduced a novel iterative methodology for passenger demand prediction in CCAM, leveraging statistical and machine learning methodologies, such as PCA and Random Forest. The application of the proposed PCRF methodology to five real-life datasets from four pilot sites demonstrated its reliability and robustness, regarding forecasting accuracy, even when working with limited data. The aforementioned statement related to the PCRF efficiency is justified by the results of the evaluation metrics, which indicated an average normalized error of approximately 15%. Additionally, the comparative analysis against baseline forecasting methods—widely recognized for their simplicity and efficiency with small datasets—highlighted the superiority of the proposed approach and its potential.

As future work, the suggested PCRF methodology will be further enhanced by incorporating features known to influence passenger demand in urban mobility, such as weather conditions (temperature, humidity, luminosity, etc.) and calendar information (day of the week, holidays, etc.). The inclusion of such factors will provide an opportunity to create a more sophisticated model capable of capturing more intricate demand patterns. Finally, a crucial future action toward the generalizability of the PCRF methodology will be its application to larger CCAM datasets, allowing for comparisons with more complex AI methodologies, including deep learning approaches and algorithms from the boosting family.

Author Contributions

Conceptualization, G.S.; methodology, G.S.; software, G.S.; validation, G.S.; formal analysis, G.S.; investigation, G.S.; data curation, G.S.; writing—original draft preparation, G.S.; writing—review and editing, G.S.; visualization, G.S.; supervision, A.L.; project administration, A.L. and K.V.; funding acquisition, K.V. and D.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially funded by the European Union’s Horizon 2020 Research and Innovation Program through SHared automation Operating models for Worldwide adoption (SHOW) under Grant Agreement No. 875530.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are unavailable due to privacy or ethical restrictions.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Gruyer, D.; Orfila, O.; Glaser, S.; Hedhli, A.; Hautière, N.; Rakotonirainy, A. Are connected and automated vehicles the silver bullet for future transportation challenges? Benefits and weaknesses on safety, consumption, and traffic congestion. Front. Sustain. Cities 2021, 2, 607054.
2. Hák, T.; Janoušková, S.; Moldan, B. Sustainable Development Goals: A need for relevant indicators. Ecol. Indic. 2016, 60, 565–573.
3. Chehri, A.; Mouftah, H.T. Autonomous vehicles in the sustainable cities, the beginning of a green adventure. Sustain. Cities Soc. 2019, 51, 101751.
4. Taiebat, M.; Brown, A.L.; Safford, H.R.; Qu, S.; Xu, M. A review on energy, environmental, and sustainability implications of connected and automated vehicles. Environ. Sci. Technol. 2018, 52, 11449–11465.
5. Lazarus, J.; Shaheen, S.; Young, S.E.; Fagnant, D.; Voege, T.; Baumgardner, W.; Fishelson, J.; Sam Lott, J. Shared Automated Mobility and Public Transport; Springer: Berlin/Heidelberg, Germany, 2018.
6. Obschonka, M.; Audretsch, D.B. Artificial intelligence and big data in entrepreneurship: A new era has begun. Small Bus. Econ. 2020, 55, 529–539.
7. Benke, K.; Benke, G. Artificial intelligence and big data in public health. Int. J. Environ. Res. Public Health 2018, 15, 2796.
8. Wolfert, S.; Ge, L.; Verdouw, C.; Bogaardt, M.J. Big data in smart farming—A review. Agric. Syst. 2017, 153, 69–80.
9. Li, J.; Herdem, M.S.; Nathwani, J.; Wen, J.Z. Methods and applications for Artificial Intelligence, Big Data, Internet of Things, and Blockchain in smart energy management. Energy AI 2023, 11, 100208.
10. Samara, D.; Magnisalis, I.; Peristeras, V. Artificial intelligence and big data in tourism: A systematic literature review. J. Hosp. Tour. Technol. 2020, 11, 343–367.
11. Rijwani, T.; Kumari, S.; Srinivas, R.; Abhishek, K.; Iyer, G.; Vara, H.; Dubey, S.; Revathi, V.; Gupta, M. Industry 5.0: A review of emerging trends and transformative technologies in the next industrial revolution. Int. J. Interact. Des. Manuf. (IJIDEM) 2024, 19, 667–679.
12. Eshetu, A.; Valilai, O.F.; Wicaksono, H. Unveiling the Potential of Artificial Intelligence in Cooperative, Connected, and Automated Mobility (CCAM) Solutions: A Systematic Literature Review. In Proceedings of the 2024 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), Bangkok, Thailand, 15–18 December 2024; pp. 1272–1276.
13. Spanos, G.; Siomos, A.; Schmidt, C.; Tygesen, M.; Salanova, J.M.; Rodrigues, F.; Papadopoulos, A.; Antypas, E.; Sersemis, A.; Gemou, M.; et al. Services for Connected, Cooperated, and Automated Mobility based on Big Data and Artificial Intelligence: The SHOW project paradigm. Open Res. Eur. 2025, 5, 24.
14. Antypas, E.; Spanos, G.; Lalas, A.; Votis, K.; Tzovaras, D. A time series approach for estimated time of arrival prediction in autonomous vehicles. Transp. Res. Procedia 2024, 78, 166–173.
15. Papadopoulos, A.; Sersemis, A.; Spanos, G.; Lalas, A.; Liaskos, C.; Votis, K.; Tzovaras, D. Lightweight accident detection model for autonomous fleets based on GPS data. Transp. Res. Procedia 2024, 78, 16–23.
16. Banister, D. Sustainable transport: Challenges and opportunities. Transportmetrica 2007, 3, 91–106.
17. Banerjee, N.; Morton, A.; Akartunalı, K. Passenger demand forecasting in scheduled transportation. Eur. J. Oper. Res. 2020, 286, 797–810.
18. Higgoda, R.; Madurapperuma, M. Dynamic Nexus between Air-Transportation and Economic Growth: A Systematic Literature Review. J. Transp. Technol. 2019, 9, 156–170.
19. Xu, M.; Ma, X.; Zhao, Y.; Qiao, W. A Systematic Literature Review of Maritime Transportation Safety Management. J. Mar. Sci. Eng. 2023, 11, 2311.
20. Sogbe, E.; Susilawati, S.; Pin, T.C. Scaling up public transport usage: A systematic literature review of service quality, satisfaction and attitude towards bus transport systems in developing countries. Public Transp. 2024, 1–44.
21. Lyu, T.; Wang, P.S.; Gao, Y.; Wang, Y. Research on the big data of traditional taxi and online car-hailing: A systematic review. J. Traffic Transp. Eng. Engl. Ed. 2021, 8, 1–34.
22. Zachariah, R.A.; Sharma, S.; Kumar, V. Systematic review of passenger demand forecasting in aviation industry. Multimed. Tools Appl. 2023, 82, 46483–46519.
23. Ingle, C.; Bakliwal, D.; Jain, J.; Singh, P.; Kale, P.; Chhajed, V. Demand forecasting: Literature review on various methodologies. In Proceedings of the 2021 12th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kharagpur, India, 6–8 July 2021; pp. 1–7.
24. Nithin, K.S.; Mulangi, R.H. Spatio-Temporal Factors Affecting Short-Term Public Transit Passenger Demand Prediction: A Review. In Proceedings of the International Conference on Transportation Planning and Implementation Methodologies for Developing Countries, Mumbai, India, 18–20 December 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 421–430.
25. Xue, R.; Sun, D.; Chen, S. Short-term bus passenger demand prediction based on time series model and interactive multiple model approach. Discret. Dyn. Nat. Soc. 2015, 2015, 682390.
26. Tang, J.; Zuo, A.; Liu, J.; Li, T. Seasonal decomposition and combination model for short-term forecasting of subway ridership. Int. J. Mach. Learn. Cybern. 2022, 13, 145–162.
27. Tao, S.; Corcoran, J.; Rowe, F.; Hickman, M. To travel or not to travel: ‘Weather’ is the question. Modelling the effect of local weather conditions on bus ridership. Transp. Res. Part C Emerg. Technol. 2018, 86, 147–167.
28. Hao, S.; Lee, D.H.; Zhao, D. Sequence to sequence learning with attention mechanism for short-term passenger flow prediction in large-scale metro system. Transp. Res. Part C Emerg. Technol. 2019, 107, 287–300.
29. Liu, L.; Chen, R.C.; Zhu, S. Impacts of weather on short-term metro passenger flow forecasting using a deep LSTM neural network. Appl. Sci. 2020, 10, 2962.
Liu, Y.; Lyu, C.; Liu, X.; Liu, Z. Automatic feature engineering for bus passenger flow prediction based on modular convolutional neural network. IEEE Trans. Intell. Transp. Syst. 2020, 22, 2349–2358. [Google Scholar] [CrossRef]
Koutroumanidis, T.; Sylaios, G.; Zafeiriou, E.; Tsihrintzis, V.A. Genetic modeling for the optimal forecasting of hydrologic time series: Application in Nestos River. J. Hydrol. 2009, 368, 156–164. [Google Scholar] [CrossRef]
Caliwag, A.C.; Lim, W. Hybrid VARMA and LSTM method for lithium-ion battery state-of-charge and output voltage forecasting in electric motorcycle applications. IEEE Access 2019, 7, 59680–59689. [Google Scholar] [CrossRef]
Hong, J.; Wang, Z.; Chen, W.; Wang, L.Y.; Qu, C. Online joint-prediction of multi-forward-step battery SOC using LSTM neural networks and multiple linear regression for real-world electric vehicles. J. Energy Storage 2020, 30, 101459. [Google Scholar] [CrossRef]
Athanasakis, E.; Spanos, G.; Papadopoulos, A.; Lalas, A.; Votis, K.; Tzovaras, D. A Comprehensive Leakage-Free Forecasting Pipeline for Segmented Time Series: Application to Cross-Trip State-of-Charge Prediction in Automated Electric Vehicles. IEEE Trans. Intell. Veh. 2024; early access. [Google Scholar] [CrossRef]
Greenacre, M.; Groenen, P.J.; Hastie, T.; d’Enza, A.I.; Markos, A.; Tuzhilina, E. Principal component analysis. Nat. Rev. Methods Primers 2022, 2, 100. [Google Scholar] [CrossRef]
Spanos, G.; Giannoutakis, K.M.; Votis, K.; Viaño, B.; Augusto-Gonzalez, J.; Aivatoglou, G.; Tzovaras, D. A lightweight cyber-security defense framework for smart homes. In Proceedings of the 2020 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), Novi Sad, Serbia, 24–26 August 2020; pp. 1–7. [Google Scholar]
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: Berlin/Heidelberg, Germany, 2013; Volume 112. [Google Scholar]
Antoniadis, A.; Lambert-Lacroix, S.; Poggi, J.M. Random forests for global sensitivity analysis: A selective review. Reliab. Eng. Syst. Saf. 2021, 206, 107312. [Google Scholar] [CrossRef]
Aivatoglou, G.; Anastasiadis, M.; Spanos, G.; Voulgaridis, A.; Votis, K.; Tzovaras, D.; Angelis, L. A RAkEL-based methodology to estimate software vulnerability characteristics & score-an application to EU project ECHO. Multimed. Tools Appl. 2022, 81, 9459–9479. [Google Scholar]
Jolliffe, I.T. Principal Component Analysis for Special Types of Data; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
Parmar, A.; Katariya, R.; Patel, V. A review on random forest: An ensemble classifier. In Proceedings of the International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI), Coimbatore, India, 7–8 August 2018; Springer: Berlin/Heidelberg, Germany, 2019; pp. 758–763. [Google Scholar]
Polymeni, S.; Pitsiavas, V.; Spanos, G.; Matthewson, Q.; Lalas, A.; Votis, K.; Tzovaras, D. Toward sustainable mobility: AI-enabled automated refueling for Fuel Cell Electric Vehicles. Energies 2024, 17, 4324. [Google Scholar] [CrossRef]
Wang, W.C.; Chau, K.W.; Xu, D.M.; Chen, X.Y. Improving forecasting accuracy of annual runoff time series using ARIMA based on EEMD decomposition. Water Resour. Manag. 2015, 29, 2655–2675. [Google Scholar] [CrossRef]
Mahjoub, S.; Chrifi-Alaoui, L.; Marhic, B.; Delahoche, L. Predicting energy consumption using LSTM, multi-layer GRU and drop-GRU neural networks. Sensors 2022, 22, 4062. [Google Scholar] [CrossRef]
Brownlee, J. Introduction to Time Series Forecasting with Python: How to Prepare Data and Develop Models to Predict the Future; Machine Learning Mastery: Dorado, CA, USA, 2017. [Google Scholar]
Xu, Y.; Goodacre, R. On splitting training and validation set: A comparative study of cross-validation, bootstrap and systematic sampling for estimating the generalization performance of supervised learning. J. Anal. Test. 2018, 2, 249–262. [Google Scholar] [CrossRef]
Zhou, J.; Shi, J.; Li, G. Fine tuning support vector machines for short-term wind speed forecasting. Energy Convers. Manag. 2011, 52, 1990–1998. [Google Scholar] [CrossRef]
Suradhaniwar, S.; Kar, S.; Durbha, S.S.; Jagarlapudi, A. Time series forecasting of univariate agrometeorological data: A comparative performance evaluation via one-step and multi-step ahead forecasting strategies. Sensors 2021, 21, 2430. [Google Scholar] [CrossRef] [PubMed]
Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688. [Google Scholar] [CrossRef]
Hyndman, R. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2018. [Google Scholar]

Figure 1. PCRF methodology.

Figure 2. Tampere (first period) forecasting results.

Figure 3. Tampere (second period) forecasting results.

Figure 4. Frankfurt forecasting results.

Figure 5. Carinthia forecasting results.

Figure 6. Trikala forecasting results.

Table 1. Summary of the evaluation results for all pilot sites.

Pilot site	MAE	MdAE	RMSE	NMAE	NMdAE	NRMSE
Tampere (1st period)	5.4	4.08	6.03	13.50%	10.20%	15.10%
Tampere (2nd period)	10.55	10.03	12.98	17.30%	16.40%	21.30%
Frankfurt	20	17.67	23.45	12.40%	11.00%	14.60%
Carinthia	8.62	6.42	10.97	7.80%	5.80%	9.90%
Trikala	26.89	27.95	29.77	26.40%	27.40%	29.20%
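
As an illustration of how the six columns of Table 1 relate to one another, the following minimal Python sketch computes the absolute and normalized error metrics for a pair of observed and forecast series. The function name `forecast_errors`, the `scale` parameter, and the choice to normalize by the mean of the observed values are assumptions made for this example; the normalization actually used is the one defined in Section 3.4.

```python
import math
from statistics import mean, median


def forecast_errors(actual, predicted, scale=None):
    """Return MAE, MdAE, RMSE and their normalized counterparts.

    `scale` is the normalizing constant. If it is omitted, the mean of the
    observed values is used, which is an ASSUMPTION of this sketch and not
    necessarily the normalization applied in the article.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("series must be non-empty and of equal length")

    # Absolute forecast error for every time step.
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]

    mae = mean(abs_err)                                # mean absolute error
    mdae = median(abs_err)                             # median absolute error
    rmse = math.sqrt(mean([e * e for e in abs_err]))   # root mean squared error

    if scale is None:
        scale = mean(actual)  # assumed normalizer (hypothetical choice)

    return {
        "MAE": mae,
        "MdAE": mdae,
        "RMSE": rmse,
        "NMAE": mae / scale,
        "NMdAE": mdae / scale,
        "NRMSE": rmse / scale,
    }


# Toy passenger counts, invented purely for illustration.
observed = [40, 50, 30, 40]
forecast = [44, 47, 36, 39]
for name, value in forecast_errors(observed, forecast).items():
    print(f"{name}: {value:.4f}")
```

Passing an explicit `scale` (for example, the range of the observed values) makes it possible to reproduce other common normalization conventions with the same function.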

Table 2. Comparative results.

Method	Metric	Tampere (1st Period)	Tampere (2nd Period)	Frankfurt	Carinthia	Trikala
PCRF	NMAE	13.50%	17.30%	12.40%	7.80%	26.40%
PCRF	NMdAE	10.20%	16.40%	11.00%	5.80%	27.40%
PCRF	NRMSE	15.10%	21.30%	14.60%	9.90%	29.20%
NAIVE	NMAE	17%	19.67%	12.09%	6.85%	21.76%
NAIVE	NMdAE	12.50%	19.67%	7.45%	8.11%	14.71%
NAIVE	NRMSE	21.15%	22.82%	17.46%	7.67%	28.53%
AVERAGE	NMAE	14.53%	23.15%	13.74%	11.62%	19.27%
AVERAGE	NMdAE	11.55%	23.01%	15.12%	10.19%	20.70%
AVERAGE	NRMSE	17.32%	27.54%	15.92%	13.29%	23.34%
DRIFT	NMAE	17.97%	19.87%	12.11%	6.85%	22.19%
DRIFT	NMdAE	13.66%	19.98%	7.31%	8.11%	14.98%
DRIFT	NRMSE	22.52%	22.94%	17.53%	7.43%	28.86%
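
For readers unfamiliar with the three benchmarks in Table 2, the sketch below shows the textbook definitions of the naive, average, and drift forecasts. It is a minimal illustration that assumes these standard definitions; the function names are hypothetical, and the article's own baseline implementation may differ in detail (for example, in how the training window is chosen).

```python
from statistics import mean


def naive_forecast(train, horizon):
    """Repeat the last observed value for every future step."""
    return [train[-1]] * horizon


def average_forecast(train, horizon):
    """Repeat the mean of the training series for every future step."""
    return [mean(train)] * horizon


def drift_forecast(train, horizon):
    """Extend the straight line joining the first and last observations."""
    if len(train) < 2:
        raise ValueError("drift needs at least two observations")
    # Average change per step over the training series.
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + h * slope for h in range(1, horizon + 1)]


# Toy daily passenger counts, invented purely for illustration.
history = [10, 12, 14, 16]
print(naive_forecast(history, 2))
print(average_forecast(history, 2))
print(drift_forecast(history, 2))
```

Because the drift slope depends only on the first and last observations, the drift and naive forecasts are close whenever the series has little net trend, which is consistent with the similar NAIVE and DRIFT rows in Table 2.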

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).