1. Introduction
Transportation in urban areas is among the top challenges to improve people’s quality of life and to reduce pollution. Historically, private vehicles have been the preferred mode of transportation. Orthogonally, governments invest in public transportation systems to offer alternatives to reduce traffic and pollution. With the rise of the sharing economy, we are now witnessing a transition towards new forms of shared mobility, which have spurred the interest of both the research community and the private companies.
Car sharing is an evolution of the classic car rental model. Here, users can rent cars on demand for a short period, e.g., a 20-minute trip across town. In particular, Free-Floating Car Sharing (FFCS) services allow customers to pick up and return cars anywhere inside an operative area in a city. Customers book, unlock, and return the car by using an application on their smartphones. In the FFCS implementation, the provider bills the user only for the time spent driving, with simple minute-based fares that factor in all costs. Car2Go (https://www.car2go.com/) is one of the FFCS services that currently operates in several cities around the world. Some studies demonstrate that a massive adoption of car-sharing services can improve mobility as well as reduce costs and pollution [1,2,3].
To properly design and manage an FFCS service, a provider needs to know the demand for cars over different periods of the day, and over the different areas of the city. The prediction of FFCS demand patterns is thus fundamental for an adequate provisioning of the service. Armed with good predictions, the provider can better plan long-term system management, e.g., whether to extend the operative area to those neighborhoods with expected customer growth. Similarly, it can implement short-term dynamic relocation policies to better meet the demand in the next hours [4,5,6].
In this work, we investigate the prediction of the usage dynamics of a real FFCS service. We aim at assessing how state-of-the-art machine-learning algorithms can help FFCS providers and policy makers in predicting the demand, both over time and across different spatial regions. More specifically, we leverage a dataset of real rides from cities where Car2Go is offering its FFCS service. We consider as a case study the city of Vancouver, Canada, the city with the highest demand for cars in our dataset. We rely on more than 1 million rentals covering 9 months in 2017 [7]. We augment the dataset by exploiting a rich and heterogeneous open dataset, namely the 2016 Vancouver Municipality census (https://opendata.vancouver.ca/pages/home/). This second dataset comprises more than 800 features, which detail very diverse information about shops in each neighborhood, weather conditions, residents, rate of emergency calls throughout the day, etc. Our goal is first to assess to what extent it is possible to predict the FFCS demand over time and space, and second to identify which features have the highest predictive power.
Our work focuses on two scenarios. In the first scenario, we investigate how to predict the demand for cars in the future considering past usage. This is fundamental for managing the FFCS fleet both in the short term (e.g., implementing relocation policies during service peak time), and in the long term (e.g., to properly match the fleet size to the future system growth). To this end, we analyse machine-learning algorithms that are considered state of the art, from simple Linear Regression and traditional Seasonal Auto Regressive Integrated Moving Average (SARIMA) models, to Random Forests Regression (RFR), Support Vector Regression (SVR) and the latest approaches based on Long Short-Term Memory (LSTM) Neural Networks (NN) [8,9]. Given the increasing complexity of these models, we aim at assessing not only how they perform in our target prediction task, but also to what extent one needs to embrace a complex model (such as an NN) rather than a simpler and more informative one (such as linear regression or RFR).
In the second scenario, we correlate socio-demographic indicators with FFCS demand. We predict the demand for cars in a neighborhood without past data, using only socio-demographic data. This problem is often referred to as a green field or cold start approach. In this case, the operator is interested in knowing the expected system usage in a new neighborhood (or even a new city) based only on socio-demographic data. We map the FFCS demand to Vancouver neighborhoods, and associate them with the socio-demographic data coming from the official Vancouver census. We then use machine-learning techniques to highlight the relationship between demographics and customers’ mobility. We aim at answering the following research questions: (i) Using modern machine-learning methodologies, and armed with rich socio-demographic data, would one be able to predict the temporal mobility patterns in a city? And (ii) which would be the most important socio-demographic data to use for this task?
Through a series of experiments, we show that the temporal prediction of rentals can be solved with errors as low as 10%. Interestingly, Random Forest Regression turns out to perform consistently better than the other models, including Neural Networks, for this task. When considering the mobility prediction using only socio-demographic data, we obtain errors in the 40–50% range. While this performance may not be accurate enough for precise planning, the prediction would still be useful for operators that need to decide, e.g., to which new areas of the city to extend their service. Our models also reveal which features are the most useful for the prediction problem. This is valuable information for providers and regulators that wish to understand FFCS systems, for instance to decide in which new cities to start a service (the green field problem). Our work suggests, for example, that the density of people commuting on foot and the number of emergency calls in a neighborhood are important factors for predicting the number of rentals that will start there. We note that emergency calls are used as a proxy for human activity, i.e., the more human activity, the larger the number of emergency calls. Given this assumption, we can leverage the information about the volume of emergency calls to improve prediction at different periods of the day. As for the temporal prediction, knowing the weather conditions in the near future would improve prediction too.
After overviewing the related work in Section 2, we describe the data collection methodology we adopt in Section 3. Section 4 provides a characterization of the datasets, while Section 5 and Section 6 provide details about the methodologies and results for the temporal and spatial prediction, respectively. Finally, Section 7 summarizes our findings.
2. Related Work
With the ease of collecting data and the ability to build and train off-the-shelf machine-learning solutions, researchers have started applying data-driven approaches in the context of transportation. Previous work [10] addressed traffic modeling and prediction with real traffic data, and proposed strategies to improve congestion prediction using Kalman filters, showing how traffic is stationary in time. Other studies [11] proposed new approaches based on a multivariate extension of non-parametric regression to predict traffic patterns, with the goal of counteracting traffic congestion. While similar in spirit, our work focuses on FFCS services explicitly, and uses a much richer dataset as well as more advanced machine-learning algorithms.
Regarding car sharing, early work focused on estimating demand using activity-based micro-simulation to model how agents move around in a city [12]. Later, as data from operative car-sharing platforms became available, researchers started using real data to analyze mobility demand. For instance, previous work [2,13] proposed a demand model to forecast the modal split of the urban transport demand. Similarly, other studies [3] investigated the Mobility-as-a-Service market, where FFCS is one of the implementations, and pointed out how FFCS supply can push users to avoid purchasing a new car, which would lead to a reduction of emissions. Yet, none of these prior studies focused on car-sharing demand prediction.
Along the same lines, other studies [14] conducted a large survey covering a Swiss station-based car-sharing service. The results confirmed that FFCS is preferred as a fast alternative to public transportation and that the subscription depends on the car-sharing implementation (business model). Previous work [4] also proposed a simple binary logistic model for predicting car-sharing subscribers in Switzerland, considering the relationship between potential membership and service availability. This relationship was then used to identify areas with unmet demand, i.e., areas where new car-sharing stations could be placed.
Other studies [15,16] conducted a detailed characterization of a car-sharing system in Munich and Berlin. As with our work, they identified features correlated with the demand for shared cars in the target cities. However, our work differs from theirs in that we analyze a much larger set of features, including demographics and economic data, and consider multiple prediction models. We focus on demand prediction, facing both time and space dimensions, and provide a thorough comparison and guidelines for future directions.
In our previous work [17], we analyzed in depth the usage of different car-sharing systems in Vancouver. Based on this data, we developed a model of FFCS usage and built a simulator to design new systems based on electric vehicles [5]. In particular, we tackled the charging station placement problem, showing that the optimal placement requires few stations to satisfy charging requests in different cities [6].
To the best of our knowledge, we are the first to face the demand prediction problem in Free-Floating Car-Sharing Systems tackling both the temporal and spatial prediction with a real-world heterogeneous dataset. The demand prediction problem (or its variations) has been tackled in other domains [18,19], but we here focus on multiple prediction tasks (long-term, short-term) across different aspects (temporal and spatial) in the car-sharing domain.
Furthermore, while previous work [20] focused on the temporal prediction of car-sharing demand on a very short-term basis (demand prediction in the next few minutes), in this work we focus on the problem at different time scales. We also compare several prediction strategies and analyze how the temporal prediction problem relates to the spatial prediction one. Moreover, we are the first to use a very heterogeneous dataset including dozens of features to tackle the prediction problems. This allows us to provide insights on which of those features are the most important ones to solve our prediction problems as well as to have a broader perspective on the challenges involved in car-sharing prediction.
5. Temporal Predictions of Rentals
In this section, we describe our task of predicting the number of rentals in the whole city at a given time in the future. The same methodology could also be applied to each neighborhood. This prediction can exploit historical data, i.e., given the time series of rentals in the past, predict the number of rentals in the future. If only the past time series is used, the problem falls in the univariate regression class, i.e., the prediction is based only on past data of the same target variable. Let $x(t)$ be our target variable, i.e., the number of rentals at time $t$. In the case of prediction with historical data, we predict $x(t+j)$ as a function $f$ of the past $k$ data points of $x$ itself, i.e., $x(t+j) = f(x(t), x(t-1), \ldots, x(t-k+1))$, where $j$ is the horizon of the prediction.
If we also have other information, we can build a more generic model that accounts for the dependence on other variables. We want to predict $x(t+j) = g(y_1, y_2, \ldots, y_m)$, where $y_1, \ldots, y_m$ are different variables, possibly time series themselves (including $x$), and $g$ is the model that allows us to predict $x$ at time $t+j$. This problem is a multivariate regression problem, where multiple features are used to predict the target variable $x$.
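As an illustration of the univariate formulation, the following minimal sketch turns a series of rentals into training samples made of past values and a future target. The function name, the lag count, and the numbers are assumptions chosen for the example and do not come from our dataset.

```python
def make_lagged_samples(series, n_lags, horizon=1):
    """Turn a univariate series into (features, target) pairs.

    Each feature vector holds the n_lags most recent values up to time t,
    and the target is the value at time t + horizon.
    """
    samples = []
    for t in range(n_lags - 1, len(series) - horizon):
        past = series[t - n_lags + 1 : t + 1]  # x(t-n_lags+1), ..., x(t)
        target = series[t + horizon]           # x(t+horizon)
        samples.append((list(past), target))
    return samples


rentals = [10, 12, 15, 11, 9, 14]  # made-up rentals per time bin
pairs = make_lagged_samples(rentals, n_lags=3, horizon=1)
```

For the multivariate case, additional columns (for instance, binary indicators for the day of the week) could be appended to each feature vector before training any regressor.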
Considering the time horizon of the prediction, we can formulate two versions of the problem: predict the long-term or short-term usage. In the first case, we build and train a single model using all data at our disposal to predict the system usage in the next months. In the short-term version, we target the prediction of the next time bin only, i.e., $j = 1$. In this second case, we build and update a new model at each time bin by adding the latest recorded number of rentals to the training set as soon as it becomes available.
Both predictions are important for the car-sharing provider. For instance, the long-term predictions are important to know whether the fleet size is enough to keep up with the expected demand. The short-term predictions are important to know when to take a car down for maintenance, or when and where cars should be relocated to those neighborhoods where the demand is expected to increase shortly. While for long-term prediction we use the time series of the rentals and information about day of the week and hour of the day, for short-term prediction we can also use information about the weather conditions in the near future.
In this work, we consider discrete time, i.e., we split time into fixed-size time intervals as defined in the aggregation step (see Section 4 for more details). We then build and train several machine-learning models to tackle each aforementioned problem. Our goal is to compare algorithms in terms of accuracy of the prediction and complexity of the model. Finally, we are also interested in considering models that are interpretable, i.e., that allow us to understand which are the most important features that affect car-sharing usage in large cities. We evaluate all models considering three metrics: APE (absolute percentage error), MAPE (mean absolute percentage error), and RMSE (root mean square error) over the validation set. The APE is defined as

$$\mathrm{APE}(t) = 100 \cdot \frac{|x(t) - \hat{x}(t)|}{x(t)}, \quad t \in V,$$

where $V$ is the validation set, $x(t)$ is the actual value of the data at moment $t$ and $\hat{x}(t)$ is the predicted value. The MAPE is then given by

$$\mathrm{MAPE} = \frac{1}{|V|} \sum_{t \in V} \mathrm{APE}(t),$$

and the RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{|V|} \sum_{t \in V} \left(x(t) - \hat{x}(t)\right)^2}.$$
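The three metrics can be computed with a few lines of code. The sketch below assumes the standard textbook definitions, with the APE expressed in percent and all actual values different from zero; the helper names and the numbers are illustrative only.

```python
import math


def ape(actual, predicted):
    """Absolute percentage error of each point, in percent.

    Assumes every actual value is non-zero.
    """
    return [100.0 * abs(a - p) / a for a, p in zip(actual, predicted)]


def mape(actual, predicted):
    """Mean of the absolute percentage errors over the validation set."""
    errors = ape(actual, predicted)
    return sum(errors) / len(errors)


def rmse(actual, predicted):
    """Root of the mean squared difference between actual and predicted."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual = [100.0, 200.0, 400.0]     # made-up validation values
predicted = [110.0, 180.0, 400.0]  # made-up predictions
```

Note that the MAPE is scale-free while the RMSE is expressed in number of rentals, so the two metrics are not directly comparable across time bins of different length.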
5.1. Prediction Models
We use off-the-shelf machine-learning models both for the long-term and short-term scenarios. We consider the following univariate models: a simple baseline (BL) approach, the auto-regressive integrated moving average (ARIMA) and the seasonal auto-regressive integrated moving average (SARIMA) algorithms. Univariate models do not account for the influence of other time-variant factors such as weather conditions, time of day, number of emergency calls, etc. To account for that, we also investigate the performance of linear regression, Random Forests Regression (RFR), Support Vector Regression (SVR), and long short-term memory (LSTM) neural networks (NN).
We add categorical features (the day of the week and weather, for instance) to these algorithms to improve on the univariate models. Following standard practice [21], we represent each categorical feature as a set of binary variables, one for each category. For example, when representing a given weather type, the corresponding binary variable is set to 1 while all the other weather-related variables are set to 0. We used the implementations of the algorithms available in the Python libraries scikit-learn (https://scikit-learn.org/) [22] and Keras (https://keras.io/). Our code for the analysis is publicly available at https://github.com/dougct/carsharing-prediction. For details about each model, we refer the reader to [9]. In our implementations, we start with the library’s default hyper-parameters and conduct a grid search to find a set of parameters that works well with our models. We report the range of the grid search along with the description of the models below.
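The binary representation of categorical features can be sketched as follows. The category names are hypothetical and only serve the example; they are not the weather types of our dataset.

```python
def one_hot(value, categories):
    """Return one binary variable per category: 1 for the match, 0 otherwise."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


WEATHER_TYPES = ["clear", "clouds", "rain", "fog"]  # hypothetical category names
encoded = one_hot("rain", WEATHER_TYPES)
```

The resulting binary variables can be concatenated with the numerical features before training any of the multivariate models.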
Baseline. A simple approach to predict $x(t)$ in a given time bin is to take the average number of rentals in the same time bin over the available past days. We compare all our prediction models to this baseline.
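A minimal sketch of such a baseline is given below, assuming that the history is a flat list of rentals per time bin that starts at the first bin of a day. The function name and the numbers are made up for illustration.

```python
def baseline_prediction(history, bins_per_day, target_bin):
    """Average the rentals seen in the same time bin over all past days.

    history is a flat list of rentals per time bin, oldest first, and is
    assumed to start at the first bin of a day. target_bin is the absolute
    index of the bin to predict.
    """
    same_bin = history[target_bin % bins_per_day :: bins_per_day]
    if not same_bin:
        raise ValueError("no past observations for this time bin")
    return sum(same_bin) / len(same_bin)


# Two past days with 4 bins per day (made-up numbers).
history = [5, 20, 30, 10,
           7, 24, 34, 12]
# Bin 9 is the second bin of the third day.
prediction = baseline_prediction(history, bins_per_day=4, target_bin=9)
```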
ARIMA. ARIMA (auto-regressive integrated moving average) is widely used to predict time series data. ARIMA models combine auto-regressive models with moving average models. Creating an ARIMA model involves specifying three parameters $(p, d, q)$. The $d$ parameter measures how many times we must difference the data to obtain stationary data. After determining $d$, we use the sample partial autocorrelation function to get the value of $p$. Finally, we determine the order $q$ by looking at the sample autocorrelation function of the differenced data. For simplicity, we restrict our grid search to find the best parameters values to the range . The combination that gave us the best results is .
SARIMA. A SARIMA model incorporates the seasonality (periodicity) of the data into an ARIMA model, enhancing its predictive power. For instance, when modeling a time series, it is often the case that the data has a daily, weekly, or monthly periodicity. We used our previous ARIMA model with an additional explicit daily seasonal component ( as the number of time bins in a day in our case).
Linear Regression. We fit a linear model by finding the coefficients that multiply each feature.
SVR. In our experiments, we use a Support Vector Regression (SVR) model with the following combination of parameters, which produce the best results among the values we tested: , , and , with the RBF kernel. The values for the parameters , and were evaluated in the range , and for the C parameter we considered the range , using exponential steps. The value 1000 was chosen once it provided a reasonable balance between model performance and generality.
RFR. Random Forest Regression is an ensemble learning method that can be used for regression. The prediction is based on the outcome of many decision trees, each of which is built with a random subset of the features. One advantage of random forests over linear regression is that the forest model can capture non-linearity. Another advantage is that random forests are interpretable models, i.e., they offer a ranking of the most important features for the prediction problem. Throughout the manuscript, interpretable refers to the fact that it is possible to understand the decision taken by the model. However, interpretability is not to be confused with explainability, which refers to the motivations of the decision; the latter is only possible via domain knowledge. Here, we use 50 decision trees. In this model, we use the default library parameters, but we evaluate the impact of different numbers of trees, for which the results are shown in the next sections.
Neural Networks. We also consider a Long Short-Term Memory (LSTM) Neural Network model. LSTMs have a memory that helps capturing past trends in the data, which may favor our prediction task. We experiment with several different architectures. In particular, we test different configurations for the architecture: the number of neurons varies in the range
for the first layer, and in the range
for the second layer. Because of the nature of the task (regression and not classification), the number of neurons in the third layer is set to one. The best results are obtained with a three-layer architecture where the input layer has 64 neurons (one for each feature), the dense layer has 4 neurons, and the output layer has one neuron. In our experiments, to balance prediction accuracy and training time, the model is trained for 50 epochs. As we will see in
Section 5.3, increasing the number of epochs to more than 50 has no significant effect (less than 1% reduction in the MAPE, on average) on performance.
5.2. Long-Term Predictions—Results
Here we predict the FFCS demand for cars in the future months given a model built on the previous months. We use in our experiments car-sharing usage data for the first nine months of 2017 in the city of Vancouver. Given the volume of rentals in the training period, we try to predict the number of rentals in the validation period. For that, we use a model that is trained once and then used to perform all the predictions in the validation period. Our training set consists of the volume of rentals for the first six months, and the validation data consists of volume of rentals for the next three months.
Table 1 shows the mean absolute percentage error (MAPE), the standard deviation of the APE, and the RMSE for each of the prediction models. The models that rely only on the time series (ARIMA and SARIMA) can capture some patterns in the data, as their performance is considerably better than the baseline. However, the multivariate models perform better, with Random Forest Regression reaching the best performance. In Figure 7 we show the comparison between the actual values and the prediction in one month of the validation set using the Random Forest Regression model (orange dashed line). Overall, the model predicts the daily and weekly periodicity of rentals quite well, but in general it slightly underestimates the actual number of rentals. This could be due to the fact that the training period covers the first six months of the year, during which the average number of rentals is lower than during the validation period (Fall season).
5.3. Short-Term Predictions—Results
We now tackle the problem of predicting the demand for cars in a city in the next time bin. Differently from the long-term predictions, we use adaptive models, i.e., the model is re-trained every time new data becomes available and is added to the training set. We here focus on the following prediction task: given the volume of rentals per time bin for a specific number of past days and the weather conditions in those days, predict the number of rentals in the next time bin.
We study this prediction task using two approaches: expanding window and sliding window. In the expanding window approach, after making the first prediction, we add the actual value to the training set, therefore increasing the amount of data available for training in the next step. To train our models, we first set aside 24 days of data for validation, and start with 28 days of training data. In the sliding window approach, after making the prediction we remove the oldest training data and add the actual value to the training set. Therefore, the training set size is always the same during the evaluation of the models. To train our models, we consider different sliding windows sizes (from 7 to 28 days), and validate on the same validation set of 24 days as with the expanding window.
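The difference between the two approaches can be sketched with a generic walk-forward loop. The placeholder model below simply predicts the mean of its training data; it stands in for any of the regressors we evaluate, and all names and numbers are assumptions made for the example.

```python
def walk_forward(series, initial_train, fit_predict, window=None):
    """One-step-ahead evaluation with re-training at every time bin.

    window=None  -> expanding window (the training set keeps growing)
    window=k     -> sliding window (only the k most recent points are kept)
    fit_predict  -> callable that trains on a list and returns one prediction
    """
    predictions = []
    for t in range(initial_train, len(series)):
        train = series[:t] if window is None else series[max(0, t - window):t]
        predictions.append(fit_predict(train))
        # The actual value series[t] joins the training data in the next
        # iteration, because the end of the slice moves to t + 1.
    return predictions


def mean_model(train):
    """Placeholder model: predict the mean of the training data."""
    return sum(train) / len(train)


series = [10, 20, 30, 40, 50, 60]  # made-up rentals per time bin
expanding = walk_forward(series, initial_train=3, fit_predict=mean_model)
sliding = walk_forward(series, initial_train=3, fit_predict=mean_model, window=2)
```

In this toy series with a steady upward trend, the sliding window tracks recent values more closely, whereas on data with stable patterns the expanding window benefits from the larger training set.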
In Table 2, we compare the performance of all models using the two approaches. The best results for the sliding window approach are obtained with the largest possible window (28 days). The expanding window approach offers slightly better results, which can be attributed to the fact that the model can exploit more data, while patterns are not changing rapidly in time. Again, the multivariate models, and in particular the Random Forest Regression model, reach the best performance. Interestingly, the Neural Network model performs similarly to other models, suggesting that for this specific use case, a simple and more interpretable model such as an RFR is enough. Furthermore, as shown in Figure 8, increasing the number of epochs does not have a significant effect on the performance of the Neural Network model.
We show in Figure 9 the performance of the best model, i.e., RFR with expanding window. In this short-term formulation of the problem, the prediction naturally adapts to changes over time, resulting in better predictions than in the long-term prediction scenario. Moreover, the weather data also provides useful information.
We now explore the importance of each feature for the model by analyzing the RFR feature ranking. When training a decision tree, it is possible to compute how much each feature decreases the tree’s weighted impurity. For a forest, the reduction in impurity from each feature can be averaged and the features can be ranked according to this measure. This gives simple and interpretable feedback on which features are most useful for the prediction. We find that the most important features for the model indicate whether the time bin (i) falls in the daily peak from 3 pm to 9 pm, (ii) falls during the night (0 am to 6 am), or (iii) falls on a Friday or Saturday. Interestingly, the most important weather condition for the regressors is the presence of clouds, while the second one is a (rare) condition of presence of fog, mist and rain in the considered time bin.
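To make the notion of impurity decrease concrete, the sketch below computes it for a single split on a binary feature, using the variance of the target as the impurity measure. This is a simplification: a random forest averages such decreases over all the nodes and trees in which a feature is used. The feature names and numbers are made up for the example.

```python
def variance(values):
    """Population variance, used here as the impurity of a node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def impurity_decrease(feature, target):
    """Weighted variance reduction obtained by splitting on a binary feature."""
    left = [y for f, y in zip(feature, target) if f == 0]
    right = [y for f, y in zip(feature, target) if f == 1]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(target)
    children = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(target) - children


# Made-up data: rentals per time bin and two hypothetical binary features.
rentals = [10, 12, 30, 32]
features = {
    "is_peak_hour": [0, 0, 1, 1],
    "is_cloudy": [0, 1, 0, 1],
}
ranking = sorted(
    features,
    key=lambda name: impurity_decrease(features[name], rentals),
    reverse=True,
)
```

In this toy example the peak-hour indicator separates low and high rental counts almost perfectly, so it ranks first.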
5.4. The Effect of Weather Information
At this point, it is relevant to discuss the importance of weather forecast for the predictions. First, for the long-term predictions, we did not use any weather information, as that would require perfect weather forecast in a period far in the future (in our case, three months). In order to validate the effect of weather in this idealized situation, we assumed such perfect forecast and evaluated our models using weather information as a feature. By assuming perfect forecast, we can set an upper bound on the effect of weather information on the models. Our results show that on average, weather information improves the models by about 3% on the MAPE.
Second, for the short-term predictions, we do use weather information. Again, we assume perfect weather forecast in the short-term (next three hours). This assumption is reasonable because weather forecast for such short periods should be quite close to perfect. By doing so, we filter out any dependence on the particular weather forecast technique used (which could vary across different cities/countries and is therefore out of the scope of our work).
According to the feature importance, among the features used for the short-term predictions (day of the week, hour of the day, and weather type), the weather is the least important feature. As such, we do not expect a great impact of weather mispredictions on our results. Indeed, our results with the random forests model (the one with the best performance among the models we evaluated) show that by removing weather information from the features the prediction accuracy decreases by less than 2% on the MAPE.
6. Spatial Prediction of Rentals with Socio-Demographic Data
We now shift our attention to predict the demand of cars in a neighborhood without using past data as features. In other words, given only socio-demographic data in the neighborhoods, we try to predict the average number of expected rentals at each time bin, and at each neighborhood. This problem is often referred to as a green field or cold start approach. In this case, the operator is interested in knowing what the system usage in a new neighborhood could be (or even a new city) based only on socio-demographic data. Historical data are available from other neighborhoods (or cities), and are used only for training.
Since we have 22 neighborhoods which constitute our dataset for the training step, we could suffer from an overfitting problem. To minimize this potential effect, we follow a state-of-the-art approach, namely leave-one-out testing: given a target neighborhood, we consider information from all other neighborhoods for training the learning model, and consider the neighborhood that we left out for validation.
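The leave-one-out procedure can be sketched as follows. The placeholder model ignores the features and predicts the mean target of the training neighborhoods; it only stands in for the regressors we actually use, and the neighborhood names and values are invented for the example.

```python
def leave_one_out(samples, fit, predict):
    """Hold out each neighborhood once and train on all the others.

    samples: dict mapping neighborhood id -> (features, target)
    fit:     callable(list of (features, target)) -> model
    predict: callable(model, features) -> predicted target
    Returns the absolute percentage error for each held-out neighborhood.
    """
    errors = {}
    for held_out, (features, target) in samples.items():
        train = [pair for name, pair in samples.items() if name != held_out]
        model = fit(train)
        estimate = predict(model, features)
        errors[held_out] = 100.0 * abs(target - estimate) / target
    return errors


def fit_mean(train):
    """Placeholder model: the mean target of the training neighborhoods."""
    return sum(target for _, target in train) / len(train)


def predict_mean(model, features):
    """The placeholder model ignores the features."""
    return model


samples = {"A": ([1.0], 100.0), "B": ([2.0], 200.0), "C": ([3.0], 300.0)}
errors = leave_one_out(samples, fit_mean, predict_mean)
```

Averaging the per-neighborhood errors gives the MAPE over the held-out experiments.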
We manually select 83 socio-demographic features that we think might be related to human mobility. Here, we only apply the Support Vector Regression and Random Forest Regression models, given that they were the best performing models (aside from neural networks) in the temporal prediction. We do not consider neural networks since these are known not to work well with a very small training set, as in this case. Additionally, since RFR is an ensemble method, it is known to be resilient to overfitting [9].
Considering hyper-parameter tuning, for SVR, we try three different kernels (linear, polynomial and RBF), with different combinations of parameters. The best performances are obtained for , ( for RBF), and ( for RBF). For RFR, we try number of trees ranging from 10 to 100. We show the impact of hyper-parameter tuning in the following.
Figure 10a and Figure 10b show the SVR prediction accuracy for the task of predicting the number of starting and ending rentals, respectively. For each kernel type and for each time bin, we report the average MAPE over the 22 experiments (one for each neighborhood that is left out during training). The SVR model performs rather poorly regardless of the parameter setting. Considering the targeted time bin, errors are higher for the morning slots, independently of the kernel, while the time bin from 0 am to 6 am is the one for which the model achieves the best performance. The polynomial kernel performs the best, yet the average MAPE (over all time bins) is 70% for the prediction of starting rentals, and 64% for the prediction of ending rentals. For the sake of completeness, the best RMSE values for starting and ending rentals predictions are 499.776 and 427.675, respectively, both for the time bin from 0 am to 6 am.
The results for the Random Forest Regression model are shown in Figure 11a,b, for different numbers of trees. For a given time bin, we observe limited variation in the MAPE as the number of trees increases, which suggests that a small number of trees (30 or 40 trees, for instance) could be enough. This is expected given, again, the limited number of training samples. For starting rentals, the overall MAPE is 59%.
Moving to the predictions for ending rentals in Figure 11b, we observe smaller errors, with the best case obtained with 20 or 40 trees and an overall MAPE of 56%. Again, we obtain the best predictions in the time bin from 0 am to 6 am, while the worst are obtained from 6 am to 9 am (for starting rentals prediction). Regarding the RMSE, the best value for starting rentals is 427.260 for the time bin from 0 am to 6 am and 50 trees, while for ending rentals the best RMSE is 732.825 for the time bin from 6 am to 9 am.
Overall, using only socio-demographic data as features results in a quite large prediction error. In the following, we investigate which features are the most important, so as to also perform feature selection and possibly improve the model.
Feature Ranking and Selection
As in the previous section, we here analyze the feature ranking for the RFR model. Table 3 reports the top-15 most relevant features along with their relevance. On the one hand, this feature ranking procedure allows us to identify what information the FFCS operator should focus on when considering new neighborhoods of the city in which to implement its service. On the other hand, it allows us to reduce the number of features to use in the model: we can focus only on the most important ones.
To evaluate the impact of the features on the performance of the model, we train the RFR once again with an increasing number of features, chosen according to the given rank. We fix the number of trees according to the best average MAPE obtained in Figure 11a,b: 40 trees for the starting and 20 for the ending rentals prediction. Figure 12 shows the results. It reports the MAPE versus the number of features in the model. Notice the U-shaped curve of the average MAPE (dashed black line). Intuitively, too few features worsen the regression performance due to lack of information, but too many features also reduce the performance since the training becomes more complicated.
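The selection of the number of features can be sketched as a simple search over the top-ranked subsets. In the sketch below, the evaluation function is a stand-in for retraining the RFR and measuring the average MAPE; the feature names and error values are made up and only reproduce the U-shaped behavior qualitatively.

```python
def best_feature_count(ranked_features, evaluate):
    """Try the top-1, top-2, ... ranked features and keep the best subset size.

    evaluate: callable(list of feature names) -> average MAPE (lower is better)
    """
    scores = {}
    for k in range(1, len(ranked_features) + 1):
        scores[k] = evaluate(ranked_features[:k])
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Stand-in for "retrain the model and measure the MAPE": a U-shaped curve
# with made-up values, lowest at three features.
fake_mape = {1: 62.0, 2: 48.0, 3: 37.0, 4: 44.0, 5: 58.0}
ranked = ["f1", "f2", "f3", "f4", "f5"]  # hypothetical feature names
best_k, scores = best_feature_count(ranked, lambda subset: fake_mape[len(subset)])
```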
We further evaluate the RFR model by selecting the best number of features (the one that minimizes the average MAPE), which results in selecting the top 7 features in Table 3. With this subset, the average MAPE is 41% and the RMSE is 1104.501 for starting rentals, while for ending rentals the MAPE is 39% and the RMSE is 1010.453. As expected, using only the most important features significantly improves the performance.
Finally, we explore the spatial prediction error, i.e., we check whether there are neighborhoods that present significantly higher errors than others. Figure 13 depicts the heatmap of the MAPE per neighborhood, averaged over all time bins. The redder the area, the higher the average MAPE. Each green dot represents the actual position of a starting or ending rental as recorded in the original trace. The neighborhoods with the highest error are the ones labeled 15, 18, 11, and 0. We can see that neighborhoods 15, 18, and 0 are in the periphery and intersect only partially with the rental area of the FFCS operator. This mismatch hampers the prediction since our model assumes the operative area coincides with the total area of each neighborhood. Thus, our model predicts much higher numbers of rentals (reflecting the whole neighborhood area) than the ones that are actually done (reflecting the restricted operational area). Neighborhood 0 has instead a large presence of parks where clearly the car cannot operate. As such, the features of this area do not reflect the entire area either, which misleads the model.
In general, the performance of the spatial predictions is lower when compared to the temporal predictions. This is expected given the nature of the problem, the limited amount of available data, and because the number of rentals varies widely within each neighborhood. However, we would like to emphasize that the results of the spatial prediction could be quite useful: the ranking of the regions in terms of service demand is indeed preserved in the predictions. In other words, the neighborhood with the largest demands, which could be the preferred locations to extend the service, would still be predicted correctly.
7. Conclusions
In this paper, we studied the problem of predicting FFCS demand patterns in time and space, a relevant problem to an adequate provisioning of the service and maintenance of the fleet. Relying on data from real FFCS rides in Vancouver as well as the municipality socio-demographic information, we investigated to which extent modern machine-learning-based solutions allow us to predict the transportation demand.
Our results show that the temporal prediction of rentals can be performed with relative errors down to 10%. In this scenario, Random Forests Regression is consistently among the best models, and it also allows us to discover which features are more useful for prediction. When considering the spatial prediction using socio-demographic data, we obtain relative errors around 40%, after feature selection. This is expected due to the scarcity of data, but the prediction results are still useful. Indeed, since the number of rentals varies widely within each neighborhood, the relative ranking is preserved. This is valuable for, e.g., looking for the area where to first extend the service. Again, using a Random Forest Regression model, we can observe which features are the most useful for the prediction, valuable information for providers and regulators that wish to understand FFCS systems and to provide a high-quality service that benefits both providers and their customers.
As future work, we would like to investigate whether this same strategy generalizes to different cities. Answering this question is challenging due to the heterogeneity and diversity of open data in different cities, and of usage patterns of car sharing around the world. We conjecture that given similar data the methodology could be applied to other cities, as there is nothing specific to the analyzed city in it. However, the effectiveness of the models may change depending on peculiarities of each city. Still, it is an open problem towards which we have provided an important first step.