The experiment we designed has three phases, namely, Data Collection, Data Analysis and Preparation, and Applying Machine Learning Models.
Figure 1 illustrates our entire experimentation process. The main reason behind designing the three phases of experimentation is to work with the real data that a rider receives and/or provides. Moreover, this design is imperative to our goal, which is to provide each specific rider with a generic platform that can relay useful information from as little input data as possible, without depending on the model or manufacturer of the e-mobility device.
An application is employed to gather raw data, which are subsequently stored in the cloud. This application may obtain information from a variety of sources, such as user interactions or APIs. Once the raw data are collected, they undergo analysis to identify patterns, followed by data preparation for the machine learning process, which involves cleaning and transforming the data. To evaluate the performance of various ML algorithms, 10-fold cross-validation is employed, utilizing KStar (K*), random forest (RF), support vector machine (SVM), and additive regression (AR). After applying these ML algorithms using 10-fold cross-validation, model evaluation is conducted using performance metrics including mean absolute error (MAE), root mean squared error (RMSE), and root relative squared error (RRSE). More details are provided in the subsequent sections.
3.1. Data Collection
An application named “Ddanigo LLC. Navigation app” has been developed under the company Ddanigo LLC, 4701 Patrick Henry Dr #23, Santa Clara, CA 95054, USA [26] for collecting data and integrating machine learning techniques in the background, inspired by [8,22]. The mobile application uses Flutter to ensure cross-platform availability for both Android and iOS devices.
In addition to the application, we have also compiled a comprehensive dataset containing technical information for two-wheelers worldwide. Our database contains information for 732 two-wheelers, each of which has been assigned a unique ID (named “E_ID” in storage) for easy retrieval and analysis of technical information. This dataset encompasses a range of attributes, including the two-wheeler name, model, year, weight, voltage (V), battery capacity in watt-hours (Wh), and maximum mileage in kilometers (km). By analyzing this diverse and extensive dataset, we aim to gain valuable insights into the performance and efficiency of various two-wheeler models.
This dataset was prepared to support the user registration process in the application. When a user registers in the application, they are required to provide details about their two-wheeler, which are then matched against the information in our database. To simplify this process for users, we only require basic information, such as the name, model, and manufacturing year, which can be easily found on the manufacturer’s website. By matching the user’s two-wheeler details with the technical information in our dataset, the application’s database creates another unique ID (named “UE_ID” in storage) that combines both the user and two-wheeler information. This ensures that users can easily access accurate technical specifications for their two-wheeler without needing to worry about the details themselves. However, if a user’s two-wheeler is not found in the dataset, they can still register their vehicle by providing basic information. The application will then generate a unique ID, even if technical specifications are not available.
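The matching and ID-creation step described above can be sketched as follows. This is a minimal illustration, not the application's actual implementation: the field names, the example database entry, and the hash-based UE_ID scheme are all assumptions.

```python
import hashlib

# Illustrative database entry; real entries hold the 732 two-wheelers'
# technical attributes keyed by E_ID.
TWO_WHEELER_DB = [
    {"E_ID": "E001", "name": "CityRider", "model": "CR-1", "year": 2021,
     "weight_kg": 22, "voltage_v": 36, "battery_wh": 360, "max_range_km": 60},
]

def register(user_id, name, model, year):
    """Match the user's basic details against the database and derive a
    UE_ID combining user and two-wheeler information."""
    for entry in TWO_WHEELER_DB:
        if (entry["name"], entry["model"], entry["year"]) == (name, model, year):
            key = f"{user_id}:{entry['E_ID']}"
            ue_id = "UE" + hashlib.sha1(key.encode()).hexdigest()[:8]
            return ue_id, entry  # registered with full technical specs
    # Two-wheeler not in the dataset: register anyway, specs unavailable.
    key = f"{user_id}:{name}:{model}:{year}"
    ue_id = "UE" + hashlib.sha1(key.encode()).hexdigest()[:8]
    return ue_id, None

ue_id, specs = register("user42", "CityRider", "CR-1", 2021)
```

The key point is that a UE_ID is always produced, with the technical specifications attached only when a database match is found.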
Once the two-wheeler has been set up, the application can provide navigation services. While providing navigation, the application collects data in the background, which are saved to cloud storage after each individual trip. The data collected during each trip fall into four categories: two-wheeler data, environmental data, route data, and rider behavior data. The choice of these data for range prediction is well founded, as these categories are directly related to the factors that can influence the range of a two-wheeler. As discussed in Section 2, the authors of [8] explored the impact of similar factors, namely two-wheeler, environmental, route, and bicyclist effort data, where bicyclist effort was measured in terms of power. The study in [22] also focused on comparable factors but used a different aspect of bicyclist performance, namely average speed. While measuring power accurately often requires additional sensors and hardware, our goal is to provide a solution that does not rely on extra equipment. Instead, we focused on easily accessible information that can be collected through a smartphone. By concentrating on distance, time, and speed as rider behavior metrics, we can analyze and understand performance without the need for specialized sensors.
The experimentation was carried out with an Android device only. The application requires internet connectivity and GPS to perform data collection. The routes were selected based on accessibility, the ability to perform rides without interruption, distinguishable differences in elevation throughout the route, and good reception of GPS and internet connectivity. Furthermore, the same route was used to obtain data for different battery levels to ensure a proper distribution of data. In Table 3, we list the data collected by the application during the experimental test rides, indicating which data are collected every second of the ride and which are input by the user. All data are collected separately for every ride.
We collected data from approximately 100 trips conducted by a single rider, with an average trip distance of around 7.46 km and an average duration of around 24.67 min. The rider was male, around 28 years of age, and weighed around 68 kg. The data were gathered using the application, which was provided to the rider on a device for use during trips along the designated test routes. All routes were covered during summer days.
The application only asks the rider to input the battery levels at the start and end of each trip. The rest of the data are collected and calculated from different sensors and APIs within the application in the background throughout the ride. We used the “Flutter Weather” plugin to collect the weather type and temperature at the rider’s current location during navigation. The application was integrated with the Google Maps API as a map server to collect route information, and we used the Directions API to obtain the list of routes. The GPS sensor, along with the geocoding plugin from Flutter, provides relevant information regarding location, distance, and speed every second. Subsequently, we stored all these raw data in cloud storage for further analysis and calculation because of the restricted storage and processing capability of mobile devices [27].
The collection of personal data, such as user locations, riding habits, or other identifiable information, raises privacy concerns. To address this issue, all information related to the users was securely stored in a central database. The collected data were strictly used for the intended research purposes, and no unauthorized usage or sharing of the data was allowed. Proper access controls were implemented to ensure that only authorized personnel had access to the data for research purposes.
3.2. Data Analysis and Preparation
In this phase of the experiment, we assembled the dataset from the stored raw data. Although the stored data contained a record for every second of a trip, we converted the data such that one trip became one instance. First, we converted the battery level data collected from the rider. We observed that the range can differ based on the initial battery state: when the initial battery level was 80–100%, the range provided by the e-mobility device was higher, while the last 20% of charge provided less range than the other states. Based on this observation, we categorized the initial battery levels into five categories, named A, B, C, D, and E, ranging from 80 to 100%, 60 to 79%, 40 to 59%, 20 to 39%, and 0 to 19%, respectively. Subsequently, we converted the final battery level to total battery consumption using the following formula:

Battery Consumption (%) = Initial Battery Level (%) − Final Battery Level (%)   (1)
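The battery-level conversion described above can be sketched as follows. This is a minimal illustration of the five-category mapping and the consumption calculation; the function names are ours.

```python
# Map the initial battery level to categories A-E and turn the final
# level into total consumption, as described in the text.

def battery_category(initial_level):
    """Map an initial battery level (0-100%) to category A-E."""
    if initial_level >= 80:
        return "A"  # 80-100%
    if initial_level >= 60:
        return "B"  # 60-79%
    if initial_level >= 40:
        return "C"  # 40-59%
    if initial_level >= 20:
        return "D"  # 20-39%
    return "E"      # 0-19%

def battery_consumption(initial_level, final_level):
    """Total battery consumed over the trip, in percentage points."""
    return initial_level - final_level
```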
Although different studies have shown the impact of environmental factors on range prediction, and we collected the environmental data for each trip, we could not use these data in our study, as they were almost constant in every single case. Therefore, we omitted them from our final dataset, since their impact would be nonexistent. We also omitted the location data, as they play no role in range prediction either.
In our next step, we determined the elevation gain for the entire trip. Since we have the elevation data for every second, we needed to ascertain the total for the entire trip. To perform the calculation, we used the following Formula (2), summing the positive differences between consecutive elevation samples:

Elevation Gain = Σ_{i=1}^{n−1} max(e_{i+1} − e_i, 0)   (2)

where n is the number of elevations in a trip and e_i is the i-th elevation sample.
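The per-trip elevation gain computation can be sketched in a few lines, under the assumption (stated above) that only positive differences between consecutive per-second samples contribute to the gain.

```python
# Sum only the positive differences between consecutive per-second
# elevation samples; descents do not reduce the gain.

def elevation_gain(elevations):
    """Total elevation gain for a trip from per-second samples."""
    return sum(max(b - a, 0) for a, b in zip(elevations, elevations[1:]))
```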
Finally, we converted the speed data into three fields, namely, total acceleration, total deceleration, and total stop time. To achieve this, we first calculated acceleration and deceleration from the per-second speed data. Then, we counted the number of times the rider accelerated and decelerated in a single trip. Furthermore, the total stop time was calculated based on the number of times the speed went to 0. The following formulas were used for these calculations:

a_i = v_{i+1} − v_i

Total Acceleration = |{i : a_i > 0}|,  Total Deceleration = |{i : a_i < 0}|,  Total Stop Time = |{i : v_i = 0}|
Furthermore, the distance was recorded in meters, as it was collected every second from the previous position to the new position in order to clearly represent the actual travel distance rather than the distance provided by the APIs. The summation of these distances was then converted into kilometers for inclusion in the final dataset:

Total Distance (km) = (1/1000) Σ_{i=1}^{n} d_i

where n is the number of instances in the stored data and d_i is the distance travelled in second i.
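The conversion of per-second speeds and distances into trip-level features can be sketched as follows, under the counting interpretation described above (accelerations and decelerations as event counts, stop time as the number of stopped seconds).

```python
# Derive trip-level rider behavior features from per-second samples.

def speed_features(speeds):
    """Count accelerations, decelerations, and stopped seconds."""
    diffs = [b - a for a, b in zip(speeds, speeds[1:])]
    total_acceleration = sum(1 for d in diffs if d > 0)
    total_deceleration = sum(1 for d in diffs if d < 0)
    total_stop_time = sum(1 for v in speeds if v == 0)
    return total_acceleration, total_deceleration, total_stop_time

def total_distance_km(distances_m):
    """Sum per-second distances (metres) and convert to kilometres."""
    return sum(distances_m) / 1000.0
```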
As explained in Section 3.1, each specific two-wheeler in our database is assigned a unique ID (named “UE_ID” in storage) that is created by combining the specific user’s information with the technical information of the specific two-wheeler. Since the user will always be making trips with their own two-wheeler, the technical details associated with the UE_ID remain constant. Rather than including all of the technical information for the specific two-wheeler in each instance of data collected during trips, we can simply use the UE_ID to reference the technical information stored in our database. Therefore, it is not necessary to store extensive technical information about the two-wheelers, such as total range, battery capacity, or motor type, in every instance, as the UE_ID can be used to identify and track the vehicles accurately. For this reason, the technical details of each two-wheeler are not included in the final version of the dataset.
The final processed dataset, after all the processing, omission, and conversion, has 7 attributes in total; 6 of them are used as input features for the machine learning models, and distance is the output. In Table 4, these attributes and their descriptions are listed. The dataset contains a total of 100 instances, with each instance representing a different trip. Furthermore, we checked for irregularities and missing values in the final dataset and found none.
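The sanity check mentioned above can be sketched as a simple scan over the processed instances. The attribute names below are our assumption, loosely following the processing steps; they are not guaranteed to match Table 4 exactly.

```python
# Scan the processed instances for missing values and obviously
# irregular (negative) numeric entries. Attribute names are assumed.

ATTRIBUTES = ["battery_category", "battery_consumption", "elevation_gain",
              "total_acceleration", "total_deceleration", "total_stop_time",
              "distance_km"]

def find_issues(instances):
    """Return (row_index, attribute) pairs with missing or negative values."""
    issues = []
    for i, row in enumerate(instances):
        for attr in ATTRIBUTES:
            value = row.get(attr)
            if value is None:
                issues.append((i, attr))
            elif isinstance(value, (int, float)) and value < 0:
                issues.append((i, attr))
    return issues
```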
Figure 3a,b illustrates scatter plots displaying the relationship between distance and the various factors of the processed dataset. Since the output variable is a continuous numerical value, a regression model is a more suitable choice for predicting the distance than a classification model. Therefore, the processed dataset was ready for the next phase.
3.3. ML Models
Figure 4 illustrates the step-by-step process of the machine learning algorithms and their evaluation used in the study. The flowchart provides a clear visual representation of how the parameters were processed to develop the machine learning model and how it was subsequently evaluated. The process starts with the raw data, which are collected from various sources. The collected data are then preprocessed to remove any noise or unwanted information. After preprocessing, the relevant features are extracted from the data and selected based on their importance. These features are then divided into two datasets: a training set and a validation set.
The training set consists of 60 data points and is used to train the machine learning model via 10-fold cross-validation, allowing it to learn patterns and relationships between the features and the target variable. In this technique, the original dataset is randomly divided into 10 equal-sized subsets, also known as “folds”. The model is then trained on 9 of the folds and tested on the remaining fold. This process is repeated 10 times, each time using a different fold for testing and the remaining folds for training. Furthermore, the training results were used to select the best model for the validation process. The validation set, on the other hand, comprises 40 data points and is used to assess the model’s performance. The best performing model is then evaluated using the validation dataset and the evaluation metrics.
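The fold-splitting scheme described above can be sketched as follows. This abstracts away the model itself and only shows how the 60 training points are partitioned into 10 folds, each serving once as the test fold.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random fold assignment
    folds = [indices[i::k] for i in range(k)]  # k equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each of the 10 iterations trains on 54 points and tests on the held-out 6, and every point is tested exactly once across the 10 folds.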
In this study, we utilized six different machine learning algorithms to predict the range, which is a continuous numerical value. We chose regression algorithms as they are commonly used to predict continuous values. To conduct this study, we used the Java programming language with the Weka library. All the algorithms were coded in the same program, and the evaluation results were written to a CSV file. We used three core algorithms, KStar (K*), random forest (RF), and support vector machine (SVM), and additionally used each of them as the base learner of additive regression (AR), bringing the total number of algorithms to six. We applied 10-fold cross-validation to these six algorithms and stored the results in a CSV file.
KStar (K*): As a classifier, KStar relies on the similarity between training and test samples to determine the label for a given occurrence. To differentiate itself from other instance-based learners, it uses a distance function based on entropy. To assign a label to an instance, instance-based learning systems consult a database of labeled examples. The essential assumption is that analogous cases are categorized in the same way. There is a need to settle on a common understanding of “similar instance” and “similar class”. The associated parts of an instance-based learner are the distance function, which determines the degree of similarity between two instances, and the classification function, which explains how the instance similarity generates a final classification for the new instance [28].
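The two components named above (a distance function plus a prediction function over stored labeled examples) can be illustrated with a minimal instance-based learner. Note this uses a plain Euclidean nearest-neighbour distance as a stand-in; K*'s entropy-based distance is considerably more involved and is not reproduced here.

```python
# Minimal instance-based learner: a pluggable distance function and a
# prediction function over stored (features, label) examples.

def euclidean(a, b):
    """Stand-in distance function (K* uses an entropic distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbour_predict(train, query, distance=euclidean):
    """Predict the label of `query` from the most similar stored instance."""
    _, label = min(train, key=lambda item: distance(item[0], query))
    return label

# Two stored examples: features -> label.
train = [((0.0, 0.0), 1.0), ((5.0, 5.0), 9.0)]
```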
Random forest (RF): To classify data, random forest uses a collection of decision trees that are generated at random. Building on bagging, random forests introduce even more randomness: each tree is built from a different bootstrap sample of the data, and the tree construction process is altered to improve accuracy. In a typical tree, each node is split using the optimum split across all variables; in a random forest, each node selects a subset of predictors at random and uses the best of them. This seemingly counterintuitive approach outperforms a wide variety of classifiers, including discriminant analysis, support vector machines, and neural networks, and is highly resistant to overfitting [29]. For regression, RF can be described as the average of the predictions of its B randomly grown trees:

RF(x) = (1/B) Σ_{b=1}^{B} T_b(x)

where T_b(x) is the prediction of the b-th tree for input x.
Support vector machine (SVM): SVM-based regression models are helpful for modeling complicated relations that cannot be well captured by lower-order polynomial equations. Because of its high power and great generalization capacity, SVM is widely used for tackling issues involving pattern recognition, classification, regression, and prediction. SVM can be described with the following equation [30]:

f(x) = w · Φ(x) + a

where w is the weight factor, Φ represents the mapping function, x is the input vector, and a is the bias.
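The form f(x) = w · Φ(x) + a can be illustrated numerically as follows. The quadratic feature map used here is an arbitrary example of Φ chosen for the sketch; in practice the mapping is usually implicit via a kernel.

```python
# Evaluate the SVM regression form f(x) = w . Phi(x) + a with an
# example feature map Phi (input augmented with squared terms).

def phi(x):
    """Example mapping function Phi: append squared terms to the input."""
    return list(x) + [v * v for v in x]

def svm_predict(w, x, a):
    """f(x) = w . Phi(x) + a for weight vector w and bias a."""
    features = phi(x)
    return sum(wi * fi for wi, fi in zip(w, features)) + a
```

For example, with x = (2, 3), Φ(x) = [2, 3, 4, 9]; the prediction is just a dot product in that feature space plus the bias.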
Additive model (AM): One type of nonparametric regression technique is the additive model (AM). The AM constructs a subset of nonparametric regression models with the help of a one-dimensional smoother [31]. A simplified representation of the training procedure for additive models is:

F_m(x) = F_{m−1}(x) + ν · f_m(x)

Here, F_m(x) is a model ensemble that improves upon the previous model by combining m weak learners f_1, …, f_m to approximate the true solution. The scaling factor ν, which can range from 0 to 1, reduces the contribution of each iteration in an effort to prevent the model from overfitting.
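The update F_m = F_{m−1} + ν · f_m can be sketched with a deliberately trivial weak learner (a constant fitted to the current residuals) so the whole loop fits in a few lines. In the study the base learners were K*, RF, and SVM, not constants; this only illustrates the shrinkage mechanism.

```python
# Additive training loop with a constant weak learner: each iteration
# fits the mean of the current residuals and adds it, scaled by nu.

def fit_additive(targets, m=50, nu=0.5):
    """Return the ensemble's (constant) prediction after m iterations."""
    prediction = 0.0
    for _ in range(m):
        residuals = [y - prediction for y in targets]
        weak = sum(residuals) / len(residuals)  # constant weak learner f_m
        prediction += nu * weak                 # scaled contribution
    return prediction
```

With a constant weak learner the ensemble converges toward the target mean; the scaling factor ν only controls how quickly each iteration's contribution accumulates.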