Article

Forecasting Hydrogen Vehicle Refuelling for Sustainable Transportation: A Light Gradient-Boosting Machine Model

Howard College Campus, University of KwaZulu-Natal, Durban 4041, South Africa
*
Author to whom correspondence should be addressed.
Sustainability 2024, 16(10), 4055; https://doi.org/10.3390/su16104055
Submission received: 19 March 2024 / Revised: 7 May 2024 / Accepted: 9 May 2024 / Published: 13 May 2024
(This article belongs to the Section Sustainable Transportation)

Abstract

Efficiently predicting and understanding refuelling patterns in the context of hydrogen fuel vehicles (HFVs) is paramount for optimising fuelling processes, planning infrastructure, and facilitating vehicle operation. This study evaluates several supervised machine learning methodologies for predicting the refuelling behaviour of HFVs. The light gradient-boosting machine (LightGBM) model emerged as the most effective predictive model due to its ability to handle time series and seasonal data. The selected model integrates various input variables, encompassing refuelling metrics, day of the week, and weather conditions (e.g., temperature, precipitation), to capture intricate patterns and relationships within the data set. Empirical testing and validation against real-world refuelling data underscore the efficacy of the LightGBM model, demonstrating minimal deviation from actual data despite the limited data available and thereby showcasing its potential to offer valuable insights to fuelling station operators, vehicle manufacturers, and policymakers. Overall, this study highlights the potential of sustainable predictive modelling for optimising fuelling processes, planning infrastructure, and facilitating vehicle operation in the context of HFVs.

1. Introduction

There is a move towards cleaner forms of fuel. The transport sector has been found to contribute to about 22.9% of emissions worldwide [1]. Hydrogen fuel vehicles (HFVs) are noted to be the future of this transition as they produce zero emissions once on the road [2]. Hydrogen fuel as an alternative to traditional gasoline and diesel has gained momentum recently, and HFVs have become increasingly available to consumers [3,4]. However, as with any new technology, many unknowns and potential challenges must be addressed. One such challenge is understanding the refuelling behaviour of drivers and its impact on refuelling infrastructure [4]. Adequate infrastructure must be in place to support higher adoption rates of these vehicles [4], but establishing a hydrogen refuelling infrastructure is expensive because of insufficient vehicle demand [2,4,5,6].
Developing countries like South Africa are invested in integrating renewable energy sources. One of the primary goals in South Africa’s Hydrogen Society Roadmap is decarbonising transportation by the year 2050 [7]. However, the availability of refuelling stations impacts “driver concern” and, consequently, the decision to adopt HFVs [8]. Additionally, it has been noted that a well-established refuelling network is required for HFVs, or any other vehicles, to operate commercially [9].
In addressing challenges associated with HFVs, predicting trip-related variables, such as arrival time, destination, and refuelling behaviour, plays a crucial role [9,10]. Limited studies focus on HFV-related refuelling behaviour [2,11,12,13]. Most studies consider factors such as energy costs, consumer preferences, price sensitivity to fuel, fuel efficiency, and socio-economic and technological factors [14,15]. However, very few factors, such as time of year (day of the week, month, etc.) and weather conditions, are considered, and these are also seen to have a plausible impact on HFV behaviour [16,17]. As noted in previous work [18], time of year and weather conditions can affect the efficiency and availability of refuelling stations and the range and performance of the vehicles themselves. Previous work [19] provided further insight into the impacts of driving context on driving behaviour and, consequently, on fuel consumption. Both infrastructure characteristics and weather conditions were found to cause speed reductions and fuel consumption increases. The study further noted that factors like temperature extremes, adverse weather conditions, and limited refuelling options during storms can lead to altered driving behaviours, increased energy usage, and concerns about range, influencing how often drivers refuel and plan their trips in HFVs. Another previous work [20] noted that, in colder temperatures, some HFVs showed increased fuel consumption of approximately 0.79 to 1.26 L per 100 km for every 1 °C drop in ambient temperature below 18.3 °C. Conversely, in warmer temperatures, these HFVs experienced an increase in fuel consumption of about 0.23 to 0.49 L per 100 km for every 1 °C increase in ambient temperature above 18.3 °C [20]. This suggests that HFVs struggle at lower temperatures. Hence, understanding how drivers respond to these weather conditions and adjust their refuelling behaviour is critical for successfully deploying and adopting HFVs. The primary outcome of this study is the selection and design of a suitable machine learning model that can most accurately forecast refuelling trips made by HFVs.
Various statistical and machine learning methodologies have been advanced to predict refuelling behaviour inherent in HFV trips. These encompass methodologies such as linear regression, autoregressive integrated moving average (ARIMA), support vector regression (SVR), and vector autoregressive moving average (VARMA) [21,22]. ARIMA models are explicitly designed for forecasting time series, providing straightforward interpretation due to well-established mathematical foundations. They offer insights into underlying time series components, such as trend and seasonality, and work well for stationary and well-behaved time series. ARIMA and VARMA models also handle missing data and outliers more efficiently than machine learning models. However, these traditional models require more domain knowledge and expertise in time series modelling than machine learning models, making them potentially more challenging to implement and tune [23]. They may struggle to capture complex nonlinear relationships and lack flexibility in handling different types of data, such as non-stationary or seasonal data, limiting their effectiveness in incorporating external variables [23,24]. On the other hand, machine learning models, specifically supervised machine learning (SML), offer distinct advantages for time series forecasting as they provide flexibility in capturing complex and nonlinear relationships within the time series data. Equipped with kernel functions, SMLs can adapt to various data patterns, making them suitable for tasks where relationships are not strictly linear and where seasonality and trends are intricate. SMLs can efficiently handle multiple features, allowing the incorporation of additional information like weather data or external variables into the forecasting process [25,26,27]. While ARIMA models excel with more straightforward stationary time series data, SMLs perform well with more dynamic and challenging data sets, providing enhanced forecasting accuracy and adaptability in real-world scenarios. It is crucial to consider the computational requirements and potential complexity of SMLs compared with ARIMA, along with the need for careful feature engineering and parameter tuning for optimal results. In terms of SML models, options to consider include [23,25,26,27,28,29]:
  • Linear regression: simple and interpretable, suitable for cases with primarily linear relationships.
  • Decision trees and random forest: excel in capturing nonlinear relationships and feature importance.
  • Gradient boosting (e.g., XGBoost or LightGBM): provides high accuracy, handles nonlinearity, and adapts to intricate patterns in time series data.
  • Recurrent neural networks (RNNs): These specialise in capturing temporal dependencies in sequential data and are valuable for intricate time-dependent patterns. However, they require substantial historical time series data and can be computationally expensive.
Hence, this paper considers several SML techniques, including linear regression, random forest, decision trees, gradient boosting, and LightGBM, to select the best one for developing a predictive model. The model aims to forecast refuelling trips based on weather and time of year [27,28,30]. This study contributes to the existing literature by selecting and designing an effective SML predictive model that can predict HFV refuelling behaviour in relation to weather and time of year (day of week, month) [31]. The development of models for HFVs is essential because of the scarcity of data specific to these vehicles [3,13,18,32]. Unlike conventional vehicles, which have extensive historical data available, HFVs are relatively new and data on their refuelling behaviours are limited. This study aims to fill this gap by developing a model based on available data and extracting and transforming the data to match HFV profiles by making assumptions on refuelling patterns based on current driving trends and other impacting factors such as weather. Moreover, while past research has focused on energy costs, consumer choices, and fuel efficiency, this study adds refuelling data, weather conditions, and the time of year. This paper considers traditional variables and environmental influences in the context of the time of year with the aim of providing practical insights for fuelling station operators, vehicle manufacturers, and policymakers. These predictions can provide valuable insights into the refuelling behaviours of HFVs under different environmental conditions. They can inform infrastructure planning, fuelling processes, and operational efficiency for HFVs, contributing significantly to cleaner transportation solutions.

2. Materials and Methods

2.1. Data and Analysis

Before constructing any model, it is essential to ensure the relevance of the data to the study. Acquiring the required data posed challenges as they were not readily available, necessitating extraction from various sources and subsequent transformation to align with this study’s needs. The primary data source was the United States Department of Transportation (Bureau of Transportation Statistics) [33], encompassing information on trips taken in New York City by distance in 2019 (pre-COVID-19) and corresponding weather conditions over the same period (data obtained from the study [34]). Conventional vehicle refuelling data were employed, building upon findings in various research works, including [1,2], which demonstrated that HFVs exhibit similar behaviour to traditional vehicles regarding driving range and refuelling capacity. Studying general refuelling patterns could provide insight into HFV behaviour, especially without extensive HFV-specific data. Ultimately, developing a model that can predict the number of refuelling trips a vehicle will make (i.e., trip count) based on any day in the year, the temperature, and the precipitation on that day can prove helpful to countries looking to adopt the technology. Although this model used general vehicle data from New York County, it still allows analogies between cities of the same size to be drawn to help predict future trip counts and trends expected for HFVs once adoption escalates [14,35,36,37,38].
Throughout this paper, the underlying assumption is that HFV drivers exhibit a refuelling pattern like that of conventional vehicle drivers. However, the data will be extracted and transformed to match HFV profiles to distinguish their refuelling behaviours from those of traditional vehicles. The assumptions that hold true in this paper to ensure the uniqueness of this study are presented in Section 2.5.1. While acknowledging this assumption, this study emphasises the potential for future model reconsideration with more reliable and denser HFV data [39]. The chosen data set was strategically selected for its transposability to cities of similar sizes within a country of choice, accommodating a maximum of 23,921 refuelling trip counts/observations per day.
Also, temporal relevance and alignment with research objectives guided the selection of pre-COVID-19 data. This deliberate choice aimed to establish a baseline, ensure the examination of stable patterns unaffected by anomalies, and provide valuable insights into underlying trends and behaviours forming the basis of the analysis. Moreover, the paper assumes that yearly trends remain consistent, thus utilising the chosen data set to predict patterns for the year.
Examining time series data involves employing exploratory data analysis (EDA) to unveil patterns, trends, and other inherent characteristics. Since time series data typically comprise a sequence of observations collected over time, EDA becomes instrumental in identifying patterns, trends, or irregularities within the data set. An integral aspect of EDA involves utilising statistical modelling techniques to delve deeper into the data and extract crucial insights. A commonly employed method for this purpose entails breaking down time series into their components, such as trend and seasonal elements [23].
In this study, the analysis employed the statsmodels library within the Python programming language (version 3.11) to decompose the time series data under consideration. Several outcomes were derived using this library to enhance the understanding of the data before the model development phase, with a specific emphasis on the seasonal data to establish the relationship between weather conditions (temperature and precipitation) and the trip counts (i.e., the number of refuelling trips a vehicle will make). This is shown in Table 1.
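As an illustration of this decomposition step, the minimal sketch below splits a daily trip-count series into trend, seasonal, and residual components using the statsmodels library. The file name and column names (“Date”, “Trip Count”) and the weekly period are illustrative assumptions, not the exact identifiers or settings used in this study.

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load the daily trip-count series (file and column names are illustrative)
df = pd.read_csv("nyc_trips_2019.csv", parse_dates=["Date"], index_col="Date")

# Decompose into trend, seasonal, and residual components;
# a weekly period (7 days) is assumed for daily observations
decomposition = seasonal_decompose(df["Trip Count"], model="additive", period=7)

decomposition.plot()  # inspect trend and seasonality before model development
plt.show()
```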
The observations obtained from the EDA are important because they reveal trends and patterns that can be deduced from the data, making it possible to investigate ways to model these trends and forecast or predict future data accurately [26,27,40].

2.2. Model Design

The aim is to develop a model that can predict the number of refuelling trips a vehicle will make based on any day in a year and on temperature or precipitation levels. The model design methodology is presented below:
  • Data preparation (previous section):
    • Collect data on vehicle counts, including date/time and relevant weather features (e.g., temperature, precipitation).
    • Evaluate data by performing an EDA to understand the general trends.
    • Pre-process the data by handling missing values and encoding categorical variables for time series modelling.
    • Extract and transform the data to match HFV profiles.
  • Feature engineering: extract and create relevant features from the date/time, such as day, day of the week, and month.
  • Split data: split the data set into training and testing sets to accurately evaluate the model’s performance using techniques such as time-based splitting.
  • Model selection options [24,25,26]:
    • Linear regression: a simple baseline model.
    • Decision trees and random forests: can capture nonlinear relationships and feature interactions.
    • Gradient boosting: algorithms like XGBoost or LightGBM often perform well in time series forecasting.
    • RNN: not evaluated due to its dependency on large amounts of historical data and computational resources.
  • Hyperparameter tuning: optimise the hyperparameters of the chosen models using techniques like cross-validation.
  • Model evaluation: evaluate model performance using metrics like mean absolute error (MAE), root mean squared error (RMSE), percentage error, and standard deviation.
  • Forecasting: once a satisfactory model has been found, use it to make real-time predictions or generate forecasts.

2.2.1. Feature Engineering

Feature engineering is a pivotal process involving creating or modifying features from raw data to enhance machine learning model performance. This entails selecting, transforming, or generating features to make the data more suitable for modelling, thereby improving the model’s capacity to capture patterns and relationships [3]. Machine learning models utilise feature-importance scores to identify the most impactful features on the target variable. Analysis of feature importance guides selection and engineering efforts [3,41,42]. This paper’s primary quantity of interest is the target variable “Trip Count” (y), and the model’s goal is to predict this value accurately. The relevance of the target variable is reflected in evaluation metrics like mean squared error (MSE), RMSE, or MAE, measuring how closely model predictions align with actual counts [5,34,41]. Date-related features (year, month, day of week) and weather-related features are used, providing valuable information for accurate predictions. Date features capture temporal patterns, enabling the model to recognise trends during specific months or weekends. Transforming raw dates into numerical features like year, month, and day of the week enhances the model’s interpretability and ability to recognise patterns and correlations, contributing to more accurate predictions.
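A minimal sketch of this feature engineering step is shown below, assuming the same illustrative column names as in the earlier decomposition sketch; the derived date features and the weather features form the feature matrix, and “Trip Count” is the target.

```python
import pandas as pd

# Illustrative sketch: derive date-related features from a 'Date' column
df = pd.read_csv("nyc_trips_2019.csv", parse_dates=["Date"])

df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["Day of the Week"] = df["Date"].dt.dayofweek  # 0 = Monday, 6 = Sunday

# Feature matrix and target variable (column names are assumptions)
feature_cols = ["Day", "Day of the Week", "Month", "Temperature", "Precipitation"]
X = df[feature_cols]
y = df["Trip Count"]
```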

2.2.2. Splitting Data

Splitting data into training and testing sets is crucial for building forecasting models with machine learning algorithms [39]. A “split variable” is the criterion or mechanism used to divide the data set into two subsets: a training set and a testing set. Testing a model on unseen data is necessary to assess how well it performs; not all data are used to train the model, so independent data are left for evaluation to determine how well the model generalises to new, unseen data. Splitting the data ensures that a portion of the data is reserved for testing and model evaluation [39,43]. Machine learning models can overfit training data and, hence, learn to fit the noise in the data rather than the underlying patterns; splitting the data into training and testing sets helps check for overfitting. Hyperparameter tuning techniques like grid search or random search are performed on the training data, and it is essential to find the best hyperparameters based on the model’s performance on a separate validation set. Splitting data based on periods is beneficial for forecasting models for several reasons [44]:
  • Seasonal patterns: splitting the data based on periods in the year allows these seasonal variations to be captured and modelled more effectively to account for varying behaviours.
  • Improved generalisation: Splitting the data into distinct periods can build separate models for each period. This can be particularly useful when the relationships between features (date and weather) and the target variable (trip count) vary throughout the year. For instance, weather variables might have a more substantial impact during certain seasons.
  • Customised models: Splitting the data by period allows the forecasting models to be tailored to each period’s specific characteristics. Using different features, hyperparameters, or even different algorithms for different time segments can enhance the accuracy of the predictions.
  • Evaluating seasonal performance: By splitting the data by season, the performance of the forecasting models can be evaluated separately for each season. This provides insights into how well a model can handle different parts of the year and helps identify seasonal biases or issues.
Splitting data ensures the creation of more customised and accurate models for different seasons, ultimately improving forecasting accuracy and reliability [40,44].
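The sketch below illustrates a simple time-based split of the kind described above, continuing from the earlier feature-engineering sketch. One plausible reading of the “split variable” reported in Section 3, consistent with the training/testing ratios quoted there, is the number of trailing observations held out for testing; this interpretation is an assumption made here for illustration only.

```python
# Time-based split: earlier observations train the model, the most recent
# observations are held out for testing (no shuffling, preserving order)
split_variable = 111  # illustrative value; each model's optimum is found in Section 3

X_train, X_test = X.iloc[:-split_variable], X.iloc[-split_variable:]
y_train, y_test = y.iloc[:-split_variable], y.iloc[-split_variable:]
```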

2.3. Model Selection

This section evaluates several SML techniques to select the SML that best predicts refuelling trips using the available data sets. All chosen models fall under the umbrella of supervised learning and vary in complexity and approach, which is desired, as the models are to be compared to select the best to use for forecasting refuelling behaviour. These models prove advantageous for handling data limitations such as missing values, outliers, and categorical features. Additionally, some of the selected models provide robustness to noise and outliers, i.e., they can filter out irrelevant information and focus on underlying patterns—an advantage when dealing with data limitations such as small sample sizes, missing values, or noisy data [4,27,45,46,47]. The models are evaluated on performance (performance metrics presented in Section 2.4), model scalability, feature engineering, hyperparameter flexibility, and robustness to assess their usability further.

2.3.1. Linear Regression

Linear regression is a simple and interpretable model that aims to find a linear relationship between the input features and the target variable [48]. Linear regression assumes that there is a linear relationship between the features and the target, that the errors are normally distributed, and that there is little to no multicollinearity among the features. In the context of forecasting trip count with date and weather data, linear regression can be used as follows:
  • Model equation: linear regression represents the relationship between the features (date-related and weather-related) and the target variable (trip count) as a linear equation, where $y$ is the predicted trip count, $b_0$ is the intercept (the value of $y$ when all features equal 0), and $b_1, b_2, \ldots, b_n$ are the coefficients of the features $x_1, x_2, \ldots, x_n$:
    $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$
  • Training: during training, the model aims to find the values of $b_0, b_1, b_2, \ldots, b_n$ that minimise the MSE between the predicted and target values in the training data.
  • Prediction: to make predictions, values for the input features are provided, and the model calculates the predicted trip count using the learned coefficients.
  • Interpretability: Linear regression provides interpretable coefficients, making it easy to understand how each feature affects the target variable. For instance, the coefficient of the “temperature” feature can be analysed to see how temperature impacts trip count.
  • Linear regression models are scalable to large data sets and typically handle them well due to their simplicity. However, this model does not inherently perform automatic feature selection. It relies on feature engineering or regularisation techniques for feature selection. Regarding hyperparameter flexibility, linear regression has minimal hyperparameters, primarily related to regularisation techniques. These models are also sensitive to outliers and may not generalise well to unseen data if the underlying assumptions (linearity, independence of features) are violated.
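A minimal sketch of this baseline model, continuing from the earlier feature-engineering and splitting sketches, fits the linear equation above and inspects its coefficients; it is illustrative only and not the study’s exact configuration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Fit the baseline linear model on the date and weather features
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred = lin_reg.predict(X_test)
print("MAE:", mean_absolute_error(y_test, y_pred))

# Interpretable coefficients, e.g., the effect of temperature on trip count
for name, coef in zip(feature_cols, lin_reg.coef_):
    print(f"{name}: {coef:.2f}")
```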

2.3.2. Decision Trees

Decision trees are nonlinear models representing decisions or choices in a tree-like structure [29,45]. They are used for both classification and regression tasks. Decision trees are constructed by recursively splitting the data based on the most informative features. Each internal node represents a decision based on a feature, and each leaf node represents the predicted outcome (e.g., trip counts). Decision trees use criteria (e.g., Gini impurity, entropy, or MSE) to determine the best feature to split on at each node. The goal is to minimise impurity or variance. To make predictions, begin at the root node and traverse the tree based on the feature values until a leaf node, which provides the predicted value, is reached [45]. The basic notation of a decision tree includes [45]:
  • An internal (decision) node representing a decision or test on a feature.
  • A leaf (terminal) node representing the final prediction or classification.
  • An edge that represents the outcome of a decision and leads to the next node.
  • A branch that is the path from the root node to a leaf node.
Figure 1 below shows a simple decision tree.
In this diagram, the root node tests “Feature A” with a condition (e.g., less than or equal to 30). If the condition is true, it follows the left branch to the “Class 0” leaf node. If the condition is false, it follows the right branch to another decision node. The process continues until a leaf node is reached, which then assigns a class label or prediction. Overall, decision trees are interpretable, easy to visualise, and can capture complex decision boundaries in the data [45]. A simplified example of applying a decision tree (in relation to this study) uses a single feature, “Temperature”, to predict whether people will travel on a particular day based on temperature conditions. It should be noted that the numbers used in the decision tree model are obtained from the derived data set (see Table 1). During training, the decision tree model extracts these numbers and uses them to create “decisions”. In essence, decision trees are like a flow chart of decisions: a series of “if/else” questions, each with a possible answer or outcome, leading to a final choice. As indicated in previous work [45], having hundreds of these trees put together with other features of note allows accurate predictions to be made.
A simplified case is presented in Figure 2, where the decision making starts with a question, for example, whether the temperature is less than or equal to 40; if it is, then the trip count is 10,000, else it is 15,000. It should be noted that this is a highly simplified example for illustrative purposes.
In Figure 2, the root node represents the initial decision based on the “Temperature” feature. The condition is “Temperature ≤ 30 °C”. If the condition at the root node is true (e.g., the temperature is 28 °C), the left branch is followed, leading to the “Trip Count = 20,000” leaf node. This means there is less likelihood of travel on a day with a temperature of 28 °C or less. If the condition at the root node is false (e.g., the temperature is 35 °C), the right branch is followed, leading to another decision node based on the “Temperature” feature with the condition “Temperature ≤ 40 °C”. If this condition is true, it leads to the “Trip Count = 10,000” leaf node (indicating that people are less likely to travel on hotter days). If this condition is false, it leads to the “Trip Count = 15,000” leaf node (indicating that people are likely to travel on days with temperatures between 30 °C and 40 °C). This logic explains the decision process of the tree; the actual model is more complex as it contains a much larger number of trees. Hence, the decision tree uses the “Temperature” feature in this example to predict trips. This is further expanded to add all the features in the tree, for example, “Day of the Week”, “Month”, and “Precipitation”. The decision tree demonstrates how to incorporate multiple features, such as “Day of the Week”, “Month”, and “Temperature”, to make predictions about trip count. The tree splits the data based on the conditions of each feature and assigns predictions at the leaf nodes based on the combinations of feature values and conditions. In practice, decision trees have more complex structures, with numerous branches and conditions, to capture relationships between multiple features and the target variable [26,45]. The decision tree forms the basis for all the other models to follow. However, since decisions lead to further branching, allowing nuanced relationships between different variables to be captured, it is not practical to visualise the full model, as the trees would overlap and clutter the figure. Hence, Figure 1 and Figure 2 are presented to add context.
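To make this concrete, a minimal sketch (with illustrative settings, not the study’s exact configuration) fits a regression tree on the training split from the earlier sketches and prints its “if/else” structure, analogous to Figures 1 and 2.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Regression tree on the date and weather features; max_depth is illustrative
tree = DecisionTreeRegressor(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

# Print the learned if/else structure of the tree
print(export_text(tree, feature_names=feature_cols))
```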

2.3.3. Random Forest

Random forests are an ensemble of decision trees that combine the predictions of multiple individual trees. Random forests use a technique called bootstrap aggregating (“bagging”). They create numerous decision trees by randomly sampling the training data with replacement. Each tree is trained on a different subset of the data. In addition to bagging, this technique introduces randomness by considering only a random subset of features at each split [49]. This prevents overfitting and de-correlates the trees. When making predictions, each tree in the random forest independently predicts the outcome. The final prediction is obtained by taking a majority vote (for classification) or an average (for regression) of the individual tree predictions. Random forests are robust against overfitting, handle high-dimensional data well, and can capture complex relationships in the data. In its application, each tree in the random forest is a decision tree that uses the “Temperature” feature to predict trip counts. Each tree can produce a different prediction based on its unique structure and the data on which it was trained. To make a prediction using the random forest, input data (e.g., the temperature on a specific day) is first passed through each tree. Each tree produces its prediction [49].
Mathematically, if there are $N$ trees in the random forest and each tree makes a prediction $P_n$, the final prediction $P_{final}$ can be calculated as shown in Equation (2):
$P_{final} = \mathrm{Mode}(P_1, P_2, \ldots, P_N)$
where $\mathrm{Mode}$ represents the mode function, which finds the most frequently occurring class among the trees’ predictions. In a regression task where a numerical value (e.g., the number of trip counts) is being predicted, the final prediction is often calculated as the mean (average) of the predictions from all the trees:
$P_{final} = \frac{P_1 + P_2 + \cdots + P_N}{N}$
This ensemble approach helps improve the model’s accuracy, generalisability, and robustness by reducing the impact of overfitting that can occur with a single decision tree.
In terms of the model selection criteria, decision tree and random forest models perform well, as they can capture complex nonlinear relationships and interactions between features. Additionally, they can handle large data sets but may experience scalability issues if the number of trees in a random forest is large. The models can select features by considering subsets of features at each split and then identifying the most informative ones for prediction. These models are also robust against overfitting thanks to ensemble averaging and feature randomisation, but it is noted that they could struggle with highly imbalanced data sets.
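A brief sketch of the bagged ensemble, continuing from the earlier sketches, is given below; the number of trees is an illustrative choice, and the prediction for this regression task is the average of the individual tree predictions, as described above.

```python
from sklearn.ensemble import RandomForestRegressor

# Bagged ensemble of regression trees trained on bootstrap samples
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Each tree predicts independently; the ensemble returns their mean
y_pred_rf = rf.predict(X_test)
```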

2.3.4. Gradient Boosting

Like random forest, gradient boosting is an ensemble learning technique that combines the predictions of multiple weak learners, usually decision trees, to create a robust predictive model. It works by iteratively improving the model’s predictions based on the errors made in previous iterations [24]. For example, with the same “Temperature” feature introduced earlier, each iteration focuses on correcting the error made in the previous iterations. Initially, the model starts with a simple tree (Tree 1) and makes predictions based on it. In the second iteration (Tree 2), the model tries to correct errors made by Tree 1. The technique learns which areas of Tree 1 were incorrect and then creates a new tree to address those errors. Similarly, in the third iteration (Tree 3), the model focuses on correcting the errors made by Tree 1 and Tree 2. It creates a new tree that further refines the predictions. Gradient boosting combines the forecasts of individual trees in a weighted manner. The final prediction is the sum of the forecasts from all iterations (trees) multiplied by a learning rate (α), which controls the contribution of each tree [24].
$P = \alpha \times P_1 + \alpha \times P_2 + \cdots + \alpha \times P_N$
where $P_n$ represents the prediction made by the $n$th tree in the ensemble. The learning rate $\alpha$ is typically a small value like 0.1 or 0.01; it controls the step size of each iteration and helps prevent overfitting.
Gradient boosting effectively combines the strengths of multiple models and often produces highly accurate predictions. It is commonly used in both regression and classification tasks. Popular implementations include XGBoost and LightGBM. XGBoost is a popular gradient-boosting algorithm widely used for time series forecasting. It is similar to the LightGBM regressor and offers advantages such as fast and efficient training and the handling of missing data. XGBoost also supports time series-specific features such as lagged variables and rolling window statistics. In terms of performance, XGBoost is generally comparable to the LightGBM regressor, and the choice between the two may come down to personal preference or specific requirements of the problem [39]. LightGBM, on the other hand, stands out for its exceptional speed and memory efficiency. Its algorithmic approach is fundamentally different from that of XGBoost. LightGBM uses a gradient-based one-side sampling (GOSS) technique and exclusive feature bundling (EFB), resulting in a histogram-based approach for tree growth [24,26,40]. This histogram-based approach allows LightGBM to efficiently handle large data sets with millions of rows and columns. Unlike XGBoost, LightGBM can handle categorical features natively, eliminating the need for one-hot encoding and saving memory and computational resources. While LightGBM might not always match XGBoost’s predictive accuracy, it excels in training speed, making it particularly well suited for large-scale and time-sensitive applications. LightGBM’s simplified parameter tuning approach and potential for faster model development make it a popular choice [24]. Additionally, the model offers robustness against overfitting, thanks to regularisation and early stopping techniques.
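Both boosted ensembles can be fitted in essentially the same way, as sketched below with illustrative hyperparameters (the learning rate plays the role of $\alpha$ above); this is not the exact configuration used in this study.

```python
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Illustrative boosted ensembles; n_estimators and learning_rate are assumptions
lgbm = LGBMRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)

lgbm.fit(X_train, y_train)
xgb.fit(X_train, y_train)

y_pred_lgbm = lgbm.predict(X_test)
y_pred_xgb = xgb.predict(X_test)
```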

2.4. Hyperparameter Tuning, Code Development, and Model Performance Metrics

Hyperparameters are settings that are configured before an SML model is trained; they are not learned from the data. Selecting the right hyperparameters is essential, since they significantly impact a model’s ability to generalise from training data to new, unseen data. It should be noted that each model comes with default hyperparameters that can be accessed in the model documentation [50]. The models are all evaluated in Python (version 3.11) using the following Python libraries:
  • Pandas: used for data extraction, transformation, and analysis.
  • NumPy: a numerical computation library, used here for determining standard deviation.
  • Matplotlib: used for all plotting of graphs and figures.
  • Scikit-learn (sklearn): used for the linear regression, decision tree, and random forest models and for performance metrics.
  • XGBoost: used for the XGBoost gradient-boosting model.
  • LightGBM: used for the LightGBM gradient-boosting model.
Figure 3 presents the overall model framework outlining the above-detailed method and content. The framework also presents the process of running the model and the essential parts involved in the model training process.
An overview of the code structure designed, using the model framework (Figure 3) as a basis, is presented in Figure 4:
The code begins by loading the time series data set using the Pandas library and extracts additional temporal features, including month, day of the week, and day. It then defines a function that performs the time series forecasting. Within this function, the data are split into feature and target sets, and a portion of the data is reserved for testing. Hyperparameter tuning is conducted using GridSearchCV to identify the best hyperparameters through cross-validation. In Python, this involves systematically searching through a predefined hyperparameter grid, training the model with different combinations of hyperparameter values, and selecting the set that performs best. GridSearchCV automates this process and performs cross-validation to assess the model’s generalisation performance. With the optimal values, the model is trained on the training data. Predictions are generated on the testing data, and statistical performance checks are calculated. The following model performance metrics are considered to evaluate the performance of the various SML models [51,52]:
  • MSE is a metric used to measure the average squared difference between the predicted values and the actual (ground truth) values. It quantifies the average squared “errors” or discrepancies between predictions and true values. A lower MSE indicates better model performance, meaning the predictions are closer to the actual values.
  • MAE is a metric used to measure the average absolute difference between predicted and actual values. It quantifies the average absolute “errors” or discrepancies between predictions and true values. Like MSE, a lower MAE indicates better model performance.
  • R-squared or coefficient of determination measures the proportion of the variance in the dependent variable (target) explained by the independent variables (predictions). It provides a value between 0 and 1, where 0 indicates that the model does not explain any variance, and 1 indicates that the model perfectly describes the variance. A higher R-squared value suggests better model performance.
  • Standard deviation calculates the prediction’s standard deviation compared with the actual data. It measures the spread or dispersion of data points around the mean (average) value. A higher standard deviation indicates more significant variability in the data.
  • Percentage error measures how much the residuals vary relative to the average count of the target variable. This relative error can help assess the dispersion of residuals relative to the overall scale of the target variable.
In the code, these metrics assess the model’s performance and provide insight into how well the model’s predictions align with the actual data. Lower MSE and MAE values and higher R-squared values indicate better model performance, while the percentage error provides an additional measure of the variability of the residuals relative to the mean of the “trip count” variable.
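Putting the above together, the following sketch mirrors the described code structure: a grid search with cross-validation over an illustrative hyperparameter grid, followed by computation of the performance metrics. The grid values and the percentage-error definition used here (residual standard deviation relative to the mean trip count) are assumptions for illustration, not the study’s exact configuration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from lightgbm import LGBMRegressor

# Illustrative hyperparameter grid; the actual grid used in the study may differ
param_grid = {
    "num_leaves": [15, 31, 63],
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [100, 300, 500],
}

search = GridSearchCV(
    LGBMRegressor(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),      # cross-validation that respects time order
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
y_pred = best_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
std_dev = np.std(y_test - y_pred)
pct_error = std_dev / y_test.mean() * 100  # residual spread relative to mean trip count (assumed definition)

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}  std={std_dev:.2f}  %error={pct_error:.2f}")
```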

2.5. Assumptions and Limitations

2.5.1. Assumptions

This study makes use of the following model requirements and assumptions to ensure that the extracted and transformed data match HFV profiles:
  • This study considered only trips by distance in NYC county, specifically for 2019 (pre-COVID-19).
  • The features used to assist the model in predicting trip counts are temperature and precipitation, as noted in previous work [53].
  • For a trip, it is assumed that HFV drivers will refuel at the end of a trip; thus, these data are used to extract driving trends.
  • An HFV is assumed to be able to travel 320 km before requiring refuelling. This is based on the typical trip range taken from the data set to comply with the typical range a hydrogen vehicle would need to travel before requiring refuelling [11,18].
  • An HFV driver refuels when the refuelling tank is at 30% capacity.

2.5.2. Limitations

This study has certain limitations that were considered, including:
  • Data quality: The accuracy of the predictions depends heavily on the data quality used to train the model. If the data used in this study are inaccurate or biased, the model’s performance may suffer.
  • Data volume: SML algorithms require large amounts of data to train effectively. The data set used in this study, which reflects the refuelling behaviour of general vehicle users, is relatively small; with small data sets, the model may not generalise well to new data.
  • Changing behaviours: Refuelling behaviour may change over time due to new technologies and fuel sources becoming available. As a result, the model may become less accurate over time and require regular updates to maintain its performance.
  • Difference between conventional vehicles and HFVs: There are differences between traditional vehicles and HFVs. For example, the refuelling process for HFVs involves a different method and typically takes longer than conventional refuelling. Therefore, data on HFV refuelling times and equipment specifications can be integrated into machine-learning models. Additionally, the SML models can be tailored to account for HFV-specific factors such as fuel efficiency, energy consumption rates, and the impact of external factors like temperature on hydrogen storage, ensuring more accurate predictions of individual HFV refuelling behaviour.
  • Simplicity of models: although linear regression provides transparent insights into the linear relationships between variables, and decision trees offer an intuitive understanding of feature importance, both models are simple and have limitations in capturing complex nonlinear patterns.

3. Results

This section presents the key findings. All SML models presented in the previous section were evaluated in relation to the performance metrics outlined above. The results of each model are given below, whereafter the most apt SML technique was selected and trained to predict refuelling trips.

3.1. Linear Regression Model Design and Results

As the linear regression model was chosen for its simplicity and ease of implementation, it served as the reference point for the other models. Although the conventional 80% training and 20% testing ratio is typically recommended, it proved ineffective due to the limited data set. Consequently, a sensitivity analysis was conducted to identify the optimal “split variable” for each model, aiming to minimise the MAE and establish the most effective training ratio. In this instance, the lowest MAE occurred at a split variable of 91, corresponding to a 70% training and 30% testing ratio, as shown in Figure 5.
Utilising this split variable allowed for the attainment of optimal MAE and the model’s peak performance. Figure 6 compares actual and predicted data using the linear regression model, revealing a percentage error of 24.46%.
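The sensitivity analysis described above can be sketched as a simple scan over candidate split points, reusing the feature matrix X and target y from the earlier sketches; the search range and the interpretation of the split variable as a test-set length are assumptions made here for illustration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Scan candidate split points and record the test MAE for each (cf. Figure 5)
results = {}
for split in range(60, 200):
    X_tr, X_te = X.iloc[:-split], X.iloc[-split:]
    y_tr, y_te = y.iloc[:-split], y.iloc[-split:]
    model = LinearRegression().fit(X_tr, y_tr)
    results[split] = mean_absolute_error(y_te, model.predict(X_te))

best_split = min(results, key=results.get)
print(f"Best split variable: {best_split} (MAE = {results[best_split]:.1f})")
```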

3.2. Decision Tree Design and Result

Applying the same approach as the linear regression model, the identified split variable for this method was 158, corresponding to a training ratio of 48% and a test ratio of 52%. This distribution was suboptimal for a machine learning model, as the lower training ratio indicated a larger portion reserved for testing. This configuration increased the risk of overfitting, potentially limiting the model’s capacity to capture intricate data patterns, especially if the training set inadequately represented the overall data distribution due to data scarcity.
The suboptimal train and test ratios were a consequence of insufficient data. The model achieved its lowest MAE of 3281.437 at a split variable of 158, as illustrated in Figure 7 and Figure 8.
Using this split variable to achieve optimal MAE and, consequently, peak model performance, the decision tree model’s actual and predicted data were compared. The calculated percentage error was 24.15%, slightly outperforming the linear regression model.

3.3. Random Forest

In the random forest technique, the identified split variable was 160, corresponding to an MAE of 2735.608. This split variable translated to a training ratio of 47% and a test ratio of 53%. Although suboptimal, like the previous models, this distribution was influenced by the limited data set. Figure 9 and Figure 10 illustrate the results. The calculated percentage error was 21.79%, indicating improved performance compared with the decision tree model.

3.4. Gradient Boosting: XGBoost

In evaluating the gradient boosting model, the split variable was at 123, as shown in Figure 11.
This correlated to a training ratio of 59% and a test ratio of 41%. The percentage error between predicted and actual values was 24.38%, worse than that of the random forest model, as shown in Figure 12.
The large differences between predicted and actual values at some data points were attributed to the limited data set. Some of the data had outliers; however, since it was essential to preserve data integrity, and outliers may have represented genuine data points that were valid and meaningful, these were maintained to ensure no valuable information was discarded at the expense of the model.

3.5. Gradient Boosting: LightGBM

LightGBM achieved an optimal MAE of 2758.255 with a split variable of 111, corresponding to a training ratio of 63% and a test ratio of 37% (see Figure 13). This represented the most favourable testing and training ratio.
This resulted in a percentage error of 19.40%, showcasing the most significant improvement compared with other models. Figure 14 depicts this.
A training ratio of 80% and a test ratio of 20% were employed for the evaluation. The recorded percentage error of 19.40% was deemed acceptable, suggesting that the model’s performance remained promising, mainly when applied to larger data sets. Furthermore, it is crucial to highlight that the recorded error percentage was based on rigorous model evaluation and validation processes. While the error value may appear significant, it is believed it was within an acceptable range considering the complexity of the problem domain and the available data resources [54]. It should be noted that the achieved outcomes were scrutinised against actual data points to gauge the model’s effectiveness, highlighting the disparities between predicted and actual values. Specifically, the MAE evaluation metric was utilised to gauge the model’s accuracy concerning the chosen hyperparameters. This initial assessment served a dual purpose: firstly, to validate the reliability of predicted outcomes, and secondly, to facilitate fine-tuning of the model for enhanced results. It is emphasised in [5,55] that ensuring the accuracy of predictive models is pivotal, as accuracy directly influences the quality of predictions.

3.6. Performance Model Metrics Analysis

Table 2 summarises and compares the performance metrics of the models studied.
The models were compared in relation to MAE, MSE, RMSE, R-squared, standard deviation, and percentage error.
  • MAE: This metric gauged the average absolute difference between predicted and actual values. A lower MAE signified better accuracy. The random forest model stood out with the lowest MAE in Table 2, indicating its superior accuracy in predicting trip counts compared with other models.
  • MSE: MSE computed the average squared differences between predicted and actual values; because errors are squared, larger errors are penalised more heavily and the metric is sensitive to outliers. LightGBM displayed the lowest MSE (16,420,069.79), indicating smaller squared errors and few large deviations.
  • RMSE: The square root of MSE provided an error measure in the same units as the target variable. Like MSE, lower RMSE values indicated better accuracy. LightGBM again excelled, with the lowest RMSE (4052.16), implying more accurate predictions.
  • R²: This measured the proportion of the variance in the target variable explained by the model. Values closer to 1 signified a better fit. Random forest had the highest R-squared value (0.509), followed closely by LightGBM (0.503), indicating that these models explained more of the variance in trip counts than others.
  • Standard deviation of the error: It gauged the spread or variability of prediction errors. A lower standard deviation suggested more consistent and less variable predictions. LightGBM showcased the lowest standard deviation (3667.53), indicating its predictions were more consistent across different time points.
  • Percentage error: A lower error percentage implied better accuracy. LightGBM achieved the lowest error percentage (19.40%), suggesting it deviated the least from actual counts on average. The results favoured random forest and LightGBM, with LightGBM performing slightly better overall. This indicated its effective learning of underlying patterns in time series data, handling temporal and nonlinear dependencies more accurately than other models.
Also, it is highlighted that a lower split variable value indicated a larger portion of the data set allocated to training, whereas a higher split variable value indicated a larger portion allocated to testing. The split variable did not directly represent model complexity or stability; rather, it affected how much data the model was trained on versus tested on. A lower MAE indicated better performance, representing the average magnitude of errors between predicted and actual values, and the LightGBM model exhibited this.

3.7. LightGBM Model Output

Based on this comparison, LightGBM was taken forward and trained to predict refuelling trip counts. Figure 15 presents a snippet of the LightGBM model trained to predict future vehicle refuelling trips of HFVs. This image only shows one of the split features and its tree structure; other features like day, month, temperature, and precipitation were considered.
In this part of the tree structure (Tree 0), the decision-making process was based on the “Day of the Week” feature. When a data point reached this node in the tree, the value of the feature was compared to a threshold of 2.5. If the feature value was less than or equal to 2.5, the left child node (1) path was chosen, and the predicted value for this leaf was 19,529.85. The leaf had a weight of 27, indicating the number of data points that reached this leaf, and a count of 27, which was the same as the weight in this case. Conversely, if the feature value was greater than 2.5, the right child node (2) path was taken, and the predicted value for this leaf was 19,644.09. This process represented a binary split in the decision tree based on the “Day of the Week” feature.

3.8. Feature Importance

Feature importance is a metric used to determine how much each feature contributes to the predictive performance of a model. The LightGBM algorithm was implemented in Python and relied on libraries such as Pandas and LightGBM. During the first iteration, the model’s input features, including day, temperature, day of the week, month, and precipitation, were considered to understand their impact on the model. The day attribute referred to the numerical day of the month, ranging from 1 to 31. For example, if the DateTime object was set to 13 March 2023, the day attribute would be 13. The day of the week attribute was an integer ranging from 0 to 6, where 0 means “Monday” and 6 represents “Sunday”. So, if the DateTime object was 13 March 2023, the day of the week attribute would be 0, indicating that it was a Monday. The results are shown in Figure 16, which indicates that each feature played a significant role in training the model, with temperature having the highest significance and precipitation the lowest. Additionally, the contribution of wind was deemed negligible, as no meaningful variation was observed in the data.
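A brief sketch of how such a feature-importance plot can be produced for a fitted LightGBM model is given below; ‘lgbm’ refers to the fitted model from the earlier sketch, and the importance type shown is an illustrative choice.

```python
import matplotlib.pyplot as plt
from lightgbm import plot_importance

# Plot how often each feature is used for splitting across all trees (cf. Figure 16)
plot_importance(lgbm, importance_type="split")
plt.show()
```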
The LightGBM model was run for a future period. Since the data set was limited to only one year, there were limitations to the relevance of its predictions. Ideally, at least 20 years of data are required. Nevertheless, the concept was proven, and the model was validated. To forecast the future trip counts, the following steps were followed:
  • Import the trained LightGBM model developed.
  • Create a new DataFrame (data_2020) for 2020 with the same features used during model training.
  • Load the trained LightGBM model.
  • Use the loaded model to make predictions for the year 2020 using the features in the data.
  • Create a new data frame to store the predictions and the corresponding date.
  • Visualise the predictions for the year chosen using Matplotlib.
  • Plot.
Accurate trip count values for 2020 can now be predicted using the designed model, as shown in Figure 17.
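A minimal sketch of these steps is given below. The weather inputs for 2020 are constant placeholders, and ‘best_model’ and ‘feature_cols’ refer to objects from the earlier sketches; in practice, forecast or historical-average weather values would be supplied.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Build a 2020 feature frame with the same columns used during training
dates_2020 = pd.date_range("2020-01-01", "2020-12-31", freq="D")
data_2020 = pd.DataFrame({
    "Day": dates_2020.day,
    "Day of the Week": dates_2020.dayofweek,
    "Month": dates_2020.month,
    "Temperature": 15.0,       # placeholder weather inputs
    "Precipitation": 0.0,
})

# Predict with the tuned LightGBM model and store results with their dates
predictions = pd.DataFrame({
    "Date": dates_2020,
    "Predicted Trip Count": best_model.predict(data_2020[feature_cols]),
})

# Visualise the forecast for the chosen year
predictions.plot(x="Date", y="Predicted Trip Count")
plt.show()
```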
Overall, the LightGBM model emerged as the most effective predictive model due to its ability to handle time series and seasonal data. This study integrated various input variables, including refuelling metrics, day of the week, and weather conditions, to capture intricate patterns and relationships within the data set. This model was the primary outcome of this paper. It was designed based on data from 2019, and its outputs were successfully compared to actual data for that year. Because limited data were available for HFVs, data from 2019 were used, as they were found to be the most consolidated. From Figure 17, it is evident that future-year forecasts, such as for 2020, can be predicted accurately. This becomes especially important for HFVs, as very few data exist to conduct research with. Additionally, the empirical testing and validation against real-world refuelling data underscore the efficacy of the LightGBM model, demonstrating a minimal deviation from actual data given the limited data available and showcasing its potential.

4. Conclusions

This study aimed to develop an accurate predictive model for estimating the refuelling behaviour of HFVs. Five machine learning models were considered (linear regression, decision tree, random forest, gradient boosting with XGBoost, and LightGBM) for their capacity to handle nonlinear relationships and diverse data types. Each model was trained, evaluated, and selected based on performance metrics such as MAE, RMSE, and MSE. LightGBM provided the most accurate results, so it was chosen to construct an HFV-specific predictive model. The model proved to be the most effective predictive model due to its ability to handle time series and seasonal data. Empirical testing and validation against real-world refuelling data underscored the efficacy of the LightGBM model, with only a 19.4% deviation between predicted and actual data despite using a small data set for this proof of concept. This precision indicated the model’s reliability in forecasting refuelling patterns with the limited data available. With a larger data set, improvements in accuracy are expected. The developed model can fine-tune predictions with limited data and can inform decision-making related to infrastructure development, such as determining the optimal placement of refuelling stations.
While conventional vehicle refuelling patterns served as a starting point, the models developed in this paper considered factors specific to HFVs, such as their range limitations, refuelling patterns, and potential sensitivity to temperature. This study’s novelty is particularly noteworthy in the HFV domain, where machine learning and LightGBM applications have not been extensively explored [56,57]. Also, developing models for HFVs is essential because of the scarcity of data specific to these vehicles [12,58]. This paper fills this gap by developing a practical model based on the data available and key assumptions. This study also reveals that weather conditions, mainly temperature, significantly impact predicting refuelling patterns. Thus, environmental factors should be considered while planning and managing refuelling infrastructure to pave the way for more sustainable strategies.
This study suggests the inclusion of additional variables, such as traffic conditions, driving patterns, and user preferences specific to HFVs, to enhance the model’s performance. Expanding this study to include a larger data set and a broader geographical area could provide a more comprehensive understanding of HFV refuelling behaviour. The prediction model presented in this study acknowledges the challenges of real-world data sets and offers a means to estimate future trip counts and predict hydrogen fuel consumption, which is particularly relevant in regions lacking such data, such as developing countries like South Africa.
This research is unique because it constructs an HFV-specific predictive model from a relatively small data set using established SML frameworks and the LightGBM model. This research contributes to advancing the prediction of refuelling behaviour for HFVs by emphasising the integration of weather and temperature variables for more accurate and sustainable infrastructure development.

Author Contributions

Conceptualisation, N.I. and A.K.S.; methodology, N.I.; software, N.I.; validation, N.I.; formal analysis, N.I.; investigation, N.I.; resources, N.I.; data curation, N.I.; writing—original draft preparation, N.I.; writing—review and editing, A.K.S.; visualisation, N.I.; supervision, A.K.S.; project administration, A.K.S.; funding acquisition, A.K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

ARIMA    Auto-Regressive Integrated Moving Average
EDA    Exploratory Data Analysis
HFV    Hydrogen Fuel Vehicle
LightGBM    Light Gradient Boosting Machine
MAE    Mean Absolute Error
MSE    Mean Squared Error
NYC    New York County
R2    R-Squared
RMSE    Root Mean Squared Error
RNN    Recurrent Neural Networks
SML    Supervised Machine Learning
VARMA    Vector Auto-Regressive Moving Average

References

  1. IEA. CO2 Emissions from Fuel Combustion: Overview; IEA: Paris, France, 2020; Available online: https://www.iea.org/reports/co2-emissions-from-fuel-combustion-overview (accessed on 12 January 2024).
  2. Apostolou, D.; Xydis, G. A literature review on hydrogen refuelling stations and infrastructure. Current status and future prospects. Renew. Sustain. Energy Rev. 2019, 113, 109292. [Google Scholar] [CrossRef]
  3. Bethoux, O. Hydrogen fuel cell road vehicles: State of the art and perspectives. Energies 2020, 13, 5843. [Google Scholar] [CrossRef]
  4. Liu, H. Sustainable road network design considering hydrogen fuel cell vehicles. Sci. Rep. 2023, 13, 21947. [Google Scholar] [CrossRef] [PubMed]
  5. DSI. Hydrogen Society Roadmap for South Africa 2021 Securing a Clean, Affordable and Sustainable Energy. 2021. Available online: https://www.dst.gov.za/images/South_African_Hydrogen_Society_RoadmapV1.pdf (accessed on 12 January 2024).
  6. Grüger, F.; Dylewski, L.; Robinius, M.; Stolten, D. Carsharing with fuel cell vehicles: Sizing hydrogen refueling stations based on refueling behavior. Appl. Energy 2018, 228, 1540–1549. [Google Scholar] [CrossRef]
  7. Abbaas, O.; Ventura, J.A. A lexicographic optimization approach to the deviation-flow refueling station location problem on a general network. Optim. Lett. 2022, 16, 953–982. [Google Scholar] [CrossRef]
  8. Kim, J.G.; Kuby, M. The deviation-flow refueling location model for optimizing a network of refueling stations. Int. J. Hydrogen Energy 2012, 37, 5406–5420. [Google Scholar] [CrossRef]
  9. Kurtz, J.; Sprik, S.; Saur, G.; Onorato, S. Fuel Cell Electric Vehicle Driving and Fueling Behavior. 2018. Available online: https://www.nrel.gov/docs/fy19osti/73010.pdf (accessed on 12 January 2024).
  10. Martin, E.; Shaheen, S.A.; Lipman, T.; Lidicker, J.R. Behavioral Response to Hydrogen Fuel Cell Vehicles and Refueling: A Comparative Analysis of Short- and Long-Term Exposure. 2008. Available online: http://escholarship.org/uc/item/8nv3g1k3 (accessed on 12 January 2024).
  11. Isaac, N.; Saha, A.K. Analysis of refueling behavior of hydrogen fuel vehicles through a stochastic model using Markov Chain Process. Renew. Sustain. Energy Rev. 2021, 141, 110761. [Google Scholar] [CrossRef]
  12. Isaac, N.; Saha, A.K. Predicting Vehicle Refuelling Trips through Generalised Poisson Modelling. Energies 2022, 15, 6616. [Google Scholar] [CrossRef]
  13. Shin, J.; Hwang, W.; Choi, H. Can hydrogen fuel vehicles be a sustainable alternative on vehicle market?: Comparison of electric and hydrogen fuel cell vehicles. Technol. Forecast. Soc. Chang. 2019, 143, 239–248. [Google Scholar] [CrossRef]
  14. Zhao, J.; Melaina, M.W. Transition to hydrogen-based transportation in China: Lessons learned from alternative fuel vehicle programs in the United States and China. Energy Policy 2006, 34, 1299–1309. [Google Scholar] [CrossRef]
  15. Coppitters, D.; Verleysen, K.; De Paepe, W.; Contino, F. How can renewable hydrogen compete with diesel in public transport? Robust design optimization of a hydrogen refueling station under techno-economic and environmental uncertainty. Appl. Energy 2022, 312, 118694. [Google Scholar] [CrossRef]
  16. Yuksel, T.; Michalek, J.J. Effects of regional temperature on electric vehicle efficiency, range, and emissions in the united states. Environ. Sci. Technol. 2015, 49, 3974–3980. [Google Scholar] [CrossRef] [PubMed]
  17. Donkers, A.; Yang, D.; Viktorović, M. Influence of driving style, infrastructure, weather and traffic on electric vehicle performance. Transp. Res. D Transp. Environ. 2020, 88, 102569. [Google Scholar] [CrossRef]
  18. Isaac, N.; Saha, A.K. Analysis of Refueling Behavior Models for Hydrogen-Fuel Vehicles: Markov versus Generalized Poisson Modeling. Sustainability 2023, 15, 13474. [Google Scholar] [CrossRef]
  19. Faria, M.V.; Baptista, P.C.; Farias, T.L. Identifying driving behavior patterns and their impacts on fuel use. Transp. Res. Procedia 2017, 27, 953–960. [Google Scholar] [CrossRef]
  20. Henning, M.; Thomas, A.R.; Smyth, A. An Analysis of the Association between Changes in Ambient Temperature, Fuel Economy, and Vehicle Range for Battery Electric and Fuel Cell Electric Buses; Maxine Goodman Levin School of Urban Affairs: Cleveland, OH, USA, 2019; Available online: https://engagedscholarship.csuohio.edu/urban_facpub (accessed on 12 January 2024). [Google Scholar]
  21. Tayarani, H.; Karanam, V.; Nitta, C.; Tal, G. Using Machine Learning Models to Forecast Electric Vehicle Destination and Charging Event. In Proceedings of the EVS36 International Battery, Hybrid and Fuel Cell Electric Vehicle Symposium, Sacramento, CA, USA, 11–14 June 2023. [Google Scholar]
  22. Rosenberg, E.; Fidje, A.; Aamodt, K.; Stiller, C.; Mari, A.; Møller-holst, S. Market penetration analysis of hydrogen vehicles in Norwegian passenger transport towards 2050. Int. J. Hydrogen Energy 2010, 35, 7267–7279. [Google Scholar] [CrossRef]
  23. Shi, J.; Narasimhan, G. Time Series Forecasting Using Various Deep Learning Models. 2022. Available online: https://www.researchgate.net/publication/361265989 (accessed on 21 January 2024).
  24. Chand, S.P.S.; Divya, G. A Light gradient boosting machine regression model for prediction of agriculture insurance cost over linear regression. In Advances in Parallel Computing; IOS Press BV: Amsterdam, The Netherlands, 2022; pp. 200–208. [Google Scholar] [CrossRef]
  25. Liu, Q.; Wu, Y. Supervised Learning. In Encyclopedia of the Sciences of Learning; Springer: New York, NY, USA, 2012; pp. 3243–3245. [Google Scholar] [CrossRef]
  26. Kotsiantis, S.B. Supervised Machine Learning: A Review of Classification Techniques. Informatica 2007, 31, 249–268. [Google Scholar]
  27. Uddin, S.; Khan, A.; Hossain, M.E.; Moni, M.A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 2019, 19, 281. [Google Scholar] [CrossRef] [PubMed]
  28. Opoku, R.; Obeng, G.Y.; Osei, L.K.; Kizito, J.P. Optimization of industrial energy consumption for sustainability using time-series regression and gradient descent algorithm based on historical electricity consumption data. Sustainabil. Anal. Model. 2022, 2, 100004. [Google Scholar] [CrossRef]
  29. Rady, E.H.A.; Fawzy, H.; Fattah, A.M.A. Time series forecasting using tree based methods. J. Stat. Appl. Probab. 2021, 10, 229–244. [Google Scholar] [CrossRef]
  30. Antipov, A.; Pokryshevskaya, B. Optimizing layouts of initial AFV refuelling stations targeting different drivers and experiments with agent-based simulations. Eur. J. Oper. Res. 2016, 249, 22–26. [Google Scholar]
  31. George, S.; Jose, A. Generalized poisson hidden markov model for overdispersed or underdispersed count data. Rev. Colomb. Estad. 2020, 43, 71–82. [Google Scholar] [CrossRef]
  32. Murugan, A.; de Huu, M.; Bacquart, T.; van Wijk, J.; Arrhenius, K.; Ronde, I.T.; Hemfrey, D. Measurement challenges for hydrogen vehicles. Int. J. Hydrogen Energy 2019, 44, 19326–19333. [Google Scholar] [CrossRef]
  33. Bureau of Transportation Statistics (US). Trips by Distance. Available online: https://data.bts.gov/Research-and-Statistics/Trips-by-Distance/w96p-f2qv (accessed on 30 January 2024).
  34. Wunderground. New York City, NY Weather History. Available online: https://www.wunderground.com/history/monthly/us/ny/new-york-city/KLGA/date/2019-3 (accessed on 12 June 2022).
  35. Vijayakumar, V.; Jenn, A.; Fulton, L. Low Carbon Scenario Analysis of a Hydrogen-Based Energy Transition for On-Road Transportation in California. Energies 2021, 14, 7163. [Google Scholar] [CrossRef]
  36. Sema, B. Hydrogen: A brief overview on its sources, production and environmental impact. Int. J. Hydrogen Energy 2018, 43, 10605–10614. [Google Scholar] [CrossRef]
  37. Lin, R.H.; Ye, Z.Z.; Wu, B.D. A review of hydrogen station location models. Int. J. Hydrogen Energy 2020, 45, 20176–20183. [Google Scholar] [CrossRef]
  38. Moriarty, P.; Honnery, D. Prospects for hydrogen as a transport fuel. Int. J. Hydrogen Energy 2019, 44, 16029–16037. [Google Scholar] [CrossRef]
  39. Li, J. Assessing the accuracy of predictive models for numerical data: Not r nor r2, why not? Then what? PLoS ONE 2017, 12, e0183250. [Google Scholar] [CrossRef]
  40. Iorkaa, A.A.; Barma, M.; Muazu, H.G.A.A. Machine Learning Techniques, methods and Algorithms: Conceptual and Practical Insights. Int. J. Eng. Res. Appl. 2021, 11, 55–64. [Google Scholar]
  41. Badri-Koohi, B.; Tavakkoli-Moghaddam, R.; Asghari, M. Optimizing Number and Locations of Alternative-Fuel Stations Using a Multi-Criteria Approach. Eng. Technol. Appl. Sci. Res. 2019, 9, 3715–3720. [Google Scholar] [CrossRef]
  42. Kitchens, B.; Dobolyi, D.; Li, J.; Abbasi, A. Advanced Customer Analytics: Strategic Value Through Integration of Relationship-Oriented Big Data. J. Manag. Inf. Syst. 2018, 35, 540–574. [Google Scholar] [CrossRef]
  43. Li, M.; Zhang, X.; Li, G. A comparative assessment of battery and fuel cell electric vehicles using a well-to-wheel analysis. Energy 2016, 94, 693–704. [Google Scholar] [CrossRef]
  44. Joseph, V.R.; Vakayil, A. SPlit: An Optimal Method for Data Splitting. Technometrics 2022, 64, 166–176. [Google Scholar] [CrossRef]
  45. Rokach, L.; Maimon, O. Decision Trees. In Data Mining and Knowledge Discovery Handbook; Springer: New York, NY, USA, 2006; pp. 165–192. [Google Scholar] [CrossRef]
  46. Bhavsar, H.; Ganatra, A. A Comparative Study of Training Algorithms for Supervised Machine Learning. Int. J. Soft Comput. Eng. 2012, 2, 74–81. [Google Scholar]
  47. Brandić, I.; Pezo, L.; Bilandžija, N.; Peter, A.; Šurić, J.; Voća, N. Comparison of Different Machine Learning Models for Modelling the Higher Heating Value of Biomass. Mathematics 2023, 11, 2098. [Google Scholar] [CrossRef]
  48. Kumari, K.; Yadav, S. Linear regression analysis study. J. Pract. Cardiovasc. Sci. 2018, 4, 33. [Google Scholar] [CrossRef]
  49. Cutler, A.; Cutler, D.R.; Stevens, J.R. Random Forests. In Ensemble Machine Learning; Springer: New York, NY, USA, 2012; pp. 157–175. [Google Scholar] [CrossRef]
  50. Elgeldawi, E.; Sayed, A.; Galal, A.R.; Zaki, A.M. Hyperparameter tuning for machine learning algorithms used for arabic sentiment analysis. Informatics 2021, 8, 79. [Google Scholar] [CrossRef]
  51. Botchkarev, A. Performance Metrics (Error Measures) in Machine Learning Regression, Forecasting and Prognostics: Properties and Typology. arXiv 2018, arXiv:1809.03006. [Google Scholar]
  52. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
  53. Teimouri, A.; Kabeh, K.Z.; Changizian, S.; Ahmadi, P.; Mortazavi, M. Comparative lifecycle assessment of hydrogen fuel cell, electric, CNG, and gasoline-powered vehicles under real driving conditions. Int. J. Hydrogen Energy 2022, 47, 37990–38002. [Google Scholar] [CrossRef]
  54. Elmes, A.; Alemohammad, H.; Avery, R.; Caylor, K.; Eastman, J.R.; Fishgold, L.; Friedl, M.A.; Jain, M.; Kohli, D.; Bayas, J.C.L.; et al. Accounting for training data error in machine learning applied to earth observations. Remote Sens. 2020, 12, 1034. [Google Scholar] [CrossRef]
  55. Lee, H.; Kim, A.; Lee, A.; Lee, B.; Lim, H. Optimized H2 fueling station arrangement model based on total cost of ownership (TCO) of fuel cell electric vehicle (FCEV). Int. J. Hydrogen Energy 2021, 46, 34116–34127. [Google Scholar] [CrossRef]
  56. Zemo Partnership. Hydrogen Vehicle Well-to-Wheel GHG and Energy Study; Zemo Partnership: London, UK, 2021. [Google Scholar]
  57. Fayyazi, M.; Sardar, P.; Thomas, S.I.; Daghigh, R.; Jamali, A.; Esch, T.; Kemper, H.; Langari, R.; Khayyam, H. Artificial Intelligence/Machine Learning in Energy Management Systems, Control, and Optimization of Hydrogen Fuel Cell Vehicles. Sustainability 2023, 15, 5249. [Google Scholar] [CrossRef]
  58. Lin, R.; Ye, Z.; Guo, Z.; Wu, B. Hydrogen station location optimization based on multiple data sources. Int. J. Hydrogen Energy 2020, 45, 10270–10279. [Google Scholar] [CrossRef]
Figure 1. Simple decision tree.
Figure 2. Application of decision tree.
Figure 3. Overall model framework.
Figure 4. Code model.
Figure 5. Lowest MAE at the split variable of 91.
Figure 6. Linear regression: actual versus predicted at MAE 3236.668.
Figure 7. Lowest MAE at split variable 158.
Figure 8. Decision tree: actual versus predicted at MAE of 3281.437.
Figure 9. Lowest MAE at the split variable of 160.
Figure 10. Random forest: actual versus predicted at MAE of 2735.608.
Figure 11. Lowest MAE at split variable 123.
Figure 12. XGBoost: actual versus predicted at MAE of 3295.087.
Figure 13. Lowest MAE at split variable 111.
Figure 14. LightGBM: actual versus predicted at MAE of 2758.255.
Figure 15. LightGBM model training framework.
Figure 16. Different validation parameters essential to the prediction model.
Figure 17. Predicted counts for 2020 using LightGBM.
Table 1. Sample of combined data.

Date           | Temperature (High) °C | Temperature (Low) °C | Precipitation (mm) | Trip Count
1 January 2019 | 15.6                  | 5.6                  | 34.75              | 23,921
2 January 2019 | 5                     | 1.7                  | 0                  | 20,922
3 January 2019 | 7.2                   | 3.9                  | 0                  | 19,167
4 January 2019 | 8.3                   | 2.8                  | 0                  | 20,500
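A table of this form can be produced by joining daily trip counts with daily weather observations on the date. The following sketch illustrates one way to do so in pandas; the file and column names are assumed for illustration only.

import pandas as pd

# Hypothetical source files: daily trip counts and daily weather records.
trips = pd.read_csv("trips_by_distance.csv", parse_dates=["date"])
weather = pd.read_csv("nyc_weather_2019.csv", parse_dates=["date"])

combined = trips.merge(weather, on="date", how="inner")
combined = combined[["date", "temp_high", "temp_low", "precipitation", "trip_count"]]
print(combined.head(4))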
Table 2. Performance model metrics.

Performance Metrics | Linear Regression | Decision Tree | Random Forest | XGBoost       | LightGBM
MAE                 | 3236.668          | 4114.83       | 2735.608      | 3295.087      | 2758.255
MSE                 | 21,649,217.49     | 36,885,695.27 | 16,990,542.55 | 22,882,610.93 | 16,420,069.79
RMSE                | 4652.87           | 6073.35       | 4121.95       | 4783.57       | 4052.16
R-squared           | 0.34              | −0.11         | 0.509         | 0.32          | 0.503
Std Deviation       | 4620.32           | 5493.01       | 4118.40       | 4607.45       | 3667.53
Error %             | 24.46             | 29.06         | 21.79         | 24.38         | 19.40
