Article

Enhancing Road Freight Price Forecasting Using Gradient Boosting Ensemble Supervised Machine Learning Algorithm

1
Department of Packaging and Logistics Processes, Institute of Quality Science and Product Management, Krakow University of Economics, 27 Rakowicka St., 31-510 Krakow, Poland
2
Department of Transport Systems, Traffic Engineering and Logistics, Faculty of Transport and Aviation Engineering, Silesian University of Technology, 8 Krasińskiego St., 40-019 Katowice, Poland
*
Authors to whom correspondence should be addressed.
Mathematics 2025, 13(18), 2964; https://doi.org/10.3390/math13182964
Submission received: 28 July 2025 / Revised: 5 September 2025 / Accepted: 6 September 2025 / Published: 12 September 2025
(This article belongs to the Special Issue Evolutionary Machine Learning for Real-World Applications)

Abstract

For effective logistics planning and pricing strategies, it is essential to predict road freight transportation costs accurately. Using a real-world dataset with 45,569 freight offers and 52 different variables, including financial, logistical, geographical, and temporal characteristics, this study presents a data-driven method for forecasting transport prices. To create a strong predictive model, the approach combines hyperparameter optimization, evolutionary feature selection, and extensive feature engineering. Because gradient boosting works well for modelling intricate, nonlinear relationships, it was used as the main algorithm. Temporal dependencies were maintained through a nested cross-validation framework with a time-series split, which improved the generalizability of the model. With a mean absolute percentage error (MAPE) of 6.27%, the model showed excellent predictive accuracy. Key predictive factors included total transport distance, load and delivery quantities, temperature constraints, and aggregated categorical features such as route and vehicle type. The results confirm that evolutionary algorithms are capable of efficiently optimizing model parameters, as well as feature subsets, greatly enhancing interpretability and performance. In the freight logistics industry, this method offers useful insights for operational and dynamic pricing decision-making. This model may be expanded in future research to include external data sources and investigate its suitability for use in various geographic locations and modes of transportation.

1. Introduction

1.1. Background and Motivation

In the context of international supply chain optimization, transportation logistics, and economic planning, accurate road freight price forecasting has become increasingly important. Given that transport efficiency is the most important indicator to improve supply chain performance [1], ineffective pricing strategies can have a substantial negative influence on shippers’ and carriers’ profitability and competitiveness. Changing market conditions, including fuel price changes, alterations in toll policies, and fluctuations in the demand for transport services, greatly affect transport costs [2,3]. Due to the intricate interaction among macroeconomic elements, fuel prices, driver shortages, seasonal variations, and geopolitical uncertainties, conventional forecasting techniques frequently fall short in addressing the dynamic and nonlinear [4,5] characteristics of freight pricing trends [6].
Traditionally, forecasting freight rates has depended on statistical and econometric models, such as ARIMA [7,8], exponential smoothing, and regression techniques [9]. Although these models provide interpretability and work well in stable settings, they face challenges in handling volatility, high dimensionality, and variable interactions typical in road freight markets [10]. With the growing accessibility of freight data from digital freight platforms, GPS, electronic logging devices (ELDs), and telematics, there is now a chance to implement data-informed models that can uncover more profound patterns and connections. The increasing complexity fuels the need for data-driven analytical tools that facilitate more accurate price forecasting. Therefore, according to [11,12], Artificial Intelligence (AI)-based methods provide better results in freight rate forecasting.
A significant challenge within the transport sector is increasing competition and the pressure to lower expenses. Transport companies need to not only align their prices with present market conditions but also guarantee operational efficiency. Fuel costs, specifically, are crucial and account for a significant share of operating expenses. Global factors causing price fluctuations can greatly influence the profitability of transportation companies [13]. Consequently, predictions of fuel prices and additional outside factors are essential for maintaining financial stability [14].
The advancement of analytical tools and machine learning algorithms presents fresh possibilities for predicting prices in the transportation industry. Algorithms for machine learning, like XGBoost (version 2.1.3), facilitate the analysis of intricate relationships among variables, leading to improved price predictions [15]. The implementation of these tools in the transportation sector could greatly enhance operational efficiency and expense control.

1.2. Research Objectives and Significance

The aim of this research is to create a predictive model that allows for accurate forecasting of road freight transport prices. The study examines factors such as distance, load and delivery quantities, cargo value, customs regulations, and varying operational expenses. A further aim is to evaluate how well various machine learning algorithms predict transportation prices and to examine their possible influence on the operational efficiency of transport businesses.
The importance of this research lies in creating contemporary techniques for predicting road transport prices, which can offer substantial advantages to businesses within the transport industry. In a fast-changing market landscape, where operational expenses are affected by various factors like fuel costs, tolls, and demand changes, conventional forecasting techniques have proven inadequate. Utilizing sophisticated machine learning algorithms can greatly enhance the accuracy of forecasts, resulting in improved cost management and more informed business choices. By utilizing more precise predictions for transport costs, transport firms can optimize their operations, enhance resource utilization, and modify service rates based on prevailing market trends, potentially leading to greater competitiveness. Additionally, the results of this research could aid in the creation of novel analytical tools that may be utilized not only in road transport but also across other logistics and supply chain management fields. The knowledge acquired from this research could assist transport industry decision-makers in grasping evolving cost trends, aiding in the formulation of long-term strategies that promote financial stability and sustainable growth for transport companies.

1.3. Research Gap

Understanding the significance of features is crucial for forecasting freight rates since these variables represent the main elements that affect pricing throughout logistics networks. Freight rates are influenced by a complex mix of factors, including distance, cargo weight and volume, mode of transport, fuel costs, seasonal patterns, and the starting and ending locations. By recognizing and integrating these attributes into predictive models, companies can obtain a better understanding of the reasons behind rate changes and enhance the precision of their predictions. Robust, pertinent attributes enable machine learning models to identify significant patterns, adjust to market fluctuations, and provide accurate forecasts. This degree of precision is essential not only for predicting costs but also for making knowledgeable choices regarding routing, carrier selection, and contract discussions. Moreover, examining feature significance can offer wider business insights, like pinpointing cost factors, grasping how economic conditions influence logistics, and discovering inefficiencies within the supply chain. In summary, features form the core of a successful freight rate prediction system, facilitating both operational effectiveness and strategic planning.
Road freight rates fluctuate and are influenced by a complicated mix of both internal and external elements. The internal features relate directly to the operational specifics and decisions of the individual freight carrier, while the external features are broader market forces and external conditions that impact the entire freight industry. Distinguishing these influences is essential for companies engaged in shipping and logistics, as they directly affect transportation expenses and supply chain effectiveness. Table 1 presents a comprehensive list of key features influencing road freight rates, along with an explanation and the reference of the investigated studies, based on the total freight transportation costs classification presented in [16,17].
In our empirical analysis, we incorporated a wide set of features grouped into thematic categories reflecting the key determinants of freight rates. These included the following:
  • Distance-related features, such as total route length and national breakdowns.
  • Cargo-related features, including weight, volume, and temperature requirements.
  • Temporal features, such as loading and delivery dates, weekdays, months, and time windows.
  • Operational features, including the number of loading and unloading points, vehicle and body types, and service-specific conditions.
  • Geographical and route characteristics, covering origin–destination relations, route types, and country-specific statistics.
  • Derived and aggregate indicators, such as price per kilometre, mean route rates, and seasonally averaged values.
A variety of machine learning methods have been investigated and utilized for automated forecasting tasks in previous research across different fields. Among these, Multiple Kernel Learning (MKL) methods have been utilized successfully, providing strong frameworks for combining diverse data sources [42]. Using Bayesian networks for forecasting [43] offers a framework for probabilistic graphical modelling that can manage uncertainty in complex systems. In the Gradient Boosting Machines (GBMs) used in [44], prediction accuracy was achieved via ensemble learning, while in [45], k-nearest neighbour (k-NN) methods that depend on instance-based learning were examined.
A particularly noteworthy technique in the realm of fuzzy modelling is fuzzy linear regression, where regression coefficients are treated as fuzzy symmetric triangular numbers. The estimation of these coefficients involves solving corresponding linear programming problems, which are often addressed using machine learning algorithms. These methods are typically implemented in Python (version 3.12.7), leveraging its extensive ecosystem of optimization and machine learning libraries. The effectiveness of the resulting models is generally verified using control samples, and, in most cases, their adequacy has been confirmed through empirical validation. Table 2 summarizes the latest research on the application of machine learning in freight transportation, highlighting the ML models used, the specific areas of focus, and their key contributions. These studies underscore the critical role of accurate freight rate prediction, leveraging diverse machine learning models and approaches to address specific challenges and forecast different aspects of freight logistics.
Although numerous machine learning methods have been effectively utilized for forecasting issues in transportation (especially maritime) and logistics, their usage continues to be focused on sectors such as maritime shipping, urban logistics, and connected vehicle systems. Moreover, fuzzy linear regression, especially when utilizing fuzzy symmetric triangular number coefficients, has proven effective in addressing uncertainty in transportation forecasting tasks. Nonetheless, a comprehensive examination of the current literature uncovers a notable deficiency: the use of fuzzy linear regression and similar machine learning techniques for forecasting freight rates in road transportation has not been adequately investigated or recorded.
Considering the dynamic and frequently unstable characteristics of road freight markets, impacted by regional demand variations, fuel prices, regulatory shifts, and infrastructure limitations, there is an urgent requirement for strong, flexible, and understandable forecasting models in this area. The lack of these models provides a theoretical and practical research opportunity to apply fuzzy regression-based machine learning methods to road freight rate forecasting, thus tackling an overlooked area with significant economic and operational implications.
This research focused on creating and assessing predictive models for forecasting prices in road freight transport, a field with which the authors are already familiar. Preliminary research activities were initiated as part of a PhD thesis [56] and were later extended in earlier scientific articles [57,58]. To address the research gap, in this paper, we applied a Gradient Boosting algorithm, enhanced by evolutionary feature selection and hyperparameter optimization. This approach enabled us to effectively model complex relationships in road freight pricing data and systematically select the most predictive subset of variables.

1.4. Manuscript Structure

The manuscript is structured in the following way: Section 1 provides the introduction, detailing the context and rationale for the research, followed by a review of the pertinent literature. It highlights current research gaps and outlines the primary goals of the study, finishing with an overview of the manuscript’s organization. Section 2 outlines the materials and methods used in the study. It provides a comprehensive overview of the data used and discusses the research methods, including data gathering and machine learning approaches employed for forecasting freight rates. Section 3 displays the findings of the research, highlighting the effectiveness of the models and the essential discoveries concerning freight rate prediction trends. Section 4 presents a discussion of the findings, situating them within the framework of earlier studies, analysing their significance, and emphasizing both the advantages and drawbacks of the method. Section 5 concludes the manuscript by summarizing key findings, highlighting the contributions of the research, and offering recommendations for future studies. This structure ensures a transparent and logical progression from defining the issue to drawing actionable conclusions.

2. Materials and Methods

This section outlines the methodological foundation of the study. First, the dataset and its characteristics are introduced, including the scope, structure, and preprocessing steps. Next, the applied modelling techniques are described in detail, with particular attention to feature engineering, evolutionary feature selection, and hyperparameter optimization. Finally, the evaluation protocol is presented, including the chosen error metrics and validation strategy. Together, these elements provide a transparent framework for replicating the study and for assessing the robustness of the proposed forecasting approach.

2.1. Data

The dataset utilized in this study consists of 45,569 records obtained from publicly available transport exchanges related to road freight transportation offers. It encompasses a comprehensive range of 52 variables, including quantitative, categorical, and temporal data. The specific types of variables and their corresponding data types are detailed in Table 3. These variables provide crucial insights into aspects such as geographical distribution, logistical characteristics, temporal details, and transportation costs.

2.2. Software

The research was conducted using Python, a versatile programming language widely recognized for its effectiveness in data science and machine learning tasks [59]. The primary environment for data analysis and modelling was Jupyter Notebook (version 7.2.2) on Windows (version Windows-11-10.0.26100-SP0), chosen for its interactive and intuitive interface, which enables efficient code execution, documentation, and visualization of the results [60].
Several specialized Python libraries significantly facilitated the research process. Data manipulation and analysis were performed using Pandas (version 2.2.2) [61] and NumPy (version 1.26.4) [62], two fundamental libraries providing powerful data structures and efficient operations on large datasets. Visualizations, essential for exploratory data analysis and the interpretation of modelling outcomes, were created using the Matplotlib library (version 3.9.2) [63]. Additionally, the Holidays library was employed to incorporate national holidays into temporal analyses, thus enriching the contextual relevance of time-based features [64].

2.3. Research Workflow

Figure 1 illustrates the overall research workflow, covering all analytical stages from data collection to result interpretation.
The first stage involved data collection (Stage 1: Data collection). A comprehensive dataset comprising 45,569 observations of transport orders was gathered. The second stage (Stage 2: Data preprocessing) involved preliminary data processing, including error correction, inconsistency removal, and handling of missing values.
The next step (Stage 3: Feature engineering) consisted of extracting essential information and creating new variables such as total distance (TOTAL_KM) and price per kilometre (PRICE_PER_KM). In the fourth stage (Stage 4: Evolutionary feature selection), an evolutionary algorithm was employed to select the most significant features using cross-validation.
In the fifth stage (Stage 5: Hyperparameter optimization), the hyperparameters of the model were optimized through an evolutionary approach, enabling the identification of optimal settings that improved predictive performance. The sixth stage (Stage 6: Model training and evaluation) included training and validating the final model, assessing its effectiveness.
The final stage (Stage 7: Interpretation and conclusions) involved interpreting the results and deriving practical conclusions from the research, also highlighting potential directions for future work.

2.4. Missing Values

Figure 2 presents the distribution of missing values across variables. Missing values in the variables “TEMP_MIN” and “TEMP_MAX” are justified due to the nature of the goods transported, as only some products require controlled temperature. Missing values in “DOCUMENTS_BY” indicate that the preferred method of delivering documents (post or email) was often not recorded. The timestamps (“START_LOAD_TIME”, “END_LOAD_TIME”, “START_DELIVERY_TIME”, and “END_DELIVERY_TIME”) contain gaps arising from the incomplete registration of loading and unloading events. Missing values in “CARGO_TYPE” and “GOODS_TYPE” result from occasional incomplete shipment documentation, while “PAYMENT_TERM” and “TIME_OF_ENTRY” also exhibit gaps due to incomplete or unregistered entries. The identified missing values will be addressed after feature engineering, as explained in later sections.
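A missingness audit like the one behind Figure 2 can be produced with a couple of pandas calls. The sketch below uses a hypothetical mini-frame with some of the column names discussed above; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mimicking a few columns with recorded gaps.
df = pd.DataFrame({
    "TEMP_MIN": [2.0, np.nan, np.nan, 4.0],
    "TEMP_MAX": [8.0, np.nan, np.nan, 6.0],
    "DOCUMENTS_BY": ["email", None, "post", None],
})

# Count missing values per column and rank them, as visualized in Figure 2.
missing = df.isna().sum().sort_values(ascending=False)

# Share of missing values per column, in percent.
missing_pct = (df.isna().mean() * 100).round(1)
```

On the full dataset, the same two calls reproduce the per-variable counts plotted in Figure 2.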

2.5. Feature Engineering

2.5.1. Feature Extraction

Feature extraction plays an important role in preparing data for evolutionary machine learning models. To improve the models’ predictive performance and simplify analysis, several new features were carefully created from the original dataset. The dataset that we used contains 45,569 observations from real-world road freight transport orders, with each observation characterized by various numerical, categorical, and temporal attributes. A clear, step-by-step overview of the entire feature extraction process is presented in Figure 3.
We began by transforming the date-related fields (START_LOAD_DATE, END_LOAD_DATE, START_DELIVERY_DATE, and END_DELIVERY_DATE) into datetime objects. From these fields, we further extracted additional time-related information such as year, month, week number, weekday, day of the month, and day of the year. These derived features help in identifying seasonal patterns and periodic trends within the freight data.
We also created two important numerical features, TOTAL_PRICE and TOTAL_KM. TOTAL_PRICE was calculated by summing the basic transport cost in EUR with additional costs (OTHER_COSTS), providing a complete view of the transport price. TOTAL_KM was derived by summing up the distances travelled across various European countries involved in each transport order.
Lastly, to provide a practical efficiency metric useful for business analysis and modelling, we introduced PRICE_PER_KM. This metric represents the ratio of TOTAL_PRICE to TOTAL_KM, allowing for insights into the cost-effectiveness and operational efficiency of each transport order.
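The extraction steps above can be sketched in pandas. Column names follow the text where given; the base-price column EUR and the per-country *_KM names are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the columns named in the text; values are illustrative only.
df = pd.DataFrame({
    "START_LOAD_DATE": ["2024-03-04", "2024-07-15"],
    "EUR": [700.0, 950.0],          # basic transport cost (assumed column name)
    "OTHER_COSTS": [50.0, 0.0],
    "DE_KM": [300.0, 0.0],          # per-country distance columns (assumed names)
    "PL_KM": [200.0, 450.0],
})

# Datetime decomposition used for seasonal/periodic features.
dt = pd.to_datetime(df["START_LOAD_DATE"])
df["LOAD_MONTH"] = dt.dt.month
df["LOAD_WEEKDAY"] = dt.dt.weekday
df["LOAD_WEEK"] = dt.dt.isocalendar().week.astype(int)
df["LOAD_DOY"] = dt.dt.dayofyear

# Derived numerical features described in the text.
km_cols = [c for c in df.columns if c.endswith("_KM")]
df["TOTAL_KM"] = df[km_cols].sum(axis=1)
df["TOTAL_PRICE"] = df["EUR"] + df["OTHER_COSTS"]
df["PRICE_PER_KM"] = df["TOTAL_PRICE"] / df["TOTAL_KM"]
```

The same three derived columns (TOTAL_KM, TOTAL_PRICE, PRICE_PER_KM) are the ones referenced throughout the remainder of the paper.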

2.5.2. Feature Transformation and Imputation

The transformations described in Table 4 contributed significantly to preparing the dataset for accurate modelling by ensuring data consistency and interpretability. Initially, we converted raw kilometre data (from columns such as AT_KM to SK_KM) into percentages of the total distance travelled, simplifying the comparison of transport routes across various regions. Additionally, to manage gaps in our data, we imputed missing temperature values (TEMP_MIN and TEMP_MAX) using realistic observed minimum and maximum values. For features like payment terms (PAYMENT_TERM) and other numeric variables, we applied mean-based imputation to preserve dataset integrity without introducing substantial biases. These systematic adjustments enhanced data quality, establishing a robust foundation for subsequent modelling tasks.
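These transformations can be sketched as follows; the imputation constants fall out of the toy data, and the columns shown are a small subset of the real ones.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "AT_KM": [100.0, 0.0],
    "SK_KM": [300.0, 500.0],
    "TEMP_MIN": [np.nan, 2.0],
    "TEMP_MAX": [np.nan, 8.0],
    "PAYMENT_TERM": [30.0, np.nan],
})

# 1) Convert per-country kilometres into percentage shares of the route.
km_cols = ["AT_KM", "SK_KM"]
total = df[km_cols].sum(axis=1)
df[km_cols] = df[km_cols].div(total, axis=0) * 100

# 2) Impute missing temperature bounds with observed extremes
#    (an unconstrained load is treated as tolerating the full range).
df["TEMP_MIN"] = df["TEMP_MIN"].fillna(df["TEMP_MIN"].min())
df["TEMP_MAX"] = df["TEMP_MAX"].fillna(df["TEMP_MAX"].max())

# 3) Mean-based imputation for remaining numeric gaps.
df["PAYMENT_TERM"] = df["PAYMENT_TERM"].fillna(df["PAYMENT_TERM"].mean())
```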

2.5.3. Aggregation

In the aggregation step, we created new features to simplify and generalize information from categorical and time-based variables. Table 5 provides a clear summary of how we aggregated these features, detailing the original variables, the aggregation methods used, and the resulting new features.
For categorical variables, such as loading and delivery points (COD_LP and COD_DP), route type, vehicle and body types, loading/unloading methods, types of goods and cargo, and document delivery methods, we calculated the average price per kilometre (PRICE_PER_KM) for each category separately. This resulted in new features representing typical transport costs per kilometre associated with each specific category.
Additionally, we aggregated time-related variables by first rounding loading and delivery start and end times to the nearest hour and then computing the average PRICE_PER_KM for each hourly interval. This approach smoothed minor variations and highlighted broader hourly trends in transportation costs.
These aggregated features simplified the dataset and captured essential insights, improving the effectiveness of subsequent predictive modelling.
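Both aggregation patterns can be expressed with pandas groupby/transform. The toy values below are illustrative; the real dataset has many more categories and rows.

```python
import pandas as pd

df = pd.DataFrame({
    "VEHICLE_TYPE": ["tilt", "reefer", "tilt", "reefer"],
    "START_LOAD_TIME": ["07:55", "08:10", "08:02", "14:40"],
    "PRICE_PER_KM": [1.0, 1.6, 1.2, 1.8],
})

# Category-level aggregation: mean price/km per vehicle type,
# broadcast back to every row as a new feature.
df["VEHICLE_TYPE_MEAN_PPK"] = (
    df.groupby("VEHICLE_TYPE")["PRICE_PER_KM"].transform("mean")
)

# Time aggregation: round loading times to the nearest hour,
# then average the price/km within each hourly interval.
hours = pd.to_datetime(df["START_LOAD_TIME"], format="%H:%M").dt.round("h").dt.hour
df["LOAD_HOUR_MEAN_PPK"] = df.groupby(hours)["PRICE_PER_KM"].transform("mean")
```

Using transform (rather than a plain groupby mean) keeps the result row-aligned, so the aggregated value can be attached directly as a model feature.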

2.6. Modelling

The modelling process focused on building an accurate predictive model for road freight transport costs. We employed a Gradient Boosting algorithm, known for its robustness and high performance in regression tasks, particularly suitable for datasets with complex, nonlinear relationships [65]. Gradient Boosting constructs an ensemble of decision trees in a stage-wise manner, where each tree is fitted to the residuals of the previous ones in order to minimize the chosen loss function. Regularization mechanisms are applied to control model complexity and prevent overfitting. In general, the objective minimized by Gradient Boosting can be written in the following regularized form [66]:
$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \frac{1}{2}\sum_{k} \lambda \lVert w_k \rVert^2$$
where $l(\hat{y}_i, y_i)$ represents the prediction error for observation $i$, and the second component $\frac{1}{2}\sum_{k} \lambda \lVert w_k \rVert^2$ denotes the regularization term, with $\lambda$ as a regularization parameter and $w_k$ as the weights assigned to the model’s features.
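The stage-wise principle, each new tree fitted to the residuals of the ensemble so far, can be seen in a toy boosting loop with regression stumps. This is a pure-NumPy sketch on synthetic data (the shrinkage factor nu and 50 rounds are arbitrary choices, and real implementations such as XGBoost additionally apply the regularization term above).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + rng.normal(0, 0.5, size=x.size)   # noisy linear target

def fit_stump(x, y):
    """Best single-split regression stump: threshold + two leaf means."""
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = y[x <= t], y[x > t]
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((y - pred) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

# Stage-wise boosting: each stump is fitted to the current residuals
# and added with a shrinkage (learning-rate) factor nu.
nu, pred = 0.5, np.full_like(y, y.mean())
for _ in range(50):
    resid = y - pred
    t, lm, rm = fit_stump(x, resid)
    pred += nu * np.where(x <= t, lm, rm)
```

After 50 rounds the ensemble's training error is well below the variance of the raw target, which is exactly the residual-reduction mechanism the paragraph describes.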
To rigorously evaluate model performance and ensure reliable generalization, we utilized nested cross-validation (Nested CV) with a TimeSeriesSplit strategy. This method provided a realistic assessment by maintaining temporal dependencies within the data [67].
The primary evaluation metric chosen was the mean absolute percentage error (MAPE), which offers an intuitive measure of prediction accuracy, especially beneficial in business applications, as it directly conveys error magnitude in percentage terms.
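The temporal splitting can be illustrated with a minimal hand-rolled generator in the spirit of scikit-learn's TimeSeriesSplit (a simplified sketch; the study used the scikit-learn implementation, which also handles remainders when the sample size is not divisible).

```python
import numpy as np

def time_series_splits(n_samples, n_splits):
    """Expanding-window splits: each fold trains on all earlier
    observations and tests on the next contiguous block, so the
    temporal order of freight offers is never violated."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        yield np.arange(0, k * fold), np.arange(k * fold, (k + 1) * fold)

splits = list(time_series_splits(n_samples=10, n_splits=4))
```

In nested CV, an outer loop of such splits measures generalization while an inner loop (on the training block only) tunes features and hyperparameters.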

2.7. Evolutionary Feature Selection

Feature selection was an integral part of the modelling process, implemented with a Genetic Algorithm (GA) from the DEAP library (deap.algorithms.eaSimple) [68], which applies a population-based search with tournament selection, uniform crossover, and bit-flip mutation. Modules such as base, creator, tools, and algorithms were used to manage and optimize feature subsets effectively. The search evolved a population of 20 individuals over 10 generations, with a crossover probability of 0.6, a mutation probability of 0.3, tournament selection (size = 3), and uniform crossover with a gene-level probability of 0.5. The step-by-step procedure of the evolutionary feature selection is summarized in Algorithm 1.
Algorithm 1. Evolutionary feature selection
Input: Dataset D = (X, y); population size N; generations G; crossover prob. pc; mutation prob. pm; tournament size k; uniform gene prob. pu; regularization α
Output: Best feature subset S*
1:  Initialize population P with N binary masks over features of X
2:  For each individual z ∈ P do
3:         Decode subset S(z) ← indices where mask bit = 1
4:         Train model on features S(z) using the chosen CV protocol
5:         Compute fitness F(z) ← MAPE(S(z)) + α · |S(z)|
6:  end for
7:  for t = 1 to G do
8:         Parents ← TournamentSelection(P, size = k)
9:         Offspring ← ∅
10:        while |Offspring| < N do
11:               Select (p1, p2) from Parents
12:               With prob. pc: (c1, c2) ← UniformCrossover(p1, p2, gene prob. = pu)
13:               Else: (c1, c2) ← (p1, p2)
14:               Mutate c1, c2 by bit-flip with prob. pm per gene
15:               Repair c1, c2 if needed (e.g., ensure at least one feature selected)
16:               Evaluate F(c1), F(c2) as in lines 3–5
17:               Add c1, c2 to Offspring
18:        end while
19:        P ← Elitism(P ∪ Offspring, keep N by best fitness)
20: end for
21: Return S* ← S(argmin_z F(z))
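A compact pure-Python rendering of Algorithm 1 follows. The study itself used DEAP's eaSimple; here the fitness is a toy stand-in for the cross-validated MAPE (a count of missed "informative" features), and the feature indices and penalty α are illustrative only.

```python
import random

random.seed(42)

N_FEATURES = 8
INFORMATIVE = {0, 2, 5}   # toy ground truth: only these features help
ALPHA = 0.01              # parsimony penalty per selected feature

def fitness(mask):
    """Stand-in for line 5 of Algorithm 1: error proxy + alpha * |S|."""
    selected = {i for i, b in enumerate(mask) if b}
    error = len(INFORMATIVE - selected)       # proxy for CV MAPE
    return error + ALPHA * len(selected)

def tournament(pop, k=3):
    return min(random.sample(pop, k), key=fitness)

def uniform_crossover(p1, p2, pu=0.5):
    c1, c2 = p1[:], p2[:]
    for i in range(len(p1)):
        if random.random() < pu:
            c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

def mutate(mask, pm=0.3):
    m = [b ^ 1 if random.random() < pm else b for b in mask]
    return m if any(m) else [1] + m[1:]       # repair: keep >= 1 feature

def evolve(n=20, gens=10, pc=0.6):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(n)]
    for _ in range(gens):
        offspring = []
        while len(offspring) < n:
            p1, p2 = tournament(pop), tournament(pop)
            c1, c2 = uniform_crossover(p1, p2) if random.random() < pc else (p1[:], p2[:])
            offspring += [mutate(c1), mutate(c2)]
        pop = sorted(pop + offspring, key=fitness)[:n]   # elitism, keep best N
    return min(pop, key=fitness)

best = evolve()
```

The repair step in mutate mirrors line 15 of Algorithm 1 (at least one feature must remain selected), and the sorted-truncation in evolve mirrors the elitist replacement of line 19.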
The predictive accuracy of the models was evaluated using the mean absolute percentage error (MAPE), a standard error metric widely applied in forecasting. The MAPE is defined as
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|$$
where $A_t$ and $F_t$ denote the actual and forecast values, respectively.
In addition to the MAPE, we also report the symmetric mean absolute percentage error (sMAPE), which modifies the denominator to be the mean of the actual and forecasted values. This formulation symmetrically penalizes both over- and under-predictions and ensures that the error metric is bounded between 0% and 200%. As Hyndman and Koehler (2006) emphasize, “sMAPE treats positive and negative forecast errors equally, thereby providing a more balanced and interpretable measure of forecast accuracy” [69]. Formally, the sMAPE is defined as
$$\mathrm{sMAPE} = \frac{100\%}{n}\sum_{t=1}^{n}\frac{\lvert F_t - A_t \rvert}{(\lvert A_t \rvert + \lvert F_t \rvert)/2}$$
The penalized fitness used in the evolutionary search reflects the principle found in both penalized regression frameworks and Pareto-based multi-objective evolutionary feature selection: maximize predictive performance while minimizing the number of features. In this study, the evolutionary feature selection approach substantially reduced dimensionality, focusing the model on the most informative features. This not only improved predictive accuracy but also enhanced interpretability by simplifying the set of predictors. The approach ensured that the selected features were stable across validation folds, providing a robust foundation for subsequent analysis and discussion.
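Both error metrics can be implemented in a few lines of NumPy, directly following their definitions above.

```python
import numpy as np

def mape(a, f):
    """Mean absolute percentage error, in percent."""
    a, f = np.asarray(a, float), np.asarray(f, float)
    return 100.0 * np.mean(np.abs((a - f) / a))

def smape(a, f):
    """Symmetric MAPE: the denominator is the mean of |A_t| and |F_t|,
    which bounds the metric to the range [0%, 200%]."""
    a, f = np.asarray(a, float), np.asarray(f, float)
    return 100.0 * np.mean(np.abs(f - a) / ((np.abs(a) + np.abs(f)) / 2))
```

Note the practical difference: MAPE is undefined when an actual value is zero and penalizes over-forecasts on small actuals heavily, while sMAPE remains bounded even in those cases.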

2.8. Evolutionary Hyperparameter Optimization

Hyperparameter optimization was carried out with a Genetic Algorithm (GA) implemented in the DEAP framework (deap.algorithms.eaSimple; modules base, creator, tools, and algorithms) [70]. Each individual encoded a candidate hyperparameter vector, and evolution proceeded via tournament selection, crossover, and mutation. This approach enables systematic exploration of large, mixed-type, and conditionally constrained search spaces while controlling model complexity.
Let $\theta \in \Theta$ denote the hyperparameter vector of the predictive model and $S$ the feature subset selected in Section 2.7. The step-by-step procedure of the evolutionary hyperparameter optimization is summarized in Algorithm 2.
Algorithm 2. Evolutionary hyperparameter optimization
Input: Feature subset S; search space H; population size N; generations G;
   crossover probability pc; mutation probability pm; tournament size k;
   per-gene mutation probability pu
Output: Best hyperparameter vector θ*
1:  Initialize population P with N random hyperparameter vectors sampled from H
2:  For each individual θ ∈ P:
3:         Train model with parameters θ on S using CV respecting temporal order
4:         Compute penalized validation loss F(θ) = E_cv(θ, S) + λ · C(θ)
5:  End for
6:  For t = 1 to G do:
7:         Parents ← TournamentSelection(P, size = k)
8:         Offspring ← ∅
9:         While |Offspring| < N do:
10:               Select (θ1, θ2) from Parents
11:               With prob. pc: perform crossover to get (φ1, φ2), else copy parents
12:               Mutate φ1, φ2 per gene with prob. pu (Gaussian for real-valued, step for integer, resample for categorical)
13:               Repair infeasible parameters (respect constraints in H)
14:               Evaluate F(φ1), F(φ2) as in lines 3–4
15:               Add φ1, φ2 to Offspring
16:        End while
17:        P ← Elitism(P ∪ Offspring, keep N best by F)
18:  End for
19:  Return θ* ← argmin_θ F(θ)
We optimized a penalized validation loss:
$$\min_{\theta \in \Theta} \; \bar{L}(\theta; S) + \lambda\, C(\theta),$$
where:
  • $\bar{L}(\theta; S)$ is the cross-validated error;
  • $C(\theta)$ is a capacity/complexity proxy;
  • $\lambda > 0$ balances accuracy and parsimony.
The space $\Theta$ includes continuous, integer, and categorical parameters. Individuals are encoded per gene type (real, integer, and categorical). Conditional dependencies between parameters are enforced via a feasibility repair step that projects offspring back onto $\Theta$.
Each candidate $\theta$ is evaluated within a unified pipeline (from preprocessing to model). All preprocessing steps are fitted exclusively on the training folds to prevent leakage. Cross-validation is used to compute $\bar{L}$; where applicable, the splitting strategy respects temporal ordering. Evaluation caching avoids recomputation for duplicate configurations, and all random seeds are fixed for reproducibility.
We apply uniform crossover and per-gene mutation (Gaussian for real-valued genes; integer step for count genes; and random re-sampling for categorical genes), followed by feasibility repair. Selection is elitist (e.g., tournament), and the run terminates upon exhausting a pre-set evaluation/generation budget or on stagnation of the best fitness within a patience window. This evolutionary procedure complements Section 2.7 by jointly promoting predictive accuracy and model parsimony through explicit penalization of complexity.
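The per-gene operators for a mixed search space can be sketched as follows. The parameter names, ranges, and mutation scales are illustrative assumptions for a boosting model, not the study's actual search space.

```python
import random

random.seed(0)

# Hypothetical mixed-type search space H (illustrative only).
SPACE = {
    "learning_rate": ("real", 0.01, 0.3),
    "n_estimators":  ("int", 100, 1000),
    "max_depth":     ("int", 2, 10),
    "loss":          ("cat", ["squared_error", "absolute_error", "huber"]),
}

def sample():
    """Draw one random hyperparameter vector from SPACE (Algorithm 2, line 1)."""
    theta = {}
    for name, spec in SPACE.items():
        kind = spec[0]
        if kind == "real":
            theta[name] = random.uniform(spec[1], spec[2])
        elif kind == "int":
            theta[name] = random.randint(spec[1], spec[2])
        else:
            theta[name] = random.choice(spec[1])
    return theta

def clip(lo, hi, v):
    return max(lo, min(hi, v))

def mutate_theta(theta, pu=0.3):
    """Per-gene mutation (Algorithm 2, line 12): Gaussian for reals, unit step
    for integers, resampling for categoricals, with repair by clipping."""
    child = dict(theta)
    for name, spec in SPACE.items():
        if random.random() >= pu:
            continue
        kind = spec[0]
        if kind == "real":
            child[name] = clip(spec[1], spec[2], child[name] + random.gauss(0, 0.05))
        elif kind == "int":
            child[name] = clip(spec[1], spec[2], child[name] + random.choice([-1, 1]))
        else:
            child[name] = random.choice(spec[1])
    return child
```

The clipping step plays the role of the feasibility repair described above; for conditionally dependent parameters, the repair would additionally reset genes that the current configuration renders inactive.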

2.9. Ablation Study

An ablation study is a standard methodological tool in empirical machine learning, applied to evaluate the contribution of individual components of a model. The principle is to systematically remove or modify parts of the system and analyse the resulting change in performance. This approach enhances interpretability and clarifies whether the reported improvements can indeed be attributed to the proposed innovations. As Lipton [70] notes, “without careful ablation, it is difficult to establish whether reported gains stem from the claimed mechanisms or uncontrolled factors”. Similarly, Montavon [71] emphasizes that model interpretation methods are indispensable for attributing performance changes to specific mechanisms rather than coincidental interactions. Ablation experiments are thus considered best practice in empirical machine learning, as they strengthen scientific validity and the transparency of the results.
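The ablation principle can be illustrated in a few lines: remove one component, retrain, and attribute the change in error to the removed part. The data below are synthetic and the feature names (distance, loads) are hypothetical stand-ins, not the paper's dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# synthetic data: price driven mostly by distance, partly by load count
rng = np.random.default_rng(1)
n = 400
distance = rng.uniform(50, 2000, n)
loads = rng.integers(1, 6, n)
noise_feat = rng.normal(size=n)                    # deliberately uninformative
y = 0.8 * distance + 40.0 * loads + rng.normal(0, 15, n)

X_full = np.column_stack([distance, loads, noise_feat])
X_ablated = np.column_stack([loads, noise_feat])   # ablation: drop the distance feature

def holdout_mae(X, y):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr)
    return mean_absolute_error(yte, model.predict(Xte))

mae_full = holdout_mae(X_full, y)
mae_ablated = holdout_mae(X_ablated, y)
# removing the dominant cost driver should degrade accuracy markedly
```

The gap between `mae_ablated` and `mae_full` quantifies the contribution of the removed component, which is exactly the attribution logic applied in Section 3.4.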

3. Results

3.1. EDA

Figure 4 shows the distribution of freight service prices (EUR). The distribution is right-skewed, indicating that most service prices are concentrated around lower values, with fewer high-value transactions. The average price (mean) is 818.92 EUR, while the median is slightly lower at 753.00 EUR, further confirming skewness due to higher-priced outliers. Additionally, the 95th percentile is marked at 1600.00 EUR, highlighting that only 5% of services exceed this amount.
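The summary statistics reported for the right-skewed price distribution can be reproduced mechanically; the log-normal sample below is a synthetic stand-in for the real prices.

```python
import numpy as np

# synthetic right-skewed "prices" (log-normal), seeded for reproducibility
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=6.6, sigma=0.5, size=10_000)

mean_price = prices.mean()
median_price = np.median(prices)
p95 = np.percentile(prices, 95)   # only 5% of services exceed this amount
# for a right-skewed distribution, the mean exceeds the median
```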
Figure 5 presents a categorical Pearson correlation heatmap (upper triangle) in which colour encodes the correlation strength—strong (|r| ≥ 0.70), moderate (0.30 ≤ |r| < 0.70), and weak (|r| < 0.30). The annotations report the coefficient and its significance (t-test; * p < 0.05, ** p < 0.01, *** p < 0.001). The total route distance (TOTAL_KM) shows a strong positive association with price (EUR; r = 0.92 **). Shipment complexity measures—the number of loads (QTY_LOADS, r = 0.46 ***) and deliveries (QTY_DELIVERIES, r = 0.46 ***)—are moderate. Other numerical features, including volume (M3, r = 0.19) and height (HEIGHT, r = 0.19), are weak and thus of limited standalone predictive value. These patterns informed the subsequent evolutionary feature selection by prioritizing distance and shipment quantity features.
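The correlation screening behind Figure 5 can be sketched with `scipy.stats.pearsonr`; the strength bands and significance stars follow the thresholds stated above, while the synthetic distance/price data are an assumption for demonstration.

```python
import numpy as np
from scipy.stats import pearsonr

def corr_band(x, y):
    """Pearson r with the heatmap's strength band and significance stars."""
    r, p = pearsonr(x, y)  # p-value from the t-test on r
    band = "strong" if abs(r) >= 0.70 else "moderate" if abs(r) >= 0.30 else "weak"
    stars = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else ""
    return r, band, stars

# synthetic strongly distance-driven prices
rng = np.random.default_rng(0)
total_km = rng.uniform(50, 2000, 500)
price = 0.9 * total_km + rng.normal(0, 100, 500)

r, band, stars = corr_band(total_km, price)
```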

3.2. Evolutionary Feature Selection with Nested Cross-Validation

The effectiveness and stability of the proposed evolutionary feature selection method were assessed using nested cross-validation with a TimeSeriesSplit approach. Figure 6 illustrates the mean absolute percentage error (MAPE) obtained across five distinct cross-validation folds. The evolutionary selection process substantially reduced the dataset dimensionality, selecting between 21 and 25 features from an initial set of 57 candidate predictors. The average MAPE across all validation folds was 7.54%, demonstrating excellent forecasting performance and high stability. Individual fold results ranged from 6.76% to 8.34%, underscoring the robustness and consistency of the evolutionary feature selection method when applied to predicting road freight transportation prices.
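The temporal validation scheme can be sketched with scikit-learn's `TimeSeriesSplit`, which guarantees that every training window lies strictly before its validation window. The data are synthetic; the fold count matches the five folds reported above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit

# synthetic chronologically ordered data (distance -> price)
rng = np.random.default_rng(0)
n = 600
X = rng.uniform(200, 2000, size=(n, 1))
y = 0.9 * X[:, 0] + rng.normal(0, 20, n)

fold_mape = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_mape.append(
        mean_absolute_percentage_error(y[val_idx], model.predict(X[val_idx]))
    )
    # temporal ordering: all training indices precede validation indices
    assert train_idx.max() < val_idx.min()

mean_mape = float(np.mean(fold_mape))
```

Averaging the per-fold errors, as done for the 7.54% figure above, summarizes performance across all five temporal splits.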
A frequency analysis of the features selected by the evolutionary algorithm across the nested cross-validation folds is shown in Figure 7. The analysis reveals that several features were consistently selected across all five folds, indicating their high predictive power and stability in forecasting road freight transport prices. Notably, the features “TEMP_MIN”, “HU_KM_PERC”, “TEMP_MAX”, “START_DELIVERY_TIME_MEAN_PRICE_PER_KM”, “TOTAL_KM”, “ENTRY_WEEKDAY_MEAN_PRICE_PER_KM”, “COD_DP_MEAN_PRICE_PER_KM”, and “COD_LP_MEAN_PRICE_PER_KM” were selected in all five folds. These consistently selected features suggest their critical importance in influencing transport prices. Features selected less frequently, such as “START_LOAD_TIME_MEAN_PRICE_PER_KM” or “WIDTH,” likely have context-dependent or minor effects. This detailed insight emphasizes the strength and stability of the evolutionary feature selection approach, providing clear guidance for selecting relevant predictors for practical forecasting models.

3.3. Results of Evolutionary Hyperparameter Optimization

This study employed an evolutionary algorithm to optimize the hyperparameters of a Gradient Boosting model. An evolutionary approach was chosen due to its ability to efficiently search large hyperparameter spaces within a reasonable timeframe by utilizing mechanisms inspired by natural evolution. The implemented evolutionary algorithm utilized a population of 20 individuals and ran for 10 generations. Each individual represented a set of hyperparameters, specifically optimizing the number of estimators (n_estimators), learning rate, maximum tree depth (max_depth), the proportion of samples used for training each tree (subsample), and the minimum number of samples required to split an internal node (min_samples_split).
The search space was defined as follows: n_estimators ∈ [50, 1000], learning_rate ∈ [0.01, 0.3], max_depth ∈ [3, 15], subsample ∈ [0.5, 1.0], and min_samples_split ∈ [2, 10]. These ranges were selected based on standard practices in the Gradient Boosting literature and confirmed through preliminary experiments.
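The stated search space can be encoded per gene type, as individuals are in the evolutionary algorithm; the sampling helper below is a sketch of how a candidate might be initialized, not the authors' code.

```python
import random

# the search space as stated in the text, tagged by gene type
SEARCH_SPACE = {
    "n_estimators": ("int", 50, 1000),
    "learning_rate": ("float", 0.01, 0.3),
    "max_depth": ("int", 3, 15),
    "subsample": ("float", 0.5, 1.0),
    "min_samples_split": ("int", 2, 10),
}

def sample_candidate(rng):
    """Draw one hyperparameter set uniformly from the search space."""
    out = {}
    for name, (kind, lo, hi) in SEARCH_SPACE.items():
        out[name] = rng.randint(lo, hi) if kind == "int" else rng.uniform(lo, hi)
    return out

candidate = sample_candidate(random.Random(0))
```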
The optimization process is illustrated in Figure 8, which shows the evolution of the average and best mean absolute percentage error (MAPE) scores across generations. There is a clear downward trend in the average population error and rapid stabilization of the best solution after just a few generations.
The best hyperparameters obtained were n_estimators = 581, learning_rate = 0.142, max_depth = 4, subsample = 0.981, and min_samples_split = 10, resulting in an MAPE score of 6.27%.
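The reported best configuration maps directly onto scikit-learn's `GradientBoostingRegressor`. The fit below uses synthetic stand-in data (the study's dataset is not reproduced here), so the resulting error is only illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error

# the best hyperparameters reported in the text
best_params = dict(
    n_estimators=581, learning_rate=0.142, max_depth=4,
    subsample=0.981, min_samples_split=10,
)

# synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.uniform(200, 2000, size=(500, 1))
y = 0.9 * X[:, 0] + rng.normal(0, 20, 500)

model = GradientBoostingRegressor(random_state=0, **best_params)
model.fit(X[:400], y[:400])
mape = mean_absolute_percentage_error(y[400:], model.predict(X[400:]))
```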
Table 6 complements Figure 8 by quantifying the statistical significance of the observed reductions in the MAPE. For the population average curve, the MAPE decreased from 0.07 to 0.06 (−0.30 pp), with a strong negative monotonic trend (Kendall’s τ = −0.78, p < 0.001) and a marginally significant early–late contrast (U = 9, p = 0.05). For the best-of-generation curve, the trend was also significant (τ = −0.58, p = 0.02), but the early–late contrast was not (U = 6, p = 0.33), reflecting the mid-run plateau seen in Figure 8. These results confirm that the optimization consistently improved performance, although most of the gain was achieved within the first generations.
Figure 9 illustrates the evolution of the sMAPE across generations of the evolutionary algorithm. A clear improvement in forecasting accuracy is observed during the initial iterations, followed by stabilization around 6.1% for the average error and approximately 6.06% for the best individual in the population. When compared to the previously reported MAPE results (Figure 8), the conclusions remain consistent—both relative error measures lead to the same insights. This confirms that the results are robust to the choice of error metric variant and are not an artefact of a particular indicator definition.
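The two relative error metrics compared here can be written explicitly. Note that several sMAPE variants exist in the literature; the symmetric form below (with the factor 2 and an absolute-value denominator) is one common convention and may differ in detail from the paper's exact definition.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error (as a fraction)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

def smape(y_true, y_pred):
    """Symmetric MAPE: absolute error scaled by the mean magnitude of both values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred)))

y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 380.0]
m = mape(y_true, y_pred)    # mean of 0.1, 0.05, 0.05
s = smape(y_true, y_pred)   # mean of 20/210, 20/390, 40/780
```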

3.4. Comparison with a Distance-Only Baseline

To further assess the contribution of the proposed approach, an ablation study was conducted. A simple linear regression using only the distance variable (distance-only baseline) was employed as the reference model. The comparison revealed a substantial performance gap: the baseline achieved a mean MAPE of 17.27% (±1.23), whereas the full model with selected features and tuned hyperparameters of the Gradient Boosting Regressor reduced the error to 6.27% (±0.65). These results clearly demonstrate that advanced feature engineering and hyperparameter optimization substantially enhance predictive accuracy compared with a naive distance-only baseline (Figure 10).
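The baseline comparison can be sketched on synthetic data: a linear model seeing only distance versus a boosted model that also sees a shipment-complexity feature. The data-generating process and feature names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# synthetic prices driven by distance plus a load-count component
rng = np.random.default_rng(2)
n = 800
distance = rng.uniform(200, 2000, n)
loads = rng.integers(1, 6, n)
y = 0.7 * distance + 80.0 * loads + rng.normal(0, 20, n)

X_dist = distance.reshape(-1, 1)                 # distance-only baseline input
X_full = np.column_stack([distance, loads])      # richer feature set

Xd_tr, Xd_te, Xf_tr, Xf_te, y_tr, y_te = train_test_split(
    X_dist, X_full, y, test_size=0.25, random_state=0
)

mape_baseline = mean_absolute_percentage_error(
    y_te, LinearRegression().fit(Xd_tr, y_tr).predict(Xd_te)
)
mape_full = mean_absolute_percentage_error(
    y_te, GradientBoostingRegressor(random_state=0).fit(Xf_tr, y_tr).predict(Xf_te)
)
```

The baseline cannot explain the variation contributed by the complexity feature, so its MAPE stays well above that of the full model, mirroring the gap reported in Figure 10.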

4. Discussion

This study provides important insights into the operational and financial dynamics of freight logistics throughout Europe by utilizing a sizable, real-world dataset with 45,569 transport orders and 52 variables. The incorporation of evolutionary algorithms for feature selection and hyperparameter optimization is an important methodological development that improves model interpretability, accuracy, and efficiency.
With a low mean absolute percentage error (MAPE) of 6.27%, the Gradient Boosting model's strong predictive performance confirms that ensemble learning techniques are well suited to capturing the high-dimensional and nonlinear relationships common in freight pricing data. Prior research has also demonstrated that Gradient Boosting performs well on complex regression tasks, particularly in logistics and transportation [15,72]. As suggested in time-aware predictive modelling frameworks, the use of nested cross-validation with a TimeSeriesSplit strategy preserved the temporal dependencies within the dataset, reducing the risk of data leakage and overfitting [73,74].
A key factor in improving model performance was feature engineering. The model was able to capture the main cost drivers in freight transportation by building domain-specific features like TOTAL_KM, TOTAL_PRICE, and PRICE_PER_KM. The remarkably high correlation (r = 0.92) between TOTAL_KM and EUR is consistent with the body of research on the importance of distance in freight cost estimation. The intuitive knowledge that shipment complexity and logistical load have a major impact on final pricing is further supported by the moderate correlations between QTY_LOADS and QTY_DELIVERIES. Additionally, the use of aggregated categorical features, like the average price per kilometre by vehicle or cargo type, is similar to methods that others have found to be successful in transportation cost modelling [75,76].
In addition to qualitative interpretation, several numerical indicators confirm the robustness of the obtained results. The predictive accuracy achieved by the Gradient Boosting model reached an MAPE of 6.27% (±0.65), while a simple distance-only baseline remained at 17.27% (±1.23), an improvement of 11 percentage points. Furthermore, nested cross-validation revealed stable performance, with fold-level errors ranging from 6.76% to 8.34% and an average of 7.54%. The symmetric sMAPE metric converged to approximately 6.1%, confirming that the results are not an artefact of a single error definition. These quantitative measures substantiate the observed feature importance patterns and provide a solid statistical foundation for the qualitative findings discussed above.
This study’s dual application of evolutionary algorithms, for feature selection and for hyperparameter optimization, is among its most significant contributions. By reducing the original feature set from 57 to 21–25 variables per fold, this biologically inspired method enhanced model clarity and computational efficiency. This is in line with [77], which showed that evolutionary strategies outperform conventional selection techniques on large, noisy, and interdependent datasets. The most commonly selected features, including TOTAL_KM, TEMP_MIN/MAX, and time-aggregated variables, highlight the importance of incorporating operational, environmental, and temporal dimensions into freight pricing models.
Certain limitations must be acknowledged despite the strong performance and methodological rigor. Even when carefully imputed, missing values may still introduce uncertainty, especially in features such as payment terms or document delivery method, which may reflect organizational behaviours that are not directly observable. Furthermore, although the current model generalizes well within the historical dataset, it does not explicitly model external shocks such as fuel price fluctuations, geopolitical upheavals, or regulatory changes. According to research in supply chain risk modelling, adding real-time external indicators (such as fuel indices, congestion metrics, or economic forecasts) could further improve predictive robustness [78].

5. Conclusions

This study presents a robust and comprehensive framework for forecasting road freight transport rates using advanced machine learning techniques. The model achieved a high degree of accuracy with a mean absolute percentage error (MAPE) as low as 6.27% by using a Gradient Boosting algorithm that was improved through evolutionary feature selection and hyperparameter optimization. The effectiveness of this method emphasizes how crucial it is to handle high-dimensional and heterogeneous transport data with careful data preprocessing, context-aware feature engineering, and evolutionary algorithms.
According to the research, temporal and categorical aggregations, temperature constraints (TEMP_MIN/MAX), and total transport distance (TOTAL_KM) are important predictors that have a significant impact on transport pricing. These results offer practical insights for freight planners, pricing analysts, and transport platform developers, in addition to being in line with current logistics theory.
This research has practical implications for logistics industry stakeholders, in addition to its technical contributions. Analyses of transport efficiency, cost benchmarking, and dynamic pricing can all be supported by the final model. Because of its interpretability and flexibility, it can also be incorporated into larger logistics optimization frameworks and real-time decision support systems.
In practical terms, the proposed model offers clear benefits for stakeholders in the road freight industry. A carrier can estimate the expected import cost while quoting an export service, a freight manager can identify transport orders that are systematically overpriced by their staff, and a manufacturer can verify whether the company is overpaying for logistics services. These examples demonstrate how predictive analytics can directly support more transparent pricing, improve negotiation processes, and enhance cost efficiency in day-to-day operations.
Future research should explore integrating external data sources, such as fuel price indices, weather conditions, and real-time traffic data, to improve responsiveness to dynamic market conditions. Extending the model to other transport modes (such as rail or maritime freight) and to other geographical areas would also test its generalizability and increase its usefulness across the global supply chain landscape.
This study concludes that domain knowledge, evolutionary machine learning, and structured analytical workflows can be combined to create high-performing, interpretable, and practically applicable models for freight transportation price prediction. This is a crucial ability in the fast-paced, cost-sensitive logistics environment of today.

Author Contributions

Conceptualization, A.B. and M.C.; Methodology, A.B. and M.C.; Validation, A.B. and M.C.; Formal analysis, A.B. and M.C.; Investigation, A.B. and M.C.; Resources, A.B. and M.C.; Data curation, A.B. and M.C.; Writing – review & editing, A.B. and M.C.; Visualization, A.B. and M.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article. The Jupyter notebooks used in this study are available in open access at https://github.com/BudzynskiA/road-freight-price-prediction (accessed on 28 July 2025).

Acknowledgments

The authors express their gratitude to the reviewers for their insightful and constructive feedback, which has improved the paper’s quality and will aid them in progressing their research in this field.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mrabti, N.; Hamani, N.; Delahoche, L. The Pooling of Sustainable Freight Transport. J. Oper. Res. Soc. 2021, 72, 2180–2195. [Google Scholar] [CrossRef]
  2. Meidutė-Kavaliauskienė, I.; Stanujkic, D.; Vasiliauskas, A.V.; Vasilienė-Vasiliauskienė, V. Significance of Criteria and Resulting Significance of Factors Affecting Quality of Services Provided by Lithuanian Road Freight Carriers. Procedia Eng. 2017, 187, 513–519. [Google Scholar] [CrossRef]
  3. Kummer, S.; Dieplinger, M.; Fürst, E. Flagging out in Road Freight Transport: A Strategy to Reduce Corporate Costs in a Competitive Environment. J. Transp. Geogr. 2014, 36, 141–150. [Google Scholar] [CrossRef]
  4. Patel, Z.; Ganu, M.; Kharosekar, R.; Hake, S. Technical paper on dynamic pricing model for freight transportation services. TIJER-Int. Res. J. 2023, 10, 331–337. [Google Scholar]
  5. Kopalle, P.K.; Pauwels, K.; Akella, L.Y.; Gangwar, M. Dynamic Pricing: Definition, Implications for Managers, and Future Research Directions. J. Retail. 2023, 99, 580–593. [Google Scholar] [CrossRef]
  6. Friesz, T.L.; Lin, C.C. Dynamic Spatial Price Equilibrium, Nonlinear Freight Pricing, and Alternative Mathematical Formulations. Netw. Spat. Econ. 2025. [Google Scholar] [CrossRef]
  7. Miller, J.W. ARIMA Time Series Models for Full Truckload Transportation Prices. Forecasting 2018, 1, 121–134. [Google Scholar] [CrossRef]
  8. Kaitouni, O.; Hajjar, B. The Application of the ARIMA Model for Time Series Air Freight Forecasting. Int. J. Logist. Syst. Manag. 2023, 1, 1. [Google Scholar] [CrossRef]
  9. Munim, Z.H. State-Space TBATS Model for Container Freight Rate Forecasting with Improved Accuracy. Marit. Transp. Res. 2022, 3, 100057. [Google Scholar] [CrossRef]
  10. Schmid, L.; Roidl, M.; Kirchheim, A.; Pauly, M. Comparing Statistical and Machine Learning Methods for Time Series Forecasting in Data-Driven Logistics—A Simulation Study. Entropy 2024, 27, 25. [Google Scholar] [CrossRef] [PubMed]
  11. Liachovičius, E.; Šabanovič, E.; Skrickij, V. Freight rate and demand forecasting in road freight transportation using econometric and artificial intelligence methods. Transport 2023, 38, 231–242. [Google Scholar] [CrossRef]
  12. Wu, H.; Gong, C. Modeling the Ningbo Container Freight Index Through Deep Learning: Toward Sustainable Shipping and Regional Economic Resilience. Sustainability 2025, 17, 4655. [Google Scholar] [CrossRef]
  13. Shoukat, R. Economic Impact, Design, and Significance of Intermodal Freight Distribution in Pakistan. Eur. Transp./Trasp. Eur. 2022, 1–14. [Google Scholar] [CrossRef]
  14. Stojanović, Đ. Road Freight Transport Outsourcing Trend in Europe—What Do We Really Know about It? Transp. Res. Procedia 2017, 25, 772–793. [Google Scholar] [CrossRef]
  15. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
  16. Izadi, A.; Nabipour, M.; Titidezh, O. Cost Models and Cost Factors of Road Freight Transportation: A Literature Review and Model Structure. Fuzzy Inf. Eng. 2020, 11, 257–278. [Google Scholar] [CrossRef]
  17. Ślusarczyk, B. Logistics Challenges: Global and Local Approaches, 1st ed.; Politechnika Częstochowska: Częstochowa, Poland, 2024; ISBN 978-83-65976-11-6. [Google Scholar]
  18. Levinson, D.M.; Corbett, M.J.; Hashami, M. Operating Costs for Trucks. 2005. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1736159 (accessed on 7 July 2025).
  19. Nash, C.; Sansom, T. Pricing European Transport Systems: Recent Developments and Evidence from Case Studies. J. Transp. Econ. Policy (JTEP) 2001, 35, 363–380. [Google Scholar]
  20. ATRI. An Analysis of the Operational Costs of Trucking: 2023 Update. Available online: https://www.ams.usda.gov/sites/default/files/media/FMMO_NMPF_53A.pdf (accessed on 7 July 2025).
  21. Abate, M. Determinants of Capacity Utilisation in Road Freight Transportation. J. Transp. Econ. Policy (JTEP) 2014, 48, 137–152. [Google Scholar]
  22. Galkin, A.; Olkhova, M.; Iwan, S.; Kijewska, K.; Ostashevskyi, S.; Lobashov, O. Planning the Rational Freight Vehicle Fleet Utilization Considering the Season Temperature Factor. Sustainability 2021, 13, 3782. [Google Scholar] [CrossRef]
  23. Samchuk, G.; Kopytkov, D.; Rossolov, A. Freight Fleet Management Problem: Evaluation of a Truck Utilization Rate Based on Agent Modeling. Komunikácie 2022, 24, D46–D58. [Google Scholar] [CrossRef]
  24. Zgonc, B.; Tekavčič, M.; Jakšič, M. The Impact of Distance on Mode Choice in Freight Transport. Eur. Transp. Res. Rev. 2019, 11, 10. [Google Scholar] [CrossRef]
  25. Álvarez, P.; Serrano-Hernandez, A.; Lerga, I.; Faulin, J. Optimizing Freight Delivery Routes: The Time-Distance Dilemma. Transp. Res. Part. A Policy Pract. 2024, 190, 104283. [Google Scholar] [CrossRef]
  26. Mommens, K.; Van Lier, T.; Macharis, C. Multimodal Choice Possibilities for Different Cargo Types: Application to Belgium. Res. Transp. Bus. Manag. 2020, 37, 100528. [Google Scholar] [CrossRef]
  27. Mommens, K.; Van Lier, T.; Macharis, C. Loading Unit in Freight Transport Modelling. Procedia Comput. Sci. 2016, 83, 921–927. [Google Scholar] [CrossRef]
  28. De Jong, G. Value of Time in Freight Transport. In International Encyclopedia of Transportation; Elsevier: Amsterdam, The Netherlands, 2021; pp. 321–325. [Google Scholar] [CrossRef]
  29. Sert, E.; Hedayatifar, L.; Rigg, R.A.; Akhavan, A.; Buchel, O.; Saadi, D.E.; Kar, A.A.; Morales, A.J.; Bar-Yam, Y. Freight Time and Cost Optimization in Complex Logistics Networks. Complexity 2020, 2020, 1–11. [Google Scholar] [CrossRef]
  30. Stoop, K.; Pickavet, M.; Colle, D.; Audenaert, P. Selective Backhauls in Truck Transport with Risk Mitigation: Large Belgian Retailer Case Study. Netw. Spat. Econ. 2024, 24, 99–130. [Google Scholar] [CrossRef]
  31. Guerrero, D.; Itoh, H.; Tsubota, K. Freight Rates up and down the Urban Hierarchy. Res. Transp. Bus. Manag. 2022, 45, 100775. [Google Scholar] [CrossRef]
  32. Gohari, A.; Matori, N.; Wan Yusof, K.; Toloue, I.; Cho Myint, K. Effects of the Fuel Price Increase on the Operating Cost of Freight Transport Vehicles. E3S Web Conf. 2018, 34, 01022. [Google Scholar] [CrossRef]
  33. Winebrake, J.J.; Green, E.H.; Comer, B.; Li, C.; Froman, S.; Shelby, M. Fuel Price Elasticities in the U.S. Combination Trucking Sector. Transp. Res. Part. D Transp. Environ. 2015, 38, 166–177. [Google Scholar] [CrossRef]
  34. Santos, G. Road Fuel Taxes in Europe: Do They Internalize Road Transport Externalities? Transp. Policy 2017, 53, 120–134. [Google Scholar] [CrossRef]
  35. Carlan, V.; Sys, C.; Vanelslander, T. Innovation in Road Freight Transport: Quantifying the Environmental Performance of Operational Cost-Reducing Practices. Sustainability 2019, 11, 2212. [Google Scholar] [CrossRef]
  36. Friedt, F.L.; Wilson, W.W. Trade, Transport Costs and Trade Imbalances: An Empirical Examination of International Markets and Backhauls. Can. J. Econ. /Rev. Can. D’économique 2020, 53, 592–636. [Google Scholar] [CrossRef]
  37. Raju, T.B.; Chauhan, P.; Tiwari, S.; Kashav, V. Seasonality in Freight Rates. J. Int. Logist. Trade 2020, 18, 149–157. [Google Scholar] [CrossRef]
  38. Chu, H.C. Effects of Extreme Weather and Economic Factors on Freight Transportation. Adv. Manag. Appl. Econ. 2016, 6, 113. [Google Scholar]
  39. Ahmady, M.; Eftekhari Yeghaneh, Y. Optimizing the Cargo Flows in Multi-Modal Freight Transportation Network Under Disruptions. Iran. J. Sci. Technol. Trans. Civ. Eng. 2022, 46, 453–472. [Google Scholar] [CrossRef]
  40. Monge, M.; Romero Rojo, M.F.; Gil-Alana, L.A. The Impact of Geopolitical Risk on the Behavior of Oil Prices and Freight Rates. Energy 2023, 269, 126779. [Google Scholar] [CrossRef]
  41. Marcucci, E.; Gatta, V.; Simoni, M.; Maltese, I. Pricing in Freight Transport. In Handbook on Transport Pricing and Financing; Edward Elgar Publishing: Cheltenham, UK, 2023; pp. 229–251. [Google Scholar]
  42. Widodo, A.; Budi, I.; Widjaja, B. Automatic Lag Selection in Time Series Forecasting Using Multiple Kernel Learning. Int. J. Mach. Learn. Cyber. 2016, 7, 95–110. [Google Scholar] [CrossRef]
  43. Mrowczynska, B.; Ciesla, M.; Krol, A.; Sladkowski, A. Application of Artificial Intelligence in Prediction of Road Freight Transportation. PROMET 2017, 29, 363–370. [Google Scholar] [CrossRef]
  44. Züfle, M.; Kounev, S. A Framework for Time Series Preprocessing and History-Based Forecasting Method Recommendation. In Proceedings of the Annals of Computer Science and Information Systems, Sofia, Bulgaria, 6–9 September 2020; Volume 21, pp. 141–144. [Google Scholar]
  45. Martínez, F.; Frías, M.P.; Pérez, M.D.; Rivera, A.J. A Methodology for Applying K-Nearest Neighbor to Time Series Forecasting. Artif. Intell. Rev. 2019, 52, 2019–2037. [Google Scholar] [CrossRef]
  46. Kim, N.; Cha, J.; Jeon, J. A Comparative Evaluation of Machine Learning Approaches for Container Freight Rates Prediction. Asian J. Shipp. Logist. 2025, 41, 99–109. [Google Scholar] [CrossRef]
  47. Guo, H.; Zhang, Y.; Yu, Y.; Wang, L.; Jia, P.; Pedrycz, W. A Decomposition–Integration Interval Prediction Strategy for Iron Ore Shipping Freight Rates with Reinforcement Learning. Eng. Appl. Artif. Intell. 2025, 144, 111442. [Google Scholar] [CrossRef]
  48. Kjeldsberg, F.; Haque Munim, Z. Automated Machine Learning Driven Model for Predicting Platform Supply Vessel Freight Market. Comput. Ind. Eng. 2024, 191, 110153. [Google Scholar] [CrossRef]
  49. Saeed, N.; Nguyen, S.; Cullinane, K.; Gekara, V.; Chhetri, P. Forecasting Container Freight Rates Using the Prophet Forecasting Method. Transp. Policy 2023, 133, 86–107. [Google Scholar] [CrossRef]
  50. Wang, W.; He, N.; Chen, M.; Jia, P. Freight Rate Index Forecasting with Prophet Model Based on Multi-Dimensional Significant Events. Expert. Syst. Appl. 2024, 249, 123451. [Google Scholar] [CrossRef]
  51. Lechtenberg, S.; Hellingrath, B. Guiding Practitioners of Road Freight Transport to Implement Machine Learning for Operational Planning Tasks. Transp. Res. Procedia 2025, 82, 1839–1857. [Google Scholar] [CrossRef]
  52. Budak, A.; Sarvari, P.A. Profit Margin Prediction in Sustainable Road Freight Transportation Using Machine Learning. J. Clean. Prod. 2021, 314, 127990. [Google Scholar] [CrossRef]
  53. Yin, K.; Guo, H.; Yang, W. A Novel Real-Time Multi-Step Forecasting System with a Three-Stage Data Preprocessing Strategy for Containerized Freight Market. Expert. Syst. Appl. 2024, 246, 123141. [Google Scholar] [CrossRef]
  54. El Ouadi, J.; Malhene, N.; Benhadou, S.; Medromi, H. Towards a Machine-Learning Based Approach for Splitting Cities in Freight Logistics Context: Benchmarks of Clustering and Prediction Models. Comput. Ind. Eng. 2022, 166, 107975. [Google Scholar] [CrossRef]
  55. Johnson, P.M.; Barbour, W.; Camp, J.V.; Baroud, H. Using Machine Learning to Examine Freight Network Spatial Vulnerabilities to Disasters: A New Take on Partial Dependence Plots. Transp. Res. Interdiscip. Perspect. 2022, 14, 100617. [Google Scholar] [CrossRef]
  56. Budzyński, A. Forecasting Prices for Road Freight Transport Services Using Machine Learning. Ph.D. Thesis, Silesian University of Technology, Katowice, Poland, 2025. Available online: https://bip.polsl.pl/wp-content/uploads/sites/4/2025/02/Praca-doktorska-mgr-inz.-Artur-Budzynski.pdf (accessed on 15 July 2025).
  57. Budzyński, A.; Cieśla, M. Application of a machine learning model for forecasting freight rate in road transport. Sci. J. Silesian Univ. Technol. Ser. Transp. SJSUT ST 2025, 126, 23–48. [Google Scholar] [CrossRef]
  58. Budzyński, A.; Sładkowski, A. Machine Learning in Road Freight Transport Management. In Using Artificial Intelligence to Solve Transportation Problems; Sładkowski, A., Ed.; Studies in Systems, Decision and Control; Springer Nature: Cham, Switzerland, 2024; Volume 563, pp. 485–565. ISBN 978-3-031-69486-8. [Google Scholar]
  59. Python. Available online: https://www.python.org (accessed on 23 July 2025).
  60. Kluyver, T.; Ragan-Kelley, B.; Pérez, F.; Granger, B.E.; Bussonnier, M.; Frederic, J.; Kelley, K.; Hamrick, J.B.; Grout, J.; Corlay, S.; et al. Jupyter Notebooks—A Publishing Format for Reproducible Computational Workflows. In Proceedings of the International Conference on Electronic Publishing, Göttingen, Germany, 7–9 June 2016. [Google Scholar]
  61. McKinney, W. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; pp. 56–61. [Google Scholar]
  62. Harris, C.R.; Millman, K.J.; Van Der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array Programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef]
  63. Hunter, J.D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 2007, 9, 90–95. [Google Scholar] [CrossRef]
  64. Python Holidays Library Documentation. Available online: https://pypi.org/project/holidays/ (accessed on 23 July 2025).
  65. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Statist. 2001, 29. [Google Scholar] [CrossRef]
  66. Scikit-Learn: Gradient Boosting Documentation. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html (accessed on 23 July 2025).
  67. Armstrong, J.S.; Collopy, F. Error Measures for Generalizing about Forecasting Methods: Empirical Comparisons. Int. J. Forecast. 1992, 8, 69–80. [Google Scholar] [CrossRef]
  68. Fortin, F.-A.; Rainville, F.-M.D.; Gardner, M.-A.; Parizeau, M.; Gagné, C. DEAP: Evolutionary Algorithms Made Easy. J. Mach. Learn. Res. 2012, 13, 2171–2175. [Google Scholar]
  69. Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688. [Google Scholar] [CrossRef]
  70. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Pereira, F., Burges, C.J., Bottou, L., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25. [Google Scholar]
  71. Montavon, G.; Samek, W.; Müller, K.-R. Methods for Interpreting and Understanding Deep Neural Networks. Digit. Signal Process. 2018, 73, 1–15. [Google Scholar] [CrossRef]
  72. Hlavatý, R.; Brozova, H. Robust Optimization Approach in Transportation Problem. In Proceedings of the 35th International Conference Mathematical Methods in Economics, Hradec Králové, Czech Republic, 16–17 September 2017; pp. 225–230. [Google Scholar]
  73. Varma, S.; Simon, R. Bias in Error Estimation When Using Cross-Validation for Model Selection. BMC Bioinform. 2006, 7, 91. [Google Scholar] [CrossRef]
  74. Ranganathan, S.; Gribskov, M.; Nakai, K.; Schönbach, C. (Eds.) Encyclopedia of Bioinformatics and Computational Biology; Elsevier: Amsterdam, The Netherlands, 2019; ISBN 978-0-12-811432-2. [Google Scholar]
  75. Chou, J.-S. Generalized Linear Model-Based Expert System for Estimating the Cost of Transportation Projects. Expert. Syst. Appl. 2009, 36, 4253–4267. [Google Scholar] [CrossRef]
  76. Bodendorf, F.; Merkl, P.; Franke, J. Intelligent Cost Estimation by Machine Learning in Supply Management: A Structured Literature Review. Comput. Ind. Eng. 2021, 160, 107601. [Google Scholar] [CrossRef]
  77. Abd-Alsabour, N. A Review on Evolutionary Feature Selection. In Proceedings of the 2014 European Modelling Symposium, Pisa, Italy, 21–23 October 2014; pp. 20–26. [Google Scholar]
  78. Ivanov, D.; Dolgui, A. Viability of Intertwined Supply Networks: Extending the Supply Chain Resilience Angles towards Survivability. A Position Paper Motivated by COVID-19 Outbreak. Int. J. Prod. Res. 2020, 58, 2904–2915. [Google Scholar] [CrossRef]
Figure 1. Research workflow diagram.
Figure 1. Research workflow diagram.
Mathematics 13 02964 g001
Figure 2. Top features by percentage of missing values.
Figure 2. Top features by percentage of missing values.
Mathematics 13 02964 g002
Figure 3. Feature extraction pipeline diagram.
Figure 4. Distribution of the target variable price [EUR].
Figure 5. Pearson correlation heatmap with categorical strength bands; asterisks denote t-test significance (* p < 0.05, *** p < 0.001).
Figure 6. MAPE values across cross-validation folds.
Figure 7. Frequency of feature selection by evolutionary algorithm across CV folds.
Figure 8. Evolution of MAPE across generations.
Figure 9. Evolution of sMAPE across generations.
Figure 10. Ablation: MAPE ± 1 SD—full vs. distance only.
Table 1. Key features influencing road freight rates.

| Type | Feature | Explanation | Reference |
|---|---|---|---|
| Internal | Labour | Higher wages and benefits increase operational costs. | Levinson et al. [18], Nash and Sinsom [19] |
| Internal | Vehicle type and capacity utilization | Improved capacity usage reduces the cost per unit, whereas under-utilized trucks raise expenses. | ATRI [20], Abate [21], Galkin et al. [22], Samchuk et al. [23] |
| Internal | Distance and route efficiency | Longer distances and difficult routes increase time and fuel consumption, raising rates. | Zgonc et al. [24], Álvarez et al. [25] |
| Internal | Type of cargo | Perishables, hazardous materials, or high-value goods require special handling, leading to higher freight charges. | Mommens et al. [26,27] |
| Internal | Delivery speed and reliability | Express deliveries cost more due to prioritization and tighter scheduling. | De Jong [28], Sert et al. [29] |
| Internal | Backhaul opportunities | Availability of return cargo (backhaul) affects rate optimization. | Stoop et al. [30], Guerrero et al. [31] |
| External | Fuel prices | Rising fuel costs elevate transportation expenses, frequently transferred to consumers via fuel surcharges. | Gohari et al. [32], Winebrake [33] |
| External | Regulatory and legal factors | Tolls, taxes, emissions regulations, and labour laws increase freight costs. | Santos [34], Carlan et al. [35] |
| External | Market demand and supply (capacity) | Freight rates rise during peak seasons or when trucks are in short supply. | Friedt and Wilson [36] |
| External | Seasonality and weather conditions | Adverse weather and seasonal peaks (e.g., holidays, harvests) impact availability and cost. | Raju et al. [37], Chu [38] |
| External | Geopolitical events and disruptions | Unforeseen events can severely constrain capacity or increase risks (e.g., COVID-19). | Ahmady et al. [39], Monge et al. [40], Marcucci et al. [41] |
Table 2. Machine learning applications in recent freight transportation research.

| Year | Research Works | Domain/Topic | ML Models Used | Key Findings | Specifics |
|---|---|---|---|---|---|
| 2025 | Kim et al. [46] | Container Freight Rate Prediction | Decision tree, Random Forest, LSTM, Prophet | Assesses the predictive capabilities of four models in estimating container freight rates. The decision tree demonstrated greater relative precision. | Concentrates on predicting container shipping costs for informed decision-making in the maritime sector. |
| 2024 | Guo et al. [47] | Iron Ore Shipping Freight Rate Prediction | Reinforcement learning (Q-learning), Quantile Regression Neural Network (QRNN), GARCH, Extreme Learning Machine (ELM), Long Short-Term Memory (LSTM), STL (Seasonal–Trend decomposition using Loess) | Suggests a "decomposition–integration" approach for forecasting intervals of iron ore shipping freight rates. Integrates decomposition techniques with reinforcement learning to dynamically adjust the weighting of prediction outputs, enhancing precision and dependability for fluctuating freight rate data. | Focuses on handling the volatility and non-stationarity of iron ore shipping freight rate data for investment decisions and risk management. |
| 2024 | Kjeldsberg and Haque Munim [48] | Platform Supply Vessel (PSV) Freight Market Prediction | Automated Machine Learning (AutoML) frameworks, Eureqa Generalized Additive Model, eXtreme Gradient Boosted Trees Regressor, Ridge Regressor with Forecast Distance Modelling | Explores AutoML for forecasting PSV time charter freight rates. Identifies 43 relevant influencing factors and tests 79 complex ML models. | Focuses on predicting PSV time charter freight rates for the offshore oil and gas industry. |
| 2023 | Saeed et al. [49] | Container Freight Rate Forecasting | Prophet, Natural Language Processing (NLP) with zero-shot learning | Applies Prophet forecasting by incorporating categorized disruptive events (extracted using ML and NLP) to improve accuracy on six major container routes. | Concentrates on the impact of major disruptive incidents on container freight rates and how to integrate them into predictive models. |
| 2024 | Wang et al. [50] | Baltic Dry Index (BDI) Forecasting | Prophet model | Investigates BDI forecasting by considering the impact of multi-dimensional significant events using the Prophet model. Establishes a "significant event database". | Aims to support shipping market stakeholders in understanding risks and making informed decisions regarding international dry bulk shipping rates. |
| 2025 | Lechtenberg and Hellingrath [51] | Implementation of ML in Road Freight Transport Operational Planning | No specific models; discusses ML implementation in general | Provides direction for professionals to recognize and execute machine learning applications for operational planning activities in road freight transport, tackling issues such as limited ML expertise. | Concentrates on bridging the gap between theoretical ML capabilities and real-world implementation in road freight operational planning. |
| 2021 | Budak and Sarvari [52] | Profit Margin Prediction in Sustainable Road Freight Transportation | ML (general), hybrid ML methodologies | Predicts profit margin for freight trucking in sustainable road transportation. Determines variables affecting profit margin and provides a decision support model for managers. | Aims to create sustainable road freight transport strategies and assist managers in making decisions concerning profit margins within this framework. |
| 2024 | Yin et al. [53] | ML Applications in Freight and Logistics Research | Extreme Learning Machine (ELM), Convolutional Neural Network (CNN) | Novel real-time multi-step forecasting system with 3-stage data preprocessing; superior for streaming data with concept drift. | Primarily focuses on the China Containerized Freight Index (CCFI) as a crucial indicator for container freight rates and the global shipping market. |
| 2022 | El Ouadi et al. [54] | Urban Logistics, Freight Consolidation | K-means (clustering), Support Vector Machine (SVM) (forecasting) | Sequential ML approach (clustering + forecasting) for urban land splitting; K-means for clustering, SVM most efficient for forecasting. | Directly addresses urban logistics challenges, such as freight consolidation and zoning. |
| 2022 | Johnson et al. [55] | Freight Network Vulnerability | Gradient Boosting Machines | Simulation-based approach using partial dependence plots (from GBM) to infer spatial vulnerabilities to area-spanning disruptions. | Examines the US multimodal freight transportation network and its susceptibility to various disruptions, such as extreme weather events. |
Table 3. Raw data overview.

| Column Name | Data Type |
|---|---|
| AT_KM, BE_KM, CZ_KM, DE_KM, DK_KM, EE_KM, ES_KM, FI_KM, HR_KM, FR_KM, HU_KM, IT_KM, LT_KM, LV_KM, NL_KM, PL_KM, RO_KM, SE_KM, SI_KM, SK_KM | float64 |
| COD_LP, COD_DP, ROUTE_TYPE, START_LOAD_TIME, END_LOAD_TIME, START_DELIVERY_TIME, END_DELIVERY_TIME, VEHICLE_TYPE, BODY_TYPE, LOAD_UNLOAD_METHOD, GOODS_TYPE, CARGO_TYPE, DOCUMENTS_BY | object |
| START_LOAD_DATE, END_LOAD_DATE, START_DELIVERY_DATE, END_DELIVERY_DATE, TIME_OF_ENTRY | datetime64 [ns] |
| EPALE, QTY_LOADS, QTY_DELIVERIES, CUSTOMS | int64 |
| TEMP_MIN, TEMP_MAX, EUR, LDM, M3, HEIGHT, WIDTH, TONS, OTHER_COSTS, PAYMENT_TERM | float64 |
Table 4. Overview of data transformations and imputation methods applied.

| Original Feature(s) | Transformation Operation | Resulting Feature(s) |
|---|---|---|
| Kilometre columns (AT_KM to SK_KM) | Calculation of percentage share of total kilometres (TOTAL_KM) | Percentage columns (AT_KM_PERC to SK_KM_PERC) |
| TEMP_MIN | Missing value imputation with minimum observed value | Imputed TEMP_MIN |
| TEMP_MAX | Missing value imputation with maximum observed value | Imputed TEMP_MAX |
| PAYMENT_TERM | Missing value imputation with mean observed value | Imputed PAYMENT_TERM |
| Other features | Missing value imputation with mean observed value | Imputed features |
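The transformations in Table 4 can be sketched in a few lines of pandas. This is an illustrative reimplementation rather than the authors' code: the column names follow Table 3, and the helper name `preprocess` is our own.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the Table 4 transformations: per-country distance shares
    and simple statistical imputation of missing values."""
    df = df.copy()
    # Country distance columns (AT_KM ... SK_KM), captured before TOTAL_KM exists.
    km_cols = [c for c in df.columns if c.endswith("_KM")]
    df["TOTAL_KM"] = df[km_cols].sum(axis=1)
    # Percentage share of each country's kilometres in the total route length.
    for c in km_cols:
        df[f"{c}_PERC"] = df[c] / df["TOTAL_KM"] * 100
    # Boundary-value imputation for the temperature constraints.
    df["TEMP_MIN"] = df["TEMP_MIN"].fillna(df["TEMP_MIN"].min())
    df["TEMP_MAX"] = df["TEMP_MAX"].fillna(df["TEMP_MAX"].max())
    # Mean imputation for PAYMENT_TERM and the remaining numeric features.
    num_cols = df.select_dtypes("number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].mean())
    return df
```

Computing `km_cols` before `TOTAL_KM` is added keeps the total itself out of the share calculation.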
Table 5. Overview of data aggregation methods applied.

| Original Feature(s) | Aggregation Operation | Resulting Feature(s) |
|---|---|---|
| COD_LP | Mean PRICE_PER_KM per loading point | COD_LP_MEAN_PRICE_PER_KM |
| COD_DP | Mean PRICE_PER_KM per delivery point | COD_DP_MEAN_PRICE_PER_KM |
| ROUTE_TYPE | Mean PRICE_PER_KM per route type | ROUTE_TYPE_MEAN_PRICE_PER_KM |
| VEHICLE_TYPE | Mean PRICE_PER_KM per vehicle type | VEHICLE_TYPE_MEAN_PRICE_PER_KM |
| BODY_TYPE | Mean PRICE_PER_KM per body type | BODY_TYPE_MEAN_PRICE_PER_KM |
| LOAD_UNLOAD_METHOD | Mean PRICE_PER_KM per load/unload method | LOAD_UNLOAD_METHOD_MEAN_PRICE_PER_KM |
| GOODS_TYPE | Mean PRICE_PER_KM per goods type | GOODS_TYPE_MEAN_PRICE_PER_KM |
| CARGO_TYPE | Mean PRICE_PER_KM per cargo type | CARGO_TYPE_MEAN_PRICE_PER_KM |
| DOCUMENTS_BY | Mean PRICE_PER_KM per document delivery method | DOCUMENTS_BY_MEAN_PRICE_PER_KM |
| START_LOAD_TIME | Mean PRICE_PER_KM per rounded loading start hour | START_LOAD_TIME_MEAN_PRICE_PER_KM |
| END_LOAD_TIME | Mean PRICE_PER_KM per rounded loading end hour | END_LOAD_TIME_MEAN_PRICE_PER_KM |
| START_DELIVERY_TIME | Mean PRICE_PER_KM per rounded delivery start hour | START_DELIVERY_TIME_MEAN_PRICE_PER_KM |
| END_DELIVERY_TIME | Mean PRICE_PER_KM per rounded delivery end hour | END_DELIVERY_TIME_MEAN_PRICE_PER_KM |
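The aggregation in Table 5 is a form of target-mean encoding. A minimal pandas sketch is given below; it is our own illustration, not the study's implementation — the function name, the train/test argument structure, and the global-mean fallback for unseen categories are assumptions (the fallback is a standard leakage precaution, not a documented detail of the paper).

```python
import pandas as pd


def add_mean_price_features(train: pd.DataFrame, holdout: pd.DataFrame,
                            cat_cols: list[str]) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Target-mean encoding as in Table 5: map each level of a categorical
    column to its mean PRICE_PER_KM observed on the training split only,
    so the holdout split never contributes to its own encoding."""
    train, holdout = train.copy(), holdout.copy()
    global_mean = train["PRICE_PER_KM"].mean()
    for col in cat_cols:
        means = train.groupby(col)["PRICE_PER_KM"].mean()
        new_col = f"{col}_MEAN_PRICE_PER_KM"
        train[new_col] = train[col].map(means)
        # Categories unseen in training fall back to the global mean.
        holdout[new_col] = holdout[col].map(means).fillna(global_mean)
    return train, holdout
```

Computing the category means on the training split only is what keeps these aggregated features usable inside a time-series cross-validation without target leakage.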
Table 6. Statistical tests of MAPE reduction across generations during evolutionary hyperparameter optimization.

| Curve | Start MAPE | End MAPE | Drop (pp) | Kendall τ | p (Kendall) | Mann–Whitney U | p (U-Test) |
|---|---|---|---|---|---|---|---|
| AVG | 0.07 | 0.06 | 0.30 | −0.78 | 0.00 | 9.00 | 0.05 |
| BEST | 0.06 | 0.06 | 0.07 | −0.58 | 0.02 | 6.00 | 0.33 |
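The trend statistics in Table 6 can in principle be reproduced with SciPy. The sketch below is illustrative only: the per-generation MAPE series and the early-vs-late split for the Mann–Whitney test are assumptions, not the study's exact procedure.

```python
import numpy as np
from scipy.stats import kendalltau, mannwhitneyu


def trend_tests(mape_per_generation):
    """Monotone-trend tests in the style of Table 6 on a per-generation
    MAPE series: Kendall's tau of MAPE against the generation index, plus
    a one-sided Mann-Whitney U test comparing early vs. late generations."""
    y = np.asarray(mape_per_generation, dtype=float)
    gens = np.arange(len(y))
    # Negative tau indicates MAPE declines as generations progress.
    tau, p_tau = kendalltau(gens, y)
    half = len(y) // 2
    # One-sided test: early-generation MAPE stochastically greater than late.
    u, p_u = mannwhitneyu(y[:half], y[half:], alternative="greater")
    return tau, p_tau, u, p_u
```

A significantly negative τ together with a small one-sided U-test p-value is what supports the paper's claim of a genuine MAPE reduction rather than random fluctuation.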

Share and Cite

MDPI and ACS Style

Budzyński, A.; Cieśla, M. Enhancing Road Freight Price Forecasting Using Gradient Boosting Ensemble Supervised Machine Learning Algorithm. Mathematics 2025, 13, 2964. https://doi.org/10.3390/math13182964

