Article

A Comparative Study of Machine Learning Algorithms for Industry-Specific Freight Generation Model

Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
* Author to whom correspondence should be addressed.
Sustainability 2022, 14(22), 15367; https://doi.org/10.3390/su142215367
Submission received: 7 October 2022 / Revised: 26 October 2022 / Accepted: 16 November 2022 / Published: 18 November 2022
(This article belongs to the Section Sustainable Transportation)

Abstract

According to the Bureau of Transportation Statistics, the U.S. transportation system handled 14,329 million ton-miles of freight per day in 2020. Understanding the generation of these freight shipments is crucial for transportation researchers, planners, and policymakers to design and plan for a more efficient and connected freight transportation system. Traditionally, freight generation modeling has been based on Ordinary Least Squares (OLS) regression, although more advanced Machine Learning (ML) algorithms have been evaluated and proven to perform well in various transportation applications in recent years. Furthermore, a modeling approach applied to one industry might not always be applicable to another, as their freight generation logics can be quite different. The objective of this study is to apply and evaluate alternative ML algorithms in the estimation of freight generation for each of 45 industry types. Seven alternative ML algorithms, along with the baseline OLS regression, were evaluated and compared. In addition, the study considered different combinations of variables, in both original and logarithmic form, as well as hyperparameters of those ML algorithms in the model selection for each industry type. The results showed statistically significant reductions in root mean square error by the alternative ML algorithms over the OLS for over 80% of cases. The study suggests that utilizing the alternative ML algorithms can reduce the root mean square error by up to about 30%, depending on the industry type.

1. Introduction

Freight transportation is a critical link in the supply chain of goods. It connects industry production to demand and directly or indirectly affects national and regional economic productivity and growth. The Bureau of Transportation Statistics (BTS) indicates that the U.S. transportation system handled 14,329 million ton-miles of freight per day in 2020 [1]. Understanding the generation of these freight shipments, where they originate and where they terminate, is crucial for freight transportation researchers, planners, and policymakers to design and plan for a more efficient and connected freight transportation system. Note that the term “freight generation”, commonly used in the transportation field, includes shipments both originated by (production) and terminated at (attraction) an industry in this study.
In view of freight data needs, BTS has conducted the quinquennial Commodity Flow Survey (CFS) since 1993 [2]. It is the only publicly available national survey on goods movement in the U.S., providing national-, state-, and metropolitan-level data on freight shipments by industry sector. The CFS data offer a comprehensive overview of national freight generation and movement. Due to cost and other constraints, the CFS is conducted every five years and publishes data at the state/metropolitan level. Although the CFS filled a large gap in freight data, the transportation community has been expressing its desire for more timely data with granular geography for over a decade. To this end, freight generation models are frequently adopted by transportation analysts to estimate the quantity or value of goods generated from and/or attracted to a region. These models enable disaggregating the existing CFS data to local levels (e.g., county) and provide freight estimates for the intermediate years between the CFS surveys.
This study utilized tonnage and value from the most recently released 2017 CFS data for 45 industry sectors as dependent variables and proposed industry-specific freight generation models based on industry-related factors such as the number of establishments, annual payroll, number of employees, and receipt total. Traditionally, freight generation modeling approaches are based on Ordinary Least Squares (OLS) regression [3,4,5]. While more complex Machine Learning (ML) algorithms have been evaluated and proven to perform well in various transportation applications in recent years, to the best of the authors’ knowledge, no research has been done on adopting alternative ML models in freight generation estimation. The objective of this study is, therefore, to apply and evaluate alternative ML algorithms in freight generation estimation.
Seven alternative ML algorithms, along with OLS regression, were evaluated in this study. This research explored various combinations of variables in both original and logarithmic forms, algorithms, and corresponding hyperparameters. Then, the study proposed a selection method to choose the best combination by industry to generate industry-specific models. The selection method considers those alternative algorithms, other than OLS regression (the baseline approach), only when the improvement of model performance is statistically significant. If not significant, the OLS is selected as it has the advantage in terms of interpretability, compared to more complex ML algorithms.
The paper is structured in seven sections. The next section presents a literature review on general approaches and data sources used for freight generation modeling, as well as the application of ML models. Section 3 (Data Sources) summarizes the data used in this study. The ML algorithms adopted in this study and the baseline OLS regression are elaborated in the section after. The following section describes the data processing and model selection procedure. Then, Section 6 (Model Results) discusses the model performance results and summarizes the final model selection for each industry. The final section concludes this study.

2. Literature Review

There exist two major classes of freight generation models, in terms of dependent variables. The classes are freight generation (FG) and freight trip generation (FTG). The FG models often focus on the weight or value of freight (e.g., tons/year) whereas the FTG models focus on the number of freight vehicle trips (e.g., truck trips/year). FG models are a better representation of the regional- or national-level economic activities given the capability to reflect the intensity of production and consumption of goods. Table 1 summarizes past studies where FG is modeled as weight. Due to the scope of the current study, studies on FTG are not included in the review.
As for the scope of the analysis, all the FG models presented in Table 1 were estimated at a regional level. Some models were also estimated at industry- and commodity-specific levels. This aggregate modeling has the advantage of predicting FG from regional economic and other related characteristics. However, this approach may result in certain aggregation biases in the estimated FG data. The alternative is to model using disaggregated data. The estimation of these disaggregated models, however, requires establishment-level freight generation data. These data are often collected through surveys for the freight generating facilities in the study region.
Several explanatory variables were used for FG modeling in past studies. These include employment, establishment size, annual payroll by industry sector, gross floor area, population/population density, port influence, and land use. Among all these, employment is invariably considered as the most preferred explanatory variable. Establishment size and payroll are often considered along with employment. Several studies performed FG modeling on fixed variables without exploring the impact of variable selections on the model output or fit. Furthermore, none of the studies considered receipts total (an important economic characteristic at industry/establishment level) as an explanatory variable.
Additionally, the majority of the studies utilized OLS regression to model FG due to its ability to explain the relationship between freight activity and explanatory variables, as the regression coefficients directly represent impacts on the model estimates (refer to Section 6.4). The other methods used in the literature are Spatial Regression, Multiple Classification Analysis, One-way ANOVA, and the Spatial Autoregressive Model. All these statistical approaches make strict assumptions about the data. Furthermore, a number of the existing studies estimated models in which explanatory variables affect FG linearly, which may not always hold [6]. Advanced ML algorithms are often a promising alternative to statistical approaches. The advantage of ML algorithms is that they learn to represent complex relationships in a data-driven manner and are often non-parametric. The usefulness of ML algorithms has already been demonstrated in different areas of transportation research. For instance, ML algorithms have been used in modeling travel mode choice [7], freight mode choice [8], crash severity prediction [9], predicting the performance of asphalt mixtures [10], and freight demand forecasting [11]. Hagenauer and Helbich [7] conducted a comparative analysis of seven machine learning classifiers for modeling travel mode choice. Uddin et al. [8] explored eight machine learning classifiers, using 2012 Commodity Flow Survey data, for modeling freight mode choice. Iranitalab and Khattak [9] compared four statistical and machine learning methods for prediction of crash severity. Rahman et al. [10] explored machine learning methods to predict two metrics of asphalt mixture performance. Lastly, Salais-Fierro and Martínez [11] applied machine learning methods for demand forecasting in freight transportation.
Compared to the referenced existing studies, the major contributions of this paper include the following:
  • Evaluation of seven commonly used ML algorithms (i.e., Lasso, Decision Tree, Random Forest, Gradient Boosting, Support Vector, Gaussian Process, and Multi-layer Perceptron regressions), along with Ordinary Least Squares (OLS) regression, with statistical tests on model performance
  • Comprehensive scope of industry types—covered 45 North American Industry Classification System (NAICS) codes
  • Inclusion of receipts total as an explanatory variable
  • Industry-specific model selection—extensive model selection process considering model approach (ML algorithms), use of logarithm, and full combination of variable selection for each industry type

3. Data Sources

3.1. Dependent Variables—Freight Generation Data (Tonnage and Value)

The term “freight generation” is used differently across studies. To clarify the FG modeling used in our study, “freight generation” is defined as the tonnage or value of freight shipments generated in a region by each industry type in association with its business activities. Note that our study does not estimate the number of shipments or the number of truck trips, which are considered FTG. As discussed in Section 2, FG models, compared to FTG models, might better represent regional- or national-level economic activities since they reflect the intensity of production and consumption of goods.
In addition, the term “freight generation” indicates that the study covers the estimation of shipments by both origins (freight production modeling) and destinations (freight attraction modeling). In freight planning, “freight generation” is the step prior to “freight distribution” (not covered by this study), which combines the estimated shipments by origins and destinations and produces estimates for each origin-destination pair. The following two dependent variables are used for both the production (by origins) and the attraction (by destinations) modeling in our study, based on the 2017 CFS data [2].
  • tonnage: total weight (in thousand tons) of shipments originated from (terminated to) a region by industry
  • value: total value (in million dollars) of shipments originated from (terminated to) a region by industry
Note that there are variations associated with sampling and other reporting errors that may have been incurred during the survey. Due to data confidentiality and data quality standards, the Census Bureau suppressed tonnage and/or value for certain records in the public release of the CFS data. Although another publicly available U.S. nationwide freight dataset exists, i.e., the Freight Analysis Framework (FAF) [1], it was not considered in this study since the FAF data do not provide industry type information. The descriptions of all 45 NAICS codes covered in this study are presented in Table A1.

3.2. Independent/Explanatory Variables—Economic/Industry Data

To develop reasonable FG models, many explanatory variables could be obtained from public/private data sources or derived using additional data processing. In this study, we used indicators that represent economic or business activity, which are commonly used in FG modeling studies. In addition, to potentially apply the FG models to disaggregate the CFS data, we need the input data at a more granular level of geography (e.g., county). With such considerations, the study utilized two county-level industry data products from the Census Bureau, i.e., the Economic Census (EC) [18] and County Business Patterns (CBP) data [19].
The CBP data, which are part of the EC data program, are published annually between the five-year-interval EC data releases. The main difference is that the EC tables provide additional business/economy information such as the receipt total (sales, revenue, or shipments) by industry, whereas the CBP provides the number of employees, number of establishments, and annual payroll. All the in-scope industries in the CBP, as the name indicates, are provided at the county level, whereas a few industries in the EC are provided only at the state or selected-geography level. As the study evaluates FG modeling in terms of tonnage and value, only the industry types covered in the 2017 CFS data were considered in the EC and CBP tables as well. Among the NAICS codes covered in our study, only two industry types, i.e., NAICS 212 (mining except oil and gas) and NAICS 551114 (corporate, subsidiary, and regional managing offices), do not have the receipt total at the county level; therefore, the receipt total variable was not included for these two NAICS codes in our model selection process.
Like the 2017 CFS data, there is suppressed information in the CBP and EC tables as well. The imputation process is described in Section 5 (Data Processing and Model Selection). The following is a list of explanatory variables used in our study:
  • 2017 CBP: number of establishments (ESTAB), number of employees (EMP), and annual payroll (PAYANN)
  • 2017 EC: receipt total (RCPTOT) that is the total value of sales, revenue, or shipments

3.3. Shipments by Destinations (Freight Attraction)

The aforementioned explanatory and dependent variables are applied in the same way for modeling both freight production (shipments by origins) and freight attraction (shipments by destinations), except for one additional step required for the freight attraction modeling: deriving origin-industry-specific input variables with respect to destinations, since the original CBP and EC data are provided by origin industries. The authors utilized the mapping of industry-to-industry shares from the U.S. Bureau of Economic Analysis (BEA) “Make and Use” tables, following the same procedure as applied by Oliveira-Neto et al. [5]. The following equation represents the industry-to-industry mapping for deriving input variables for each industry’s freight attraction model:
$$X_d^i = \sum_j \omega_{ij} X_d^j$$
where,
$X_d^i$ is the derived input variable for destination $d$ and industry $i$, obtained as a linear combination of the shares of origin (make) industry $i$ to destination (use) industry $j$;
$X_d^j$ is the original (origin-industry) input variable for destination $d$ and industry $j$;
$\omega_{ij}$ is the share of origin industry $i$ to destination industry $j$, obtained from the BEA Make and Use tables.

3.4. Descriptive Statistics of Input Data

Table 2 shows the descriptive statistics of the input data for shipments by origins (freight production modeling), after combining the EC and CBP data with the 2017 CFS data. For each variable, the mean and standard deviation are provided. For tonnage and value, the number of data points (sample size) is also presented, as it differs by NAICS. This is because the suppressed tonnage and value in the 2017 CFS data were excluded from this study. As a result, the sample size (N) is smaller than 132 (the number of CFS areas) for most NAICS codes. There is suppressed information in the EC/CBP data as well, but it was imputed at the county level before merging with the 2017 CFS data. A more detailed description of the data processing is provided in Section 5 (Data Processing and Model Selection).
Similarly, Table 3 shows the descriptive statistics of the input data for shipments by destinations (freight attraction modeling). Note that the sample size (N) in Table 3 is 132 (the number of CFS areas) for all NAICS codes. This reflects that commodity shipments for each industry could be limited to certain origin areas but can be shipped to any destination zone. Note that the NAICS codes not included in the BEA Make and Use table were excluded from this study, as that information is required to derive the input variables for the freight attraction modeling (estimating shipments by destinations; refer to Section 3.3). The excluded NAICS codes for shipments by destinations are NAICS 4233, 4235, 4237, 4239, 4243, 4245, 4246, 4248, 4249, and 45431.

4. Machine Learning Algorithms

Seven commonly used machine learning algorithms (i.e., Lasso, Decision Tree, Random Forest, Gradient Boosting, Support Vector, Gaussian Process, and Multi-layer Perceptron regressions), along with Ordinary Least Squares (OLS) regression, were considered for the comparison.

4.1. Ordinary Least Squares Regression (OLS, the Baseline)

As discussed previously, OLS regression is the most frequently used method in FG models. Therefore, the OLS method was used as the baseline to be compared with the other ML algorithms. As the OLS has the advantage of simplicity and interpretability over the other ML algorithms, the OLS is still suggested as the final model if the model performance improvements by the ML algorithms are not found to be statistically significant.
$$Y_{oi} = \alpha_i + \beta_i X_{oi} + \varepsilon_{oi}$$
where,
$Y_{oi}$ is the tonnage/value of freight shipments originated from origin $o$ and industry $i$;
$X_{oi}$ is the set of explanatory variables for origin $o$ and industry $i$;
$\alpha_i$ and $\beta_i$ are the parameter estimates of the linear regression model for industry $i$; $\varepsilon_{oi}$ is the error term.

4.2. Least Absolute Shrinkage and Selection Operator (Lasso)

Least absolute shrinkage and selection operator (Lasso) is a regression technique that performs both variable selection and regularization, by “shrinking” coefficients of regression models, to enhance the estimation accuracy while providing the interpretability of typical linear regression models. The Lasso shrinks coefficients by adding the penalty term to the residual sum of squares (RSS) that is to be minimized in the OLS.
$$\text{Minimize} \; \left[ RSS + \lambda \sum_i |\beta_i| \right]$$
The model flexibility decreases as $\lambda$ increases, leading to smaller variance but larger bias. This is especially useful to mitigate overfitting, which is frequently observed for small-sample-size data. In the Lasso module of the Python scikit-learn library (version 0.24.2) [20] used in this study, $\lambda$ is controlled by the hyperparameter named “alpha”. The Lasso method evaluated for the FG models in our study used “alpha” values ranging from 0 to 0.02 in increments of 0.002.
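To make the sweep concrete, the following is a minimal sketch of the alpha grid described above, using scikit-learn's GridSearchCV as one possible way to pick the lowest-RMSE setting; the feature matrix X and target y are assumed to be already prepared, and this is an illustration rather than the authors' exact pipeline.

```python
# A hedged sketch of the Lasso alpha sweep; X and y are assumed to be the
# normalized explanatory variables and the tonnage/value target (not shown).
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# alpha from 0 to 0.02 in increments of 0.002 (alpha = 0 reduces to OLS)
param_grid = {"alpha": np.arange(0.0, 0.0201, 0.002)}

lasso_search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid,
    scoring="neg_root_mean_squared_error",  # select by lowest RMSE
    cv=4,
)
# lasso_search.fit(X, y); lasso_search.best_params_ gives the selected alpha
```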

4.3. Decision Tree Regression (DTR)

The Decision Tree is one of the popular non-parametric supervised learning methods used for classification and regression, depending on whether the dependent variable is categorical or continuous (like tonnage/value in this study). The main advantages of Decision Tree regression (DTR) are: (1) the splits of nodes are unbiased; (2) each node contains a single model fit, which makes the model result relatively easy to interpret; and (3) there are fewer limitations for applying the residuals, including generalized least squares.
Using the DecisionTreeRegressor module in the Python scikit-learn library [20], the following hyperparameter settings were evaluated and the hyperparameter setting with the lowest Root Mean Square Error (RMSE) was used for each model selection by DTR.
  • Maximum depth of the tree (max_depth): [1, 2, 3, 4, 5]
  • Complexity parameter used for the minimal cost-complexity pruning (ccp_alpha): [0, 0.002, …, 0.018, 0.02]
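As an illustration only, the sketch below enumerates the DTR grid listed above with scikit-learn's GridSearchCV; the same pattern applies to the RFR and GBR grids in the next two subsections. The variables X and y, the use of GridSearchCV, and the 4-fold split are assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Grid from the bullet list above: tree depth and cost-complexity pruning
param_grid = {
    "max_depth": [1, 2, 3, 4, 5],
    "ccp_alpha": np.arange(0.0, 0.0201, 0.002),
}
dtr_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # lowest RMSE wins
    cv=4,
)
# dtr_search.fit(X, y)  # X, y: normalized features and tonnage/value target
```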

4.4. Random Forest Regression (RFR)

The Random Forest is an ensemble learning method for supervised learning, designed to improve model accuracy by randomly constructing multiple decision trees, rather than just one tree. Random Forest regression (RFR) is simply an ensemble of multiple random regression trees for the continuous dependent variables. The Random Forest is known to produce highly accurate estimation results for large sample sizes. Once the model is trained, the prediction process is relatively efficient, significantly faster than the training speed.
Using the RandomForestRegressor module in the Python scikit-learn library [20], the following hyperparameter settings were evaluated and the hyperparameter setting with the lowest RMSE was used for each model selection by RFR.
  • Maximum depth of the tree (max_depth): [1, 2, 3, 4, 5]
  • Complexity parameter used for the minimal cost-complexity pruning (ccp_alpha): [0, 0.002, …, 0.018, 0.02]
  • Number of trees in the forest (n_estimators): 10

4.5. Gradient Boosting Regression (GBR)

Gradient Boosting is another ensemble learning technique that forms multiple decision trees sequentially accounting for weak predictions of the previous decision trees. Specifically, the next trees are trained on the weighted data where more weights are assigned for the observations that were more difficult to estimate or classify in the previous iteration. If the sample size is sufficient for the training, the Gradient Boosting can outperform the Random Forest.
Using the GradientBoostingRegressor module in the Python scikit-learn library [20], the following hyperparameter settings were evaluated and the hyperparameter setting with the lowest RMSE was used for each model selection by GBR.
  • Maximum depth of the tree (max_depth): [1, 2, 3, 4, 5]
  • Complexity parameter used for the minimal cost-complexity pruning (ccp_alpha): [0, 0.002, …, 0.018, 0.02]
  • Learning rate to control the contribution of each tree (learning_rate): [0.01, 0.1, 1]
  • Number of the estimators (trees) (n_estimators): 10

4.6. Support Vector Regression (SVR)

Support Vector Machines (SVMs), which are more often used in classification, refer to a set of supervised learning methods for classification and regression that use a subset of the training data as decision points, also called “support vectors”. Support Vector regression (SVR) is generally advantageous for high-dimensional data and can be effective even when the number of dimensions is greater than the sample size. The choice of kernel functions and regularization parameters can be critical to avoid over-fitting in SVR.
Using the SVR module in the Python scikit-learn library [20], the following hyperparameter settings were evaluated and the hyperparameter setting with the lowest RMSE was used for each model selection by SVR.
  • Margin of tolerance where no penalty is given to errors (epsilon): [0, 0.002, …, 0.018, 0.02]
  • Regularization parameter (C): [0.1, 0.3, …, 1.9, 2.1]
  • Kernel distribution type to be used in the algorithm (kernel):
    [Linear, Polynomial, Gaussian (RBF), Sigmoid]
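The sketch below spells out the SVR grid listed above; the scikit-learn kernel names ("linear", "poly", "rbf", "sigmoid") correspond to the four kernel types in the text, and the surrounding GridSearchCV setup is an assumption rather than the authors' exact code.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

param_grid = {
    "epsilon": np.arange(0.0, 0.0201, 0.002),   # margin of tolerance
    "C": np.arange(0.1, 2.2, 0.2),              # 0.1, 0.3, ..., 2.1
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
}
svr_search = GridSearchCV(
    SVR(), param_grid, scoring="neg_root_mean_squared_error", cv=4
)
# svr_search.fit(X, y)  # X, y assumed as in the previous sketches
```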

4.7. Gaussian Process Regression (GPR)

Gaussian Process regression (GPR) is an extension of linear regression, where a “Gaussian Process” is a collection of random variables such that every finite linear combination of them is normally distributed. During the model fitting of GPR, the hyperparameters of the kernel are optimized to maximize the log-marginal-likelihood based on the passed optimizer. One of the main advantages of the Gaussian Process is that the estimates are provided in probabilistic form, so that their empirical confidence intervals can also be obtained.
Using the GaussianProcessRegressor module in the Python scikit-learn library [20], the following hyperparameter settings were evaluated and the hyperparameter setting with the lowest RMSE was used for each model selection by GPR.
  • Constant value added to the diagonal of the kernel matrix (alpha): [1 × 10−11, 1 × 10−10, 1 × 10−9]
  • Kernel specifying the covariance function of the model (kernel): the sum of two kernels, the Dot-Product kernel and the White kernel
  • For the Dot-Product kernel (DotProduct), the parameter sigma controlling the inhomogeneity of the kernel: [0.5, 1.0, 1.5]
  • For the White kernel (WhiteKernel), the parameter noise_level controlling the noise level of the kernel: [0.5, 1.0, 1.5]
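The kernel combination above can be expressed as the sum of scikit-learn's DotProduct and WhiteKernel objects, as in the sketch below; note that the "sigma" parameter in the text corresponds to sigma_0 in scikit-learn, and the grid-search wrapper is again an illustrative assumption.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
from sklearn.model_selection import GridSearchCV

param_grid = {
    "alpha": [1e-11, 1e-10, 1e-9],   # value added to the kernel diagonal
    # sum of the two kernels; sigma_0 is the text's "sigma"
    "kernel": [
        DotProduct(sigma_0=s) + WhiteKernel(noise_level=n)
        for s in (0.5, 1.0, 1.5)
        for n in (0.5, 1.0, 1.5)
    ],
}
gpr_search = GridSearchCV(
    GaussianProcessRegressor(),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=4,
)
# gpr_search.fit(X, y)  # X, y assumed as before
```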

4.8. Multi-Layer Perceptron Regression (MLP)

The Multi-layer Perceptron (MLP) is a class of feedforward artificial neural networks, where “multi-layer” refers to the use of at least three layers: an input layer, hidden layer(s), and an output layer. The MLP utilizes backpropagation for training, and different activation functions can be used for the hidden layers. The MLP is known to require a relatively large data size for training.
Using the MLPRegressor module in the Python scikit-learn library [20], the following hyperparameter settings were evaluated and the hyperparameter setting with the lowest RMSE was used for each model selection by MLP.
  • Hidden layer size and number of neurons in each hidden layer (hidden_layer_sizes)
  • Number of hidden layers: [1, 2, 3]
  • Number of neurons in each hidden layer: [3, 4, 5]
  • L2 penalty parameter (alpha): [0.00001, 0.0001, 0.001]
  • Activation function for the hidden layer (activation): [Identity, Logistic, Rectified Linear Unit (ReLU)]
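In scikit-learn, the hidden-layer choices above map to tuples passed to hidden_layer_sizes; the sketch below enumerates them assuming the same number of neurons in every hidden layer (one possible reading of the grid), with the other settings illustrative rather than the authors' exact configuration.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

# 1-3 hidden layers with 3-5 neurons each (same width per layer assumed)
layer_options = [tuple([n] * depth) for depth in (1, 2, 3) for n in (3, 4, 5)]

param_grid = {
    "hidden_layer_sizes": layer_options,
    "alpha": [1e-5, 1e-4, 1e-3],                     # L2 penalty
    "activation": ["identity", "logistic", "relu"],
}
mlp_search = GridSearchCV(
    MLPRegressor(max_iter=2000, random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=4,
)
# mlp_search.fit(X, y)  # X, y assumed as before
```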

5. Data Processing and Model Selection

5.1. Imputation of Missing Data (for CBP/EC)

For the records by origin (for all 132 CFS areas) and industry (for the NAICS codes covered in this study) in the published 2017 CFS tables, about 19% of tonnage and 8% of value records are suppressed due to sampling variability. These suppressed tonnage and value records were excluded from the evaluation in our study because different imputation methods could inadvertently affect the modeling results, especially when evaluating different modeling approaches.
The county-level CBP and EC data also have suppressed information for the number of employees, annual payroll, number of establishments, and receipt total. Unlike the other variables, when the exact number of employees is suppressed, the CBP data provide an employment size range (EMPFLAG). Therefore, in the first step, we imputed the number of employees using the mid-point of the employment size range.
After the number of employees was imputed, the suppressed values of the remaining variables were imputed based on the ratio of each attribute to the number of employees computed from the records with known values. This imputation was conducted on the county-level data for each NAICS code. Once the imputation was completed, the county-level data were aggregated to the CFS-area level to be merged with the 2017 CFS data.
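The following is a minimal sketch of this two-step, ratio-based imputation in pandas, assuming a county-level DataFrame per NAICS code with columns EMP, ESTAB, PAYANN, and RCPTOT and an already-derived column emp_midpoint (a hypothetical name) holding the mid-point of the EMPFLAG size range; the actual processing pipeline is not reproduced here.

```python
import pandas as pd

def impute_by_employment_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the ratio-based imputation; column names are assumptions."""
    df = df.copy()
    # Step 1: fill suppressed employment with the EMPFLAG range mid-point
    df["EMP"] = df["EMP"].fillna(df["emp_midpoint"])
    # Step 2: impute the remaining variables with the attribute-to-employment
    # ratio observed on records where both values are known
    for col in ("ESTAB", "PAYANN", "RCPTOT"):
        known = df[col].notna() & df["EMP"].notna()
        ratio = (df.loc[known, col] / df.loc[known, "EMP"]).mean()
        df[col] = df[col].fillna(df["EMP"] * ratio)
    return df
```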

5.2. Data Transformation

FG modeling may require transformation of the input data to improve accuracy, since the relationships may not be linear in the original units. In this study, we evaluated the model performance using either the original input data or log-transformed values. The following equations describe the case where both the explanatory variables and the dependent variable are transformed with the natural logarithm.
$$\log(Y_{oi}) = \alpha_i + \beta_i \log(X_{oi}) + \varepsilon_{oi}$$
$$Y_{oi} = \exp(\alpha_i + \varepsilon_{oi}) \cdot X_{oi}^{\beta_i}$$
For more comparable model selection evaluation, the final model results were converted to the original units of tonnage and value if log-transformed.
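As a small illustration under the same notation, the sketch below fits an arbitrary scikit-learn-style regressor on natural-log-transformed inputs and returns predictions converted back to the original tonnage/value units; the helper name is hypothetical and not part of the authors' code.

```python
import numpy as np

def fit_log_model(model, X, y):
    """Fit `model` on log-transformed data and return a predictor that
    converts estimates back to the original tonnage/value units."""
    model.fit(np.log(X), np.log(y))
    return lambda X_new: np.exp(model.predict(np.log(X_new)))

# usage sketch: predict_fn = fit_log_model(LinearRegression(), X, y)
#               y_hat = predict_fn(X_new)   # already in original units
```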

5.3. Normalization

Normalization is the process of rescaling the input data to a similar range or distribution across different attributes. In our study, this is an essential step to improve model stability and performance, especially for more complex models such as the MLP. Normalization is also helpful for interpreting the importance of variables based on regression parameters, which would otherwise be on different scales. In our study, a simple min-max normalization was used, where the min value was set to zero for all cases. As such, the normalized value in our study is obtained simply by dividing the original value by the maximum value of each attribute.
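This max-only scaling amounts to a one-line operation; a minimal sketch, assuming a NumPy feature matrix with strictly positive column maxima, is shown below.

```python
import numpy as np

def max_normalize(X: np.ndarray) -> np.ndarray:
    # divide each attribute (column) by its maximum so values fall in [0, 1];
    # the minimum is implicitly treated as zero, as described in the text
    return X / X.max(axis=0)
```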

5.4. Variable Selection

There are many different techniques for variable selection, such as forward, backward, and stepwise selection. However, these techniques are heuristic approaches in that they choose or change the subset of candidate variables based on the previous selection result. In our study, the number of explanatory variables is only four, except for NAICS 212 and 551114 (only three, without the receipt total). Therefore, instead of applying such variable selection techniques, this study evaluated all possible variable combinations among the four (or three) independent variables. The maximum number of possible combinations is 15 = 2^4 - 1 with four independent variables, excluding the one case in which no independent variable is selected.
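The exhaustive enumeration can be written in a few lines; the sketch below lists all 15 non-empty subsets of the four predictors (variable names follow Section 3.2) and is illustrative rather than the authors' code.

```python
from itertools import combinations

variables = ["ESTAB", "EMP", "PAYANN", "RCPTOT"]
subsets = [
    list(combo)
    for r in range(1, len(variables) + 1)
    for combo in combinations(variables, r)
]
# len(subsets) == 15; for NAICS 212 and 551114, drop RCPTOT first (7 subsets)
```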

5.5. Optimization of Hyperparameters

For each modeling approach, different hyperparameter settings could yield substantially different model performance. Therefore, the authors attempted to test the various hyperparameters discussed in Section 4 (Machine Learning Algorithms). Then, the hyperparameter setting with the lowest RMSE among all tested settings was selected for each modeling approach and variable selection. Note that not all possible hyperparameters, such as the minimum number of samples required at a leaf node in Decision Tree regression, were tested in this study.

5.6. Model Performance—Error Measurements

Three metrics were used for the model performance evaluation in this study:
  • Root Mean Square Error (RMSE): the square root of arithmetic mean of the squared difference between the 2017 CFS tonnage/value and the estimated tonnage/value
  • Mean Absolute Error (MAE): the arithmetic mean of the absolute difference between the 2017 CFS tonnage/value and the estimated tonnage/value
  • R-squared: the R-squared (or coefficient of determination) statistic between the 2017 CFS tonnage/value and the estimated tonnage/value
Both the RMSE and the MAE are commonly used to measure the accuracy of continuous variables (i.e., tonnage and value). The study used the RMSE as the primary metric to determine the final model selection by NAICS, because the OLS regression (baseline model) is fitted by minimizing the sum of squared errors. In other words, using the MAE as the primary metric could bias the final model selection toward preferring one of the alternative ML models over the OLS.
All three metrics were evaluated based only on the validation sets. Note that the validation sets are based on K-fold cross-validation where K is 4, with 25 repeats. Therefore, there are 100 validation sets and associated performance metrics observed for each model selection. The K-fold cross-validation is especially useful when dealing with a small dataset (N ≤ 132 for each industry), since the data are split into K folds so that all parts of the data are used equally as part of the validation sets.
Furthermore, the unadjusted R-squared was used instead of the adjusted R-squared, which corrects the unadjusted R-squared for the case with multiple predictors. This is because the R-squared was obtained only from the validation sets, not the training sets, so the model complexity is already accounted for in the validation-set estimates. Finally, alternative models, other than OLS, are selected only when the reduction in RMSE is statistically significant under both the paired t-test and the Wilcoxon test at a p-value of 0.05.
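The evaluation scheme described in this subsection can be sketched as follows, assuming NumPy arrays X and y and scikit-learn-style estimators; the fixed random_state keeps the 100 folds identical across models so the paired tests compare like with like. This is an illustration of the described procedure, not the authors' exact implementation.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

def fold_rmse(model, X, y):
    """RMSE on each of the 100 validation folds (4 folds x 25 repeats)."""
    cv = RepeatedKFold(n_splits=4, n_repeats=25, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    return -scores

def significantly_better(candidate, X, y, p_threshold=0.05):
    """True if the candidate reduces RMSE vs. OLS under both paired tests."""
    rmse_ols = fold_rmse(LinearRegression(), X, y)   # OLS baseline
    rmse_alt = fold_rmse(candidate, X, y)
    t_p = ttest_rel(rmse_alt, rmse_ols).pvalue
    w_p = wilcoxon(rmse_alt - rmse_ols).pvalue
    improved = rmse_alt.mean() < rmse_ols.mean()
    return improved and t_p < p_threshold and w_p < p_threshold
```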

6. Model Results

To better illustrate the model selection process, we first present an example for the case of estimating tonnage of shipments by origins for NAICS 212. Then, the following sections discuss the significance of the model improvement and summarize the final model for each industry.

6.1. Example of Model Selection—Tonnage of Shipments by Origins for NAICS 212

The model selection considered different aspects, i.e., variable selection, log-transform, and ML algorithm, as well as the hyperparameter settings of each ML algorithm. For an easier understanding of the model selection choices considered for each NAICS code, Table 4 presents an example of model selection with the associated RMSEs for estimating tonnage of shipments by origins for NAICS 212.
As presented in Table 4, there is a total of 112 choices for the model selection of NAICS 212: 2 choices for the log-transform, 2 choices for each variable (excluding the case with no explanatory variable), and 8 different algorithms. In fact, there is no county-level receipt total information (available only at the state level) from the EC table for NAICS 212 and 551114, but all the other industry types in this study have county-level receipt total information. Therefore, the total number of possible model selections was 240 for industries other than NAICS 212 and 551114.
In addition to the summarized model selection in Table 4, the different hyperparameter settings were tested as well and then only the hyperparameter settings with the lowest RMSEs were presented in Table 4. For the NAICS 212 tonnage estimation, the SVR model with number of employees and number of establishments was selected as the best alternative model since it yielded the lowest RMSE among all the options.
Finally, the alternative ML algorithm was suggested as the final model only when the reduction of RMSE over OLS is statistically significant with paired T-test and Wilcoxon statistics, as shown in Table 5.

6.2. Significance of Model Performance Improvement by Industry

Figure 1 shows the box plots of three model performance measurements based on 100 validation sets (K-fold cross validation where K is 4, with 25 times of repeats) for two dependent variables: (a) tonnage of shipments by origins for NAICS 333 and (b) tonnage of shipments by destinations for NAICS 337. The two cases were chosen intentionally to provide two distinguishable examples of “with” versus “without” significant improvement by the alternative ML algorithms. More specifically, the first example on the left side (Figure 1a) shows the case where the alternative ML algorithm does improve the model performance significantly, whereas none of the alternative ML models showed statistically significant improvement over OLS for the example on the right side (Figure 1b).
Since the RMSEs and MAEs may not be directly comparable across different NAICS codes, the RMSE and MAE of each ML algorithm were compared to those of the OLS in relative terms. The relative RMSE/MAE is calculated as the difference in RMSE/MAE between each ML algorithm and the OLS divided by the RMSE/MAE of the OLS. The dotted line in Figure 1 represents the arithmetic mean of each performance metric for the baseline algorithm, OLS.
For estimating tonnage of shipments by origins for NAICS 333 (Figure 1a), the SVR algorithm clearly shows that the third quartiles (i.e., upper bound of the colored box) for both RMSE and MAE are lower than the average RMSE/MAE by OLS. In addition, the first quartiles (i.e., lower bound of the colored box) for the R-squared values are also higher than the average R-squared by OLS method.
In contrast, for estimating tonnage of shipments by destinations for NAICS 337 (Figure 1b), all the ML algorithms have the third quartile of both RMSE and MAE higher than the average RMSE/MAE of the OLS. For the R-squared of the NAICS 337 tonnage estimation, the median (i.e., the mid-line inside the box) of the R-squared values by DTR, RFR, GBR, and MLP is even lower than the average R-squared value of the OLS.
In addition to the visual investigation of the box plots in Figure 1, statistical tests were conducted to evaluate the significance of model performance improvements. Specifically, as shown in Table 5 and Table 6, two statistical tests, paired t-test and Wilcoxon, for the difference of RMSE between the OLS and the alternative best ML method for each NAICS were evaluated with a significance level of p-value 0.05. The alternative method was suggested as the final model only when both of the test statistics show significant improvements of RMSE, as compared to the RMSE by OLS. Note that no alternative ML methods are provided in Table 5 and Table 6, where OLS performed better than all the seven alternative ML methods. Overall, both t-test and Wilcoxon statistics yield fairly consistent conclusions in terms of which industry types were improved significantly over the OLS by applying the alternative ML method.
As shown in Table 5 and Table 6, about 57% of the cases for estimating shipments by origins show a statistically significant reduction in RMSE, while 67% of the cases for estimating shipments by destinations show a statistically significant improvement. Overall, for the cases where the alternative ML methods bring a statistically significant improvement, the RMSE reduction ranges from 0.1% to 30.6%.

6.3. Summary of Best Model by Industry

Table 7 summarizes the final model suggestion for each NAICS code, which was determined based on the significance tests on Table 5 and Table 6. For each NAICS code and measurement, the final model algorithm along with its variable selection and use of log-transformation is provided.
Overall, as shown in Table 7, the SVR was selected as the best model for 52 NAICS tonnage/value cases (58%) for the estimation of shipments by origins (freight generation). For the estimation of shipments by destinations (freight attraction), both the SVR and GPR were selected as the best model for 29 cases (41%) each.
The OLS, selected only when none of the seven alternative ML algorithms showed the significant reduction in RMSE, was selected for 23 cases (26%) for the estimation of shipments by origins (freight generation) and for only 3 cases (4%) for the estimation of shipments by destinations (freight attraction). The MLP, which can be arguably considered as the most complex model among the eight models, was selected for only one case for estimating tonnage of shipments by destinations for NAICS 316.
In terms of the variable selection, the number of employees (EMP) was included most often, in 54 cases (60%), for estimating shipments by origins (freight generation). For the estimation of shipments by destinations (freight attraction), the number of establishments (ESTAB) was included most often, in 48 cases (69%), in the final model selection. In addition, the results show that the log-transform improves the overall model performance for 41 cases (46%) in the estimation of shipments by origins and for 39 cases (56%) in the estimation of shipments by destinations. In addition, the receipt total (RCPTOT), which was not considered in any of the referenced studies, was included in 49% of the final models. Note that this is only a summary of how many times each variable was selected across all 45 NAICS codes. As discussed in Section 5.4 and Section 5.6, the variable selection was determined by the RMSE of the validation sets, considering all possible combinations.

6.4. Discussions in Model Interpretability

Oftentimes, a regression-based modeling approach can be expressed in an explicit equational form representing the relationship between the dependent variable and the independent variables. The following two equations, by OLS and Lasso, show an example of the final model for estimating the value of shipments by origins for NAICS 4239. Note that the estimated coefficient for the number of employees variable (EMP) is smaller in the Lasso regression (1.071) than the corresponding coefficient estimate in the OLS regression (1.096).
OLS Regression: $\widehat{Value}_{4239} = \exp(0.058) \cdot EMP_{4239}^{1.096}$
Lasso: $\widehat{Value}_{4239} = \exp(0.111) \cdot EMP_{4239}^{1.071}$
This straightforward interpretability is one of the clear advantages of utilizing simple regression-based modeling approaches, such as OLS and Lasso. However, one can choose more complex models with higher model performance if the model performance (the focus of this study) is more important for the application. In addition, such complex models can still provide insights into which factors affect the tonnage and value of shipments more, by exploring the variable importance. For example, Figure 2 shows the variable importance for the Support Vector regression (SVR) model that estimates the value of shipments by origins for NAICS 322. The variable importance was calculated with the permutation feature importance technique, where we measured the decrease in R-squared when a single feature value is randomly shuffled. In this case, the annual payroll (PAYANN) appears to have the largest impact on the model estimates.
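A minimal sketch of this permutation-based importance, using scikit-learn's permutation_importance with the R-squared scorer, is shown below; the function name and settings are illustrative, not the authors' exact code.

```python
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

def svr_variable_importance(X, y, **svr_kwargs):
    """Mean decrease in R-squared when each feature is shuffled."""
    model = SVR(**svr_kwargs).fit(X, y)
    result = permutation_importance(
        model, X, y, scoring="r2", n_repeats=10, random_state=0
    )
    return result.importances_mean  # one value per feature (e.g., PAYANN)
```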

7. Conclusions

This study explored eight models, i.e., Ordinary Least Squares (OLS) regression and Lasso, Decision Tree, Random Forest, Gradient Boosting, Support Vector, Gaussian Process, and Multi-layer Perceptron regressions, applied to the FG models by industry type (NAICS code). The seven alternative ML algorithms, which have been commonly used for regression but not often in FG modeling, were evaluated to determine whether their model performance improvement over the OLS is significant. Overall, the Support Vector regression was selected most often as the best modeling approach for the estimation of shipments by origins, while the Support Vector regression and the Gaussian Process regression were equally often selected as the best modeling approach for the estimation of shipments by destinations. Combining all the cases of shipments by origins and destinations, the RMSE reductions (compared to OLS), ranging from 0.1% to 30.6%, are statistically significant for 134 cases (84%) under both the paired t-test and the Wilcoxon test.
The following summarizes the key contributions of this study:
  • Built a framework to conduct the industry-specific model selection, i.e., the variable selection, log-transform, and algorithm.
  • Evaluated the significance of model improvements when using the alternative ML algorithms over the OLS for the FG modeling.
  • Suggested the use of OLS regression for certain NAICSs if the RMSE reductions by the alternative ML algorithms are not statistically significant.
  • Considered all possible variable combinations from the four variables in the CBP and EC data tables.
  • Covered all the NAICS codes from the 2017 CFS data and estimated tonnage/value of freight shipments by both origins (generation) and destinations (attraction).
Although the study focused on model performance in applying ML algorithms to the FG models, the simplicity and interpretability of a modeling approach could be more important depending on the application. This is one of the main reasons why an alternative ML algorithm is selected over OLS only when the improvement is statistically significant. Note that most complex ML models cannot be expressed in explicit equational forms, but their variable importance can still be obtained, as discussed in Section 6.4 (Discussions in Model Interpretability).
The scope of this study is limited to estimating tonnage and value of the freight shipments by industry type (NAICS codes). The proposed model selection results could be quite different when different dependent variables, such as truck volume and number of shipments, are to be estimated.
Furthermore, more variables, such as population, GDP, access to ports, network access/length by mode, and land use, could be considered to improve model performance depending on industry types and data availability. Additionally, note that not all hyperparameters were evaluated for each ML algorithm, meaning that there may be further improvements possible with hyperparameter settings not considered in this study. The authors expect that more complex algorithms, such as Random Forest, Gradient Boosting, and Multi-layer Perceptron regressions, are more likely to outperform the OLS with larger training data (e.g., data at the establishment level or a more granular level of geography). Overall, the authors believe that future research in FG modeling can focus on the following areas:
  • Applying the proposed framework with use case of disaggregating freight data into more granular level of geography (e.g., county-level freight data).
  • Using other external/private data sources to reveal the relationship between economic activity and associated freight shipments at the individual business level.
  • Expanding the model framework to forecasting future freight demand by industry type.

Author Contributions

Conceptualization, H.L., M.U., S.-M.C. and H.-L.H.; Data curation, H.L.; Formal analysis, H.L. and M.U.; Methodology, H.L.; Writing—original draft, H.L., M.U. and Y.L.; Writing—review and editing, H.L., M.U. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research effort was sponsored by the Federal Highway Administration (FHWA) and the Bureau of Transportation Statistics (BTS), under U.S. Department of Transportation, through the project titled “Design and Development of Statistical Models and Freight Data”, grant number 2116-Z239-18.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1 summarizes the description of 45 North American Industry Classification System (NAICS) codes used in this study.
Table A1. Description of 45 NAICS Codes Used in the Study.

| NAICS Code | Description |
|---|---|
| 212 | Mining (except oil and gas) |
| 311 | Food manufacturing |
| 312 | Beverage and tobacco product manufacturing |
| 313 | Textile mills |
| 314 | Textile product mills |
| 315 | Apparel manufacturing |
| 316 | Leather and allied product manufacturing |
| 321 | Wood product manufacturing |
| 322 | Paper manufacturing |
| 323 | Printing and related support activities |
| 324 | Petroleum and coal products manufacturing |
| 325 | Chemical manufacturing |
| 326 | Plastics and rubber products manufacturing |
| 327 | Nonmetallic mineral product manufacturing |
| 331 | Primary metal manufacturing |
| 332 | Fabricated metal product manufacturing |
| 333 | Machinery manufacturing |
| 334 | Computer and electronic product manufacturing |
| 335 | Electrical equipment, appliance, and component manufacturing |
| 336 | Transportation equipment manufacturing |
| 337 | Furniture and related product manufacturing |
| 339 | Miscellaneous manufacturing |
| 4231 | Motor vehicle and motor vehicle parts and supplies merchant wholesalers |
| 4232 | Furniture and home furnishing merchant wholesalers |
| 4233 | Lumber and other construction materials merchant wholesalers |
| 4234 | Professional and commercial equipment and supplies merchant wholesalers |
| 4235 | Metal and mineral (except petroleum) merchant wholesalers |
| 4236 | Household appliances and electrical and electronic goods merchant wholesalers |
| 4237 | Hardware, plumbing and heating equipment and supplies merchant wholesalers |
| 4238 | Machinery, equipment, and supplies merchant wholesalers |
| 4239 | Miscellaneous durable goods merchant wholesalers |
| 4241 | Paper and paper product merchant wholesalers |
| 4242 | Drugs and druggists’ sundries merchant wholesalers |
| 4243 | Apparel, piece goods, and notions merchant wholesalers |
| 4244 | Grocery and related product merchant wholesalers |
| 4245 | Farm product raw material merchant wholesalers |
| 4246 | Chemical and allied products merchant wholesalers |
| 4247 | Petroleum and petroleum products merchant wholesalers |
| 4248 | Beer, wine, and distilled alcoholic beverage merchant wholesalers |
| 4249 | Miscellaneous nondurable goods merchant wholesalers |
| 4541 | Electronic shopping and mail-order houses |
| 45431 | Fuel dealers |
| 4931 | Warehousing and storage |
| 5111 | Newspaper, periodical, book, and directory publishers |
| 551114 | Corporate, subsidiary, and regional managing offices |

References

  1. U.S. Department of Transportation. Bureau of Transportation Statistics and Federal Highway Administration, Freight Analysis Framework Version 5.4 (FAF5). Available online: https://www.bts.gov/faf (accessed on 25 October 2022).
  2. U.S. Department of Transportation. Bureau of Transportation Statistics and U.S. Department of Commerce, U.S. Census Bureau. 2017 Commodity Flow Survey. Available online: https://www2.census.gov/programs-surveys/cfs/data/2017 (accessed on 1 August 2022).
  3. Holguin-Veras, J.; Sarmiento, I.; Gonzalez-Calderon, C.A. Parameter Stability in Freight Generation and Distribution Demand Models in Colombia. Dyna 2011, 78, 16–20. [Google Scholar]
  4. Lim, R.; Qian, Z.S.; Zhang, H.M. Development of a Freight Demand Model with an Application to California. Int. J. Transp. Sci. Technol. 2014, 3, 19–38. [Google Scholar] [CrossRef] [Green Version]
  5. Oliveira-Neto, F.M.; Chin, S.M.; Hwang, H.L. Aggregate Freight Generation Modeling: Assessing Temporal Effect of Economic Activity on Freight Volumes with Two-Period Cross-Sectional Data. Transp. Res. Rec. 2012, 2285, 145–154. [Google Scholar] [CrossRef]
  6. Krisztin, T. Semi-Parametric Spatial Autoregressive Models in Freight Generation Modeling. Transp. Res. Part E Logist. Transp. Rev. 2018, 114, 121–143. [Google Scholar] [CrossRef]
  7. Hagenauer, J.; Helbich, M. A Comparative Study of Machine Learning Classifiers for Modeling Travel Mode Choice. Expert Syst. Appl. 2017, 78, 273–282. [Google Scholar] [CrossRef]
  8. Uddin, M.; Anowar, S.; Eluru, N. Modeling Freight Mode Choice Using Machine Learning Classifiers: A Comparative Study Using Commodity Flow Survey (CFS) Data. Transp. Plan. Technol. 2021, 44, 543–559. [Google Scholar] [CrossRef]
  9. Iranitalab, A.; Khattak, A. Comparison of Four Statistical and Machine Learning Methods for Crash Severity Prediction. Accid. Anal. Prev. 2017, 108, 27–36. [Google Scholar] [CrossRef] [PubMed]
  10. Rahman, S.; Bhasin, A.; Smit, A. Exploring the Use of Machine Learning to Predict Metrics Related to Asphalt Mixture Performance. Constr. Build. Mater. 2021, 295, 123585. [Google Scholar] [CrossRef]
  11. Salais-Fierro, T.; Martínez, A. Demand Forecasting for Freight Transport Applying Machine Learning into the Logistic Distribution. Mob. Netw. Appl. 2022, 27, 2172–2181. [Google Scholar] [CrossRef]
  12. Chin, S.M.; Hwang, H.L. National Freight Demand Modeling: Bridging the Gap Between Freight Flow Statistics and US Economic Patterns. In Proceedings of the 86th Annual Meeting of the Transportation Research Board, Washington, DC, USA, 21–25 January 2007. [Google Scholar]
  13. Novak, D.C.; Hodgdon, C.; Guo, F.; Aultman-Hall, L. Nationwide Freight Generation Models: A Spatial Regression Approach. Netw. Spat. Econ. 2011, 11, 23–41. [Google Scholar] [CrossRef]
  14. Bagighni, S. Volume Estimation Models for Generation and Attraction of Freight Commodity Groups Using Regression Analysis. Ph.D. Dissertation, The University of Alabama in Huntsville, Huntsville, AL, USA, 2012. [Google Scholar]
  15. Ha, D.H.; Combes, F. Building a Model of Freight Generation with a Commodity Flow Survey. In Commercial Transport; Springer International Publishing: Cham, Switzerland, 2016; pp. 23–37. [Google Scholar]
  16. Mommens, K.; Van Lier, T.; Macharis, C. Freight Demand Generation on Commodity and Loading Unit Level. Eur. J. Transp. Infrastruct. Res. 2017, 17, 1. [Google Scholar] [CrossRef]
  17. National Academies of Sciences, Engineering, and Medicine. NCFRP Report 37: Using Commodity Flow Survey Microdata and Other Establishment Data to Estimate the Generation of Freight, Freight Trips, and Service Trips: Guidebook; Transportation Research Board: Washington, DC, USA, 2016. [Google Scholar]
  18. U.S. Department of Commerce, U.S. Census Bureau. 2017 Economic Census Data. Available online: https://www.census.gov/programs-surveys/economic-census/year/2017/economic-census-2017/data.html (accessed on 1 August 2022).
  19. U.S. Department of Commerce, U.S. Census Bureau. 2017 County Business Patterns. Available online: https://www.census.gov/data/datasets/2017/econ/cbp/2017-cbp.html (accessed on 1 August 2022).
  20. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. JMLR 2011, 12, 2825–2830. [Google Scholar]
Figure 1. Box Plot with Model Performance: (a) Tonnage of Shipments by Origins for NAICS 333; (b) Tonnage of Shipments by Destinations for NAICS 337. The box extends from the first quartile (Q1) to the third quartile (Q3), and the whiskers extend to the farthest data points within 1.5× the interquartile range (Q3-Q1) from the edges of the box; the rhombus marks (♦) represent data points outside the range of the whiskers.
Figure 2. Example of Variable Importance in SVR—Value of Shipments by Origins for NAICS 322.
Table 1. Summary of Studies on the Modeling of Freight Generation.

| Study | Study Area | Data Source | Scope of Analysis | Variables Considered | Methods Used | Model Performance | Key Findings |
|---|---|---|---|---|---|---|---|
| Chin and Hwang [12] | United States | Commodity Flow Survey (CFS) | CFS Area and Industry Sector | Employment and Establishment Size | OLS Regression | Except for four models, R² > 0.70 | With additional modeling efforts, the developed models could be enhanced to allow transportation analysts to assess regional economic impacts. |
| Holguin-Veras et al. [3] | Colombia | Freight Origin-Destination Survey | Region (made up of municipality and 4 countries) | Gross Domestic Product (GDP), Existence of Port | OLS Regression | Adjusted R²: [0.86, 0.96] | On average $1600 of GDP is needed to produce a ton of freight. |
| Novak et al. [13] | United States | CFS and TranSearch | CFS Area | Population, Number of Employees, Port, Highway Length | OLS and Spatial Regression | R²: [0.33, 0.63] | It is recommended to avoid the overuse and addition of highly correlated explanatory variables such as employment and population even when this improves R²; spatial regression model is the preferred specification for freight generation at the national level. |
| Bagighni [14] | United States | Freight Analysis Framework (FAF) | FAF Zone and Commodity | Population, Median Age, Income, Number of Jobs by Industry Sector | OLS Regression | Adjusted R²: [0.54, 0.81] | It is possible to develop good freight volume estimating models for individual commodities using regression analysis; however, the level of success for each commodity model varies. |
| Oliveira-Neto et al. [5] | United States | CFS | State and Industry | Annual Payroll by Industry Sector | OLS Regression | R²: [0.40, 0.98] | Payroll can explain a significant portion of the freight production at the state level for the U.S. |
| Lim et al. [4] | California | FAF | FAF Zone and Commodity Group | Number of Employees, Population, Farmland Acres, Crop and Livestock Sales, Net Annual Electrical Generation using Coal | OLS Regression | R²: [0.21, 0.83] | Models without constant terms have a better fit than models with constant; model fit is dependent on the commodity grouping and the choice of explanatory variables. |
| Ha and Combes [15] | France | French Shipper Survey ECHO | Establishment | Employment, Economic Activity, Relations with Economic Agents, Production and Logistics Characteristics | One-way ANOVA and OLS Regression | R²: [0.16, 0.45] | The number of employees and the economic sector were identified as very important explanatory variables. |
| Mommens et al. [16] | Belgium | Freight volume data compiled from multiple sources | Traffic Analysis Zone and Commodity | Number of Employees, Establishment Size, Gross Floor Space, Population Density | OLS Regression | R²: [0.31, 0.69] | It is doubtful that the addition of new explanatory variables will improve the model fit and consequently improvements in model accuracy. |
| National Academies of Sciences, Engineering, and Medicine [17] | United States | CFS | Industry | Number of Employees | OLS Regression (linear and non-linear specifications) and Multiple Classification Analysis | Adjusted R²: [0.01, 0.73] | The use of the CFS in combination with complementary datasets provides an efficient way to estimate freight generation (FG) models for the entire nation at various levels of geography; non-linear models typically provide the best representation of FG patterns. |
| Krisztin [6] | European NUTS-2 regions | Eurostat | Country | Regional Share of Employment, Regional Share of Employment in Agriculture and Manufacturing, Length of Road Network, and Distance to the Closest Seaport | Spatial Autoregressive Model | Adjusted R²: [0.39, 0.87] | There are significant non-linearities related to employment rates in manufacturing and infrastructure capabilities in the study regions. |
Table 2. Descriptive Statistics of the Input Data for Shipments by Origins.
NAICS | Tonnage (Thousand Tons): N, Mean, Std. 1 | Value (Million $): N, Mean, Std. | Number of Establishments (Count): Mean, Std. | Number of Employees (Count): Mean, Std. | Annual Payroll (Million $): Mean, Std. | Receipt Total (Million $): Mean, Std.
21211923,57034,993118710130834351127174176128N/A 2N/A
311123486569571276255744719924111,46612,58450757943925726
31210712962083111138429486211016282689801606661339
3136789184101284629112857716042464121393
3147954195116214656354668716042566114521
3154392698140478422266863092188465364
3163416307657816121463525121236
32111319392876125889114210310830473593121142610810
32211214111827113166519692634210328731241776861076
32310816425212667583818222134694140155204535751
3249812,68724,707112462311,030101365513636815420676675
325122569411,20012757419853931175618708943963642969013
32612049761612918692205861055456669626031613911868
327106667569881299939311038528972594149138730753
331103159027521121915251227372346346014923910111881
33211489912841302700301640246310,91712,49455866324232860
333108308471125295933771692007815843848855022682774
3347931461182660497289160605210,593521102918113814
3359416022211510981256365721382938135208503881
3369710692324112809313,9838110411,61616,5177491144473610,549
33711512217812661088910112427294096113172424785
339100668112412741743203270425059282343869521652
42311046911783124534911,3771732373175442616931122876471
42329619234011878814309720313572644781554871150
4233961703209512913191544125128191220821081287591063
423487241564119418380842553754756763441887731237721
423511710581669124157324586710111241714701177372215
4236107275527125432586582083493931729337798628296764
423711320924613213001518143157197422801211517471021
423895563951128387944494344666081651338845825893641
4239992240406812517863464234495259043691362454401947
4241103362587121106218196711511151951641194721230
424286196782112629912,078691582188543825480720546738
4243861023821031360440911651316156106984039354719
424412530604220131658093452574556383878733848437817630
42458210,62818,11389231435244593531107428547861789
424611012302332122150627928411311311667831476461848
424711611,93722,794122684712,943363666493255140298624,341
424812752657912912621820284913761887861486001519
424989221445381292410312420930227593388134167235763
454199207433123431488643075384291669819435331297868
493111922352416121920711,19312014468489242297398208462
511177232493146211107129266144611644135081794
4543112043372212427239350685358592443148344
55111456362475691324189137540026,40634,60228234615N/AN/A
1 Std.: Standard Deviation; 2 N/A: Not Available.
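As a brief aside on how summaries like Tables 2 and 3 can be assembled, the sketch below uses pandas named aggregation on a hypothetical shipments table (the column names are assumptions, not the authors' data schema) to produce the sample size, mean, and standard deviation by NAICS code.

```python
import pandas as pd

# Hypothetical input: one record per industry (NAICS) and CFS area with the
# shipment measures and economic covariates summarized in Tables 2 and 3.
shipments = pd.DataFrame({
    "NAICS": ["212", "212", "311", "311"],
    "tons": [23570.0, 18200.0, 4865.0, 5100.0],
    "value": [710.0, 655.0, 6255.0, 6400.0],
    "ESTAB": [34, 29, 199, 210],
})

# Mirror the layout of the descriptive-statistics tables: N for the CFS
# measures plus mean and standard deviation of each column, by industry.
summary = shipments.groupby("NAICS").agg(
    N=("tons", "count"),
    tons_mean=("tons", "mean"), tons_std=("tons", "std"),
    value_mean=("value", "mean"), value_std=("value", "std"),
    estab_mean=("ESTAB", "mean"), estab_std=("ESTAB", "std"),
)
print(summary.round(1))
```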
Table 3. Descriptive Statistics of the Input Data for Shipments by Destinations.
NAICS | Tonnage (Thousand Tons): N, Mean, Std. 1 | Value (Million $): N, Mean, Std. | Number of Establishments (Count): Mean, Std. | Number of Employees (Count): Mean, Std. | Annual Payroll (Million $): Mean, Std. | Receipt Total (Million $): Mean, Std.
21213221,95122,7291327029193334109016827312444137
311132474451891326041721819824211,17512,30849856943285619
31213211881597132121225046510818992793981678441459
31313249100132224394122758015252562130375
31413240120132195448405194216173968181486
31513246132106198372086062848167858335
31613252013238796121763446123049
3211321708182413284773210210829983577119142602802
32213212361454132145417162635199528391171746471053
32313215423713265081917120732933913149195515722
32413210,43821,85713241859811101362013046314719156321
3251325412904413256458116881115305669841360041068648
32613249956013218262076861075338659226031714331910
327132648464471329829851028528992597149139733756
33113213261959132172121382940227934351442359581823
33213286410061322684270739445310,74912,27655065324012830
333132320468132284529691682017697842648054822492783
33413241551322437423988158590510,349504100217883735
3351321241491329811059375821392967135210518897
3361328581941132727612,1268010411,08716,1837151119450610,261
33713211511313258562310212527394075114172435785
3391325873132126115742142844431609525040411001844
42311326841202132508666521652283042426016229921896221
42321321892601327298441626224425,37733,3481758278413,13724,316
4234132208308132394252072293384274688237679028036957
4236132289399132416857412013373800705036495327356539
42381327021129132379136833884175444583134741023173259
42411323645741329921350852157116,78325,3331074189211,69332,900
424213216229413259647160641492025512123575919006339
424413230283764132653782512484406164848632646836517368
424713210,90521,166132646712,033323359685049127267722,080
4541132185232132409550093535594773697421636832387905
4931132208318991328955855011413764668827281380198443
511113231641321813547996199033111243073831331
55111413254012351321125154836139525,33634,02726964504N/A 2N/A
1 Std.: Standard Deviation; 2 N/A: Not Available.
Table 4. RMSE by Model Selection—Tonnage of Shipments by Origins for NAICS 212.
Transformation | Variables | OLS 1 | Lasso 2 | DTR 3 | RFR 4 | GBR 5 | SVR 6 | GPR 7 | MLP 8
No Log-Transform | EMP | 26,911 | 30,419 | 32,947 | 32,672 | 30,163 | 28,093 | 26,983 | 45,787
No Log-Transform | PAYANN | 29,914 | 32,346 | 32,898 | 32,687 | 27,125 | 26,705 | 29,997 | 51,983
No Log-Transform | ESTAB | 27,221 | 28,141 | 32,898 | 29,658 | 29,481 | 27,508 | 27,100 | 43,834
No Log-Transform | EMP, PAYANN | 28,814 | 30,419 | 32,877 | 30,184 | 28,469 | 27,328 | 30,583 | 41,621
No Log-Transform | EMP, ESTAB | 27,775 | 30,419 | 32,898 | 30,345 | 31,922 | 25,341 | 28,378 | 50,355
No Log-Transform | PAYANN, ESTAB | 30,015 | 32,346 | 32,898 | 30,785 | 31,769 | 26,215 | 30,306 | 45,076
No Log-Transform | EMP, PAYANN, ESTAB | 29,708 | 30,419 | 36,093 | 31,680 | 31,605 | 26,755 | 31,384 | 48,811
Log-Transform | EMP | 25,608 | 25,742 | 29,348 | 28,221 | 30,282 | 27,645 | 25,436 | 86,158
Log-Transform | PAYANN | 26,083 | 26,181 | 30,701 | 28,383 | 30,350 | 28,650 | 25,952 | 40,286
Log-Transform | ESTAB | 29,122 | 29,248 | 29,318 | 28,726 | 30,826 | 27,478 | 28,980 | 36,600
Log-Transform | EMP, PAYANN | 25,989 | 26,102 | 29,515 | 28,123 | 30,254 | 28,563 | 25,833 | 98,109
Log-Transform | EMP, ESTAB | 25,733 | 25,841 | 30,217 | 28,278 | 30,344 | 27,092 | 26,363 | 45,223
Log-Transform | PAYANN, ESTAB | 26,281 | 26,354 | 30,721 | 28,485 | 30,366 | 27,141 | 26,741 | 415,283
Log-Transform | EMP, PAYANN, ESTAB | 26,134 | 26,219 | 30,054 | 28,304 | 30,300 | 27,092 | 26,679 | 33,675
1 OLS: Ordinary Least Squares Regression, 2 Lasso: Least Absolute Shrinkage and Selection Operator, 3 DTR: Decision Tree Regression, 4 RFR: Random Forest Regression, 5 GBR: Gradient Boosting Regression, 6 SVR: Support Vector Regression, 7 GPR: Gaussian Process Regression, 8 MLP: Multi-layer Perceptron.
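Table 4 compares the eight algorithms across every combination of candidate variables, with and without a logarithmic transformation. The following is a minimal sketch of how such a grid could be evaluated with scikit-learn [20] and cross-validated RMSE; the synthetic data, the specific hyperparameter values, and the choice to log-transform only the predictors are illustrative assumptions rather than the authors' exact pipeline.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data for one industry: tonnage of shipments by origin plus the
# three candidate predictors compared in Table 4.
rng = np.random.default_rng(42)
n = 130
data = pd.DataFrame({
    "EMP": rng.lognormal(7, 1, n),
    "PAYANN": rng.lognormal(4, 1, n),
    "ESTAB": rng.lognormal(3, 1, n),
})
data["tons"] = 2.0 * data["EMP"] + rng.normal(0, 500, n)

models = {
    "OLS": LinearRegression(),
    "Lasso": Lasso(alpha=1.0),
    "DTR": DecisionTreeRegressor(max_depth=5, random_state=0),
    "RFR": RandomForestRegressor(n_estimators=200, random_state=0),
    "GBR": GradientBoostingRegressor(random_state=0),
    "SVR": SVR(kernel="rbf", C=1000.0),
    "GPR": GaussianProcessRegressor(alpha=1e-2),
    "MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
}
predictors = ["EMP", "PAYANN", "ESTAB"]

results = []
for use_log in (False, True):
    # Every non-empty combination of the candidate variables.
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            X = data[list(subset)]
            y = data["tons"]
            if use_log:
                X = np.log1p(X)  # log-transform the predictors for this branch
            for name, estimator in models.items():
                pipe = make_pipeline(StandardScaler(), estimator)
                rmse = -cross_val_score(
                    pipe, X, y, cv=5,
                    scoring="neg_root_mean_squared_error").mean()
                results.append((use_log, ", ".join(subset), name, rmse))

table = pd.DataFrame(results, columns=["log", "variables", "model", "RMSE"])
print(table.sort_values("RMSE").head(10))
```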
Table 5. Significance of Improvement by ML algorithms over OLS—Shipments by Origins.
NAICS | Measure | Alternative | RMSE: OLS | RMSE: Alternative | RMSE: % Dif. | t-Test: Stat. | t-Test: p-Value | Wilcoxon: Stat. | Wilcoxon: p-Value
212tonsSVR25,73325,341−1.5%2.410.018 *17800.01 *
valueSVR466449−3.8%4.22<0.0005 *15660.001 *
311tonsGPR44473963−10.9%11.31<0.0005 *131<0.0005 *
valueSVR30202576−14.7%8.75<0.0005 *541<0.0005 *
312tonsSVR16671414−15.2%5.79<0.0005 *891<0.0005 *
valueSVR19851894−4.6%5.84<0.0005 *894<0.0005 *
313tonsOLS79------
valueSVR239226−5.5%4.80<0.0005 *1288<0.0005 *
314tonsGPR6747−29.1%7.74<0.0005 *318<0.0005 *
valueSVR158144−9.2%2.780.006 *1413<0.0005 *
315tonsOLS11------
valueOLS135------
316tonsGPR2626−0.1%1.200.23216800.004 *
valueSVR6866−2.7%3.570.001 *17130.005 *
321tonsRFR19461605−17.5%7.76<0.0005 *721<0.0005 *
valueSVR384311−19.2%11.19<0.0005 *247<0.0005 *
322tonsSVR16271492−8.3%12.85<0.0005 *161<0.0005 *
valueSVR11671128−3.4%4.66<0.0005 *1148<0.0005 *
323tonsSVR179150−16.1%11.12<0.0005 *183<0.0005 *
valueSVR244241−1.4%0.980.33121860.244
324tonsSVR11,29010,730−5.0%3.86<0.0005 *15230.001 *
valueSVR50094389−12.4%6.45<0.0005 *506<0.0005 *
325tonsGPR88878783−1.2%8.09<0.0005 *298<0.0005 *
valueSVR35923409−5.1%3.360.001 *19450.046 *
326tonsSVR319279−12.4%10.15<0.0005 *390<0.0005 *
valueSVR771732−5.1%7.72<0.0005 *569<0.0005 *
327tonsSVR47684441−6.9%9.39<0.0005 *383<0.0005 *
valueSVR361355−1.5%1.510.13319090.034 *
331tonsSVR15251341−12.1%5.78<0.0005 *875<0.0005 *
valueSVR966926−4.1%7.04<0.0005 *727<0.0005 *
332tonsSVR925909−1.7%1.070.28624730.858
valueSVR627622−0.8%1.690.09518260.016 *
333tonsSVR333231−30.6%18.83<0.0005 *9<0.0005 *
valueSVR13911186−14.7%14.25<0.0005 *51<0.0005 *
334tonsSVR4237−13.4%14.22<0.0005 *-<0.0005 *
valueOLS1085------
335tonsSVR216201−6.8%10.93<0.0005 *347<0.0005 *
valueOLS730------
336tonsOLS1662------
valueSVR44784437−0.9%1.160.24919180.037 *
337tonsSVR10896−11.2%10.16<0.0005 *442<0.0005 *
valueOLS225------
339tonsDTR6966−4.3%3.020.003 *16870.004 *
valueSVR871642−26.3%6.39<0.0005 *739<0.0005 *
4231tonsOLS876------
valueOLS4492------
4232tonsSVR131118−9.9%4.91<0.0005 *1418<0.0005 *
valueSVR296289−2.4%2.310.023 *24320.749
4233tonsGBR12951110−14.3%6.92<0.0005 *801<0.0005 *
valueSVR446419−6.1%5.63<0.0005 *963<0.0005 *
4234tonsSVR333327−1.7%1.040.29919680.055
valueOLS2746------
4235tonsSVR969878−9.4%3.470.001 *15200.001 *
valueOLS1031------
4236tonsLasso372371−0.3%1.540.12623120.464
valueSVR39053686−5.6%4.04<0.0005 *1291<0.0005 *
4237tonsOLS118------
valueSVR543513−5.5%6.31<0.0005 *924<0.0005 *
4238tonsSVR544519−4.7%1.030.30520480.101
valueLasso13911382−0.6%0.920.3622970.433
4239tonsLasso14941477−1.1%0.750.45223560.561
valueLasso793743−6.4%1.720.08922750.39
4241tonsOLS382------
valueOLS843------
4242tonsSVR574568−1.0%3.330.001 *16420.002 *
valueOLS8976------
4243tonsOLS96------
valueSVR1175845−28.1%5.69<0.0005 *971<0.0005 *
4244tonsOLS1154------
valueOLS1949------
4245tonsRFR87718485−3.3%1.100.27523670.587
valueDTR18091695−6.3%2.200.03 *17000.005 *
4246tonsOLS1525------
valueOLS1075------
4247tonsOLS16,491------
valueSVR84908245−2.9%1.780.07821710.224
4248tonsOLS314------
valueSVR614571−6.9%3.69<0.0005 *17530.008 *
4249tonsSVR26082235−14.3%4.93<0.0005 *15150.001 *
valueSVR16851652−2.0%3.130.002 *15750.001 *
4541tonsSVR282276−2.2%2.950.004 *16610.003 *
valueSVR51024833−5.3%4.54<0.0005 *1367<0.0005 *
45431tonsSVR324303−6.6%3.82<0.0005 *15200.001 *
valueOLS119------
4931tonsSVR14941394−6.7%5.97<0.0005 *231<0.0005 *
valueSVR58545505−6.0%3.98<0.0005 *794<0.0005 *
5111tonsSVR2120−6.3%8.79<0.0005 *493<0.0005 *
valueSVR183182−0.4%0.940.35218520.021 *
551114tonsRFR493480−2.7%2.390.019 *16800.004 *
valueDTR17331653−4.7%3.80<0.0005 *1503<0.0005 *
(* p-value < 0.05)
OLS: Ordinary Least Squares Regression, Lasso: Least Absolute Shrinkage and Selection Operator, DTR: Decision Tree Regression, RFR: Random Forest Regression, GBR: Gradient Boosting Regression, SVR: Support Vector Regression, GPR: Gaussian Process Regression, MLP: Multi-layer Perceptron.
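The statistical comparison reported in Tables 5 and 6 pairs each alternative model's error with the OLS error and applies a paired t-test and a Wilcoxon signed-rank test. A minimal sketch with scipy.stats is shown below; the paired RMSE samples are synthetic placeholders, since the paper's per-repetition errors are not reproduced here.

```python
import numpy as np
from scipy import stats

# Hypothetical paired RMSE samples: one value per evaluation repetition for the
# OLS baseline and for the best alternative ML model of the same industry.
rng = np.random.default_rng(1)
rmse_ols = rng.normal(loc=25733, scale=800, size=100)
rmse_alt = rmse_ols - rng.normal(loc=390, scale=600, size=100)

t_stat, t_p = stats.ttest_rel(rmse_ols, rmse_alt)   # paired t-test
w_stat, w_p = stats.wilcoxon(rmse_ols, rmse_alt)    # Wilcoxon signed-rank test

pct_dif = (rmse_alt.mean() - rmse_ols.mean()) / rmse_ols.mean() * 100
print(f"% Dif.: {pct_dif:.1f}%")
print(f"t-test:   stat = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Wilcoxon: stat = {w_stat:.0f}, p = {w_p:.4f}")
```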
Table 6. Significance of Improvement by ML algorithms over OLS—Shipments by Destinations.
NAICS | Measure | Alternative | RMSE: OLS | RMSE: Alternative | RMSE: % Dif. | t-Test: Stat. | t-Test: p-Value | Wilcoxon: Stat. | Wilcoxon: p-Value
212tonsRFR18,45217,962−2.7%2.670.009 *17950.012 *
valueSVR770756−1.8%5.20<0.0005 *1042<0.0005 *
311tonsGPR20832078−0.3%0.540.5920950.139
valueSVR35373482−1.6%1.870.06518430.019 *
312tonsSVR10771050−2.6%4.86<0.0005 *1038<0.0005 *
valueSVR18271796−1.7%4.08<0.0005 *15380.001 *
313tonsGPR7373−0.3%0.470.63816430.002 *
valueSVR261254−2.9%4.28<0.0005 *1332<0.0005 *
314tonsGPR4743−7.6%6.17<0.0005 *776<0.0005 *
valueGPR121120−0.2%2.130.036 *20360.093
315tonsGPR44−3.8%3.82<0.0005 *1428<0.0005 *
valueGPR10691−13.8%7.13<0.0005 *728<0.0005 *
316tonsMLP1514−6.6%1.320.18925130.967
valueSVR6149−20.2%7.25<0.0005 *657<0.0005 *
321tonsSVR1154961−16.8%18.37<0.0005 *14<0.0005 *
valueRFR443419−5.4%4.67<0.0005 *1269<0.0005 *
322tonsSVR709693−2.2%3.280.001 *18830.027 *
valueGPR691691−0.1%1.090.27822630.368
323tonsGPR135133−2.0%5.03<0.0005 *852<0.0005 *
valueSVR253244−3.6%1.580.11822430.332
324tonsGPR10,03110,017−0.1%1.380.1723050.449
valueGPR44763775−15.7%11.84<0.0005 *175<0.0005 *
325tonsGPR54425437−0.1%1.220.22621940.255
valueGPR39603957−0.1%2.340.021 *19520.049 *
326tonsSVR260257−1.0%1.820.07220160.08
valueSVR868864−0.4%0.810.41823210.483
327tonsDTR44364119−7.1%4.22<0.0005 *1347<0.0005 *
valueSVR364355−2.4%4.55<0.0005 *1037<0.0005 *
331tonsGPR12761244−2.5%7.74<0.0005 *701<0.0005 *
valueGPR12981270−2.2%7.75<0.0005 *701<0.0005 *
332tonsSVR692615−11.1%5.62<0.0005 *989<0.0005 *
valueSVR11501114−3.1%2.010.047 *23230.487
333tonsSVR353284−19.6%9.52<0.0005 *159<0.0005 *
valueSVR18011782−1.0%2.170.032 *18600.022 *
334tonsGPR4342−3.3%6.20<0.0005 *889<0.0005 *
valueOLS2181------
335tonsGBR132127−3.6%1.940.05519100.034 *
valueSVR656639−2.6%3.240.002 *24180.713
336tonsGPR14831049−29.3%6.61<0.0005 *1018<0.0005 *
valueSVR59105813−1.6%0.870.38721430.189
337tonsGPR6161−0.3%0.890.37625080.953
valueSVR304269−11.6%10.23<0.0005 *261<0.0005 *
339tonsGPR3938−2.4%6.02<0.0005 *856<0.0005 *
valueRFR853717−16.0%7.52<0.0005 *696<0.0005 *
4231tonsSVR612576−5.9%4.14<0.0005 *1496<0.0005 *
valueLasso201920180.0%2.160.033 *22330.315
4232tonsOLS178------
valueSVR371335−9.8%3.450.001 *23230.487
4234tonsGPR168165−1.6%2.280.025 *1476<0.0005 *
valueSVR17651697−3.9%3.98<0.0005 *1463<0.0005 *
4236tonsGPR264261−1.4%3.95<0.0005 *15360.001 *
valueOLS2458------
4238tonsSVR861849−1.4%2.340.021 *1502<0.0005 *
valueSVR18921817−4.0%4.53<0.0005 *1040<0.0005 *
4241tonsGPR311306−1.9%2.840.005 *1502<0.0005 *
valueGPR580560−3.5%6.63<0.0005 *646<0.0005 *
4242tonsSVR218212−3.1%3.500.001 *18190.015 *
valueSVR44063845−12.7%6.13<0.0005 *940<0.0005 *
4244tonsGPR11511056−8.3%7.66<0.0005 *513<0.0005 *
valueGPR17371639−5.6%6.49<0.0005 *628<0.0005 *
4247tonsGPR15,04514,784−1.7%2.260.026 *15940.001 *
valueGPR76667344−4.2%4.15<0.0005 *1416<0.0005 *
4541tonsSVR131121−7.4%5.39<0.0005 *1185<0.0005 *
valueSVR15961552−2.8%2.110.037 *22100.279
4931tonsGPR11311014−10.3%6.12<0.0005 *863<0.0005 *
valueGPR43583964−9.0%8.70<0.0005 *586<0.0005 *
5111tonsGBR4340−6.5%1.980.05123140.468
valueGPR204201−1.8%4.63<0.0005 *18710.025 *
551114tonsRFR1012984−2.8%3.93<0.0005 *1302<0.0005 *
valueSVR11941157−3.1%7.04<0.0005 *609<0.0005 *
(* p-value < 0.05)
OLS: Ordinary Least Squares Regression, Lasso: Least Absolute Shrinkage and Selection Operator, DTR: Decision Tree Regression, RFR: Random Forest Regression, GBR: Gradient Boosting Regression, SVR: Support Vector Regression, GPR: Gaussian Process Regression, MLP: Multi-layer Perceptron.
Table 7. Final Freight Generation Model Selection.
NAICS | Measure | Shipments by Origins (Freight Production): Model, Log, ESTAB, EMP, PAYANN, RCPTOT | Shipments by Destinations (Freight Attraction): Model, Log, ESTAB, EMP, PAYANN, RCPTOT
212tonsSVRNo RFRYes
valueSVRNo SVRYes
311tonsGPRYes GPRNo
valueSVRYes SVRNo
312tonsSVRNo SVRYes
valueSVRYesSVRYes
313tonsOLSNo GPRYes
valueSVRNo SVRYes
314tonsGPRNo GPRNo
valueSVRNo GPRNo
315tonsOLSNo GPRYes
valueOLSNo GPRYes
316tonsGPRYes MLPYes
valueSVRNo SVRNo
321tonsRFRNo SVRYes
valueSVRYes RFRYes
322tonsSVRYes SVRNo
valueSVRNoGPRNo
323tonsSVRYes GPRYes
valueSVRYes SVRYes
324tonsSVRNoGPRNo
valueSVRNoGPRNo
325tonsGPRYes GPRNo
valueSVRNo GPRNo
326tonsSVRNo SVRNo
valueSVRNo SVRNo
327tonsSVRNo DTRYes
valueSVRYes SVRNo
331tonsSVRYes GPRYes
valueSVRNoGPRYes
332tonsSVRNo SVRNo
valueSVRYes SVRNo
333tonsSVRNo SVRNo
valueSVRNo SVRNo
334tonsSVRNo GPRNo
valueOLSNo OLSNo
335tonsSVRYes GBRYes
valueOLSNo SVRNo
336tonsOLSYes GPRNo
valueSVRNoSVRNo
337tonsSVRNo GPRYes
valueOLSNo SVRYes
339tonsDTRYes GPRNo
valueSVRNo RFRYes
4231tonsOLSYes SVRYes
valueOLSYes LassoNo
4232tonsSVRNo OLSYes
valueSVRNo SVRNo
4233tonsGBRNo N/A
valueSVRNo
4234tonsSVRYes GPRYes
valueOLSYes SVRNo
4235tonsSVRNo N/A
valueOLSNo
4236tonsLassoYes GPRYes
valueSVRNoOLSYes
4237tonsOLSYes N/A
valueSVRNo
4238tonsSVRYes SVRYes
valueLassoYes SVRYes
4239tonsLassoYes N/A
valueLassoNo
4241tonsOLSYes GPRYes
valueOLSYes GPRYes
4242tonsSVRNo SVRYes
valueOLSYes SVRYes
4243tonsOLSNo N/A
valueSVRNo
4244tonsOLSNo GPRYes
valueOLSNo GPRYes
4245tonsRFRYes N/A
valueDTRNo
4246tonsOLSYes N/A
valueOLSNo
4247tonsOLSYes GPRYes
valueSVRYes GPRYes
4248tonsOLSNo N/A
valueSVRNo
4249tonsSVRYesN/A
valueSVRYes
4541tonsSVRYes SVRNo
valueSVRYes SVRNo
4931tonsSVRYes GPRNo
valueSVRYes GPRNo
5111tonsSVRYes GBRYes
valueSVRYes GPRYes
45431tonsSVRNo N/A
valueOLSNo
551114tonsRFRYes RFRYes
valueDTRYes SVRYes
(√: the variable is included in the final model)
OLS: Ordinary Least Squares Regression, Lasso: Least Absolute Shrinkage and Selection Operator, DTR: Decision Tree Regression, RFR: Random Forest Regression, GBR: Gradient Boosting Regression, SVR: Support Vector Regression, GPR: Gaussian Process Regression, MLP: Multi-layer Perceptron.
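Conceptually, the final selection in Table 7 keeps, for each industry and measure, the configuration (algorithm, log option, and variable set) with the lowest cross-validated RMSE. A small sketch of that selection step is given below, using illustrative values taken from Table 4; the column layout mirrors the earlier comparison sketch and is not the authors' code.

```python
import pandas as pd

# Illustrative comparison results for one industry/measure, in the layout of
# Table 4: one row per (transformation, variable set, algorithm) with its RMSE.
comparison = pd.DataFrame(
    [("No Log-Transform", "EMP, ESTAB", "SVR", 25341.0),
     ("Log-Transform", "EMP", "OLS", 25608.0),
     ("Log-Transform", "EMP", "GPR", 25436.0)],
    columns=["transform", "variables", "model", "RMSE"],
)

# Final model selection: keep the lowest-RMSE configuration.
best = comparison.loc[comparison["RMSE"].idxmin()]
print(f"Selected: {best['model']} ({best['transform']}; variables: {best['variables']})")
```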
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
