Article

MSCL-Attention: A Multi-Scale Convolutional Long Short-Term Memory (LSTM) Attention Network for Predicting CO2 Emissions from Vehicles

Intelligent Information Center, Shanghai Advanced Research Institute, Chinese Academy of Sciences, No. 99, Haike Road, Zhangjiang Hi-Tech Park, PuDong, Shanghai 201210, China
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(19), 8547; https://doi.org/10.3390/su16198547
Submission received: 4 September 2024 / Revised: 21 September 2024 / Accepted: 27 September 2024 / Published: 1 October 2024

Abstract

The transportation industry is one of the major sources of energy consumption and CO2 emissions, and these emissions have been increasing year by year. Vehicle exhaust emissions have had serious impacts on air quality and global climate change, with CO2 emissions being one of the primary causes of global warming. In order to accurately predict the CO2 emission level of automobiles, an MSCL-Attention model based on a multi-scale convolutional neural network, long short-term memory network and multi-head self-attention mechanism is proposed in this study. By combining multi-scale feature extraction, temporal sequence dependency processing, and the self-attention mechanism, the model enhances the prediction accuracy and robustness. In our experiments, the MSCL-Attention model is benchmarked against the latest state-of-the-art models in the field. The results indicate that the MSCL-Attention model demonstrates superior performance in the task of CO2 emission prediction, surpassing the leading models currently available. This study provides a new method for predicting vehicle exhaust emissions, with significant application prospects, and is expected to contribute to reducing global vehicle emissions, improving air quality, and addressing climate change.

1. Introduction

The concentration of carbon dioxide in the atmosphere has increased significantly over the past hundred years and is increasing at a rate of 2 ppm per year [1]. With the annual increase in atmospheric greenhouse gas emissions, global warming has become the most serious environmental problem faced by humankind [2]. Excessive CO2 emissions not only lead to extreme weather events such as droughts but also continue to impact natural ecosystems and socio-economic conditions [3]. To address the challenge of the greenhouse effect, the international community has established policies such as the Kyoto Protocol and the Paris Agreement, calling on all nations to work together to reduce carbon emissions [4].
According to data released by the International Energy Agency (IEA), the transportation industry has been one of the major sources of energy consumption and CO2 emissions from 1990 to 2021, as shown in Figure 1. In 2021, global CO2 emissions from the transport sector reached 7631.5 Mt, accounting for 22.7% of the global total and ranking second among all sectors [5]. Additionally, over the past few decades, CO2 emissions from the transport sector have grown the fastest, and this share is expected to increase to 41% by 2030 [6]. The construction of infrastructure such as highways, high-speed railways, and urban rail systems has directly increased energy consumption and carbon emissions by boosting transportation activities [7]. Furthermore, the entire lifecycle of transportation infrastructure is associated with the consumption of large amounts of fossil fuels [8]. Transportation produces not only atmospheric pollutants such as NOx, CO, and PM2.5, but also greenhouse gases such as CO2, CH4, and N2O [9]. These emissions can lead to smoggy weather, photochemical smog, and the greenhouse effect [10]. Prolonged exposure to pollutants can induce respiratory system inflammation, fibrosis, and adverse immune responses, posing a threat to human health [11]. The World Health Organization (WHO) estimates that air pollution accounts for approximately 6.9 million deaths globally. The latest update from the Global Burden of Disease (GBD) study identifies particulate matter (PM) as the fourth largest cause of mortality among 85 assessed risk factors, being responsible for over 5 million deaths in 2017 [12]. Therefore, reducing transportation emissions is of great importance for environmental protection and safeguarding human health.
With the rapid development of artificial intelligence technology, the research on carbon emission prediction methods has also transitioned from traditional approaches to machine learning and deep learning. Traditional methods primarily involve statistical models, which are the most commonly used in carbon emission prediction. Gray system theory [13], specifically designed for systems with partial unknowns, small samples, and limited information, has been widely applied. As traditional gray prediction models have been continuously used and updated, many improved gray models have been proposed. Liao et al. [14] established a comprehensive prediction model for green traffic by using the gray correlation analysis method. In addition, some methods rely on data sampling and GPS technology. Wyatt et al. [15] emphasized that to accurately estimate vehicle emissions in real-world driving, road grade quantification per second is essential. Therefore, they proposed a simplified road grade estimation method based on Light Detection and Ranging (LiDAR) and the Geographic Information System (GIS) and used the PHEM emission model to estimate CO2 emissions. Classical regression methods have also been applied for carbon emission prediction, including stepwise multiple regression modeling [16] and linear regression [17]. There are also time series methods used to predict future carbon emissions, such as the STIRPAT-based model designed by Li et al. [18] to predict China’s transportation sector’s carbon emissions over the next few decades. Although traditional methods are intuitive, they may struggle to comprehensively consider various influencing factors when dealing with complex, high-dimensional carbon emission problems.
Machine learning and deep learning have emerged as prominent research areas in the development of carbon emission prediction models, owing to their high efficiency and accuracy [19]. The backpropagation neural network (BPNN) [20], trained using the error backpropagation algorithm, has seen increasing application in carbon emission forecasting and has achieved considerable success. However, despite its simplicity, the BPNN is susceptible to becoming trapped in local optima and exhibits limited capacity in handling large-scale, complex datasets. The support vector machine (SVM) [21], a powerful machine learning technique grounded in statistical learning theory, has been widely adopted for regression estimation. Nevertheless, SVMs are highly sensitive to the size of the dataset and often incur significant computational costs and extended training times when applied to large or high-dimensional datasets. Random forests (RFs) [22] have demonstrated strong performance in selecting influential factors, effectively identifying key determinants of carbon emissions. However, when applied to time series data, RFs are less effective compared to specialized temporal models. Moreover, Doreswamy et al. [23] employed linear allocation and convolutional autoencoders to construct a clustering model for spatiotemporal analysis, enabling the identification of patterns and trends within datasets to investigate the distribution and impacts of air pollutants. Various neural network architectures have also been explored in this domain, including artificial neural networks (ANNs) [24], convolutional neural networks (CNNs) [25], and long short-term memory (LSTM) networks, as well as bidirectional LSTMs (BiLSTMs), which are capable of capturing temporal dependencies [26]. Despite the successes of these approaches, they present notable limitations in the context of carbon emission prediction. ANNs require high-quality data to achieve optimal performance, and while CNNs excel at capturing local features, their effectiveness diminishes when handling long sequences. LSTM and BiLSTM networks, though adept at modeling temporal dependencies, involve prolonged training times, demand substantial computational resources, and can suffer from vanishing or exploding gradients.
In addition, combined prediction models are also of great significance in carbon emission prediction. These models integrate multiple individual prediction models into a unified model [27]. For instance, Meng et al. [28] combined the GM(1,1) prediction equation with a linear model to create a hybrid equation for predicting China's carbon emissions, an example of combining two statistical methods; other hybrids pair statistical methods with machine learning, or combine several machine learning models. Liu et al. [29] developed a combined model based on gray relational analysis and bidirectional long short-term memory (GRA-BiLSTM) to predict carbon emissions from new energy vehicles. Fei et al. [30] proposed a hybrid deep learning framework that combines temporal convolutional networks (TCNs), CNNs, LSTM, and an autoregressive (AR) decomposition model for predicting long-term and short-term multivariate vehicle exhaust emissions. Machine learning and ensemble methods, however, depend on the quality and quantity of the available data. Because meteorological factors and traffic flow are inherently unpredictable and imbalanced, vehicle exhaust emissions exhibit significant non-stationary and non-linear characteristics, so a more efficient and reliable approach to predicting vehicle carbon emissions is needed to help prevent air pollution. Moreover, while ensemble prediction models exhibit notable advantages in carbon emission forecasting, they are often characterized by a complex construction process, and their optimization and tuning present considerable challenges.
In summary, while machine learning and ensemble methods exhibit considerable potential in the realm of carbon emission forecasting, their efficacy is largely contingent upon the quality and quantity of the underlying data, which poses challenges when managing complex and imbalanced datasets. The unpredictable and non-stationary characteristics of meteorological variables and traffic flow contribute to significant non-linear dynamics in vehicular exhaust emissions, highlighting the limitations of existing models in addressing these complexities. Therefore, there is an urgent need for the development of a more efficient and robust predictive framework to effectively contend with the uncertainties inherent in automotive carbon emissions, thereby enhancing strategies to mitigate atmospheric pollution.
This study proposes a multi-scale feature fusion network based on a CNN, LSTM, and attention mechanisms (MSCL-Attention) for vehicle carbon emission prediction. The CNN is used for feature extraction, leveraging multiple convolutional kernels to capture vehicle features at different scales, such as the engine displacement and the number of cylinders. The LSTM layer captures the long-term dependencies in the time series data, identifying patterns in how vehicle features change over time. The multi-head attention mechanism further applies weighted processing to the extracted multi-scale features, highlighting key features and suppressing noise information. Experimental results from an actual vehicle carbon emission dataset demonstrate that this multi-scale feature fusion network achieves higher prediction accuracy compared to single models and effectively captures complex relationships between vehicle features.
The rest of this article is organized as follows. Section 2 preprocesses and analyzes the dataset and introduces the methods used. Section 3 presents and discusses the results obtained from the models. Section 4 evaluates the proposed method and compares it with other algorithms. Finally, Section 5 summarizes the research and outlines directions for future work.

2. Materials and Methods

This study is divided into three key parts, as shown in Figure 2, including data preprocessing, model design, and result analysis. The specific content is presented in the corresponding sections.

2.1. Dataset Description

The data used in this study were obtained from an open-source dataset on the Kaggle website. This dataset comes from the Canadian government’s open data portal and provides detailed information on how vehicle CO2 emissions vary with different features. It contains seven years of data from 2014 to 2020, with a total of 7385 rows and 12 columns. Table 1 shows the specific features of the dataset.

2.2. Data Analysis

First, we check the dataset for missing values and duplicate entries; we find no null values and no duplicate records. Next, the feature variables are analyzed visually with boxplots. Boxplots reflect the central location and dispersion range of one or more groups of continuous quantitative data. In this study, we plot a boxplot for each numerical feature variable to better understand the distribution characteristics of these variables and their potential relationships with CO2 emissions. The boxplots are shown in Figure 3. Meanwhile, Table 2 presents the statistical analysis of the data features.
The box portion of the boxplot represents the interquartile range (IQR), with the median (i.e., the 50th percentile of the data) shown as a line. The lower and upper edges of the box correspond to the lower quartile (Q1, 25th percentile) and the upper quartile (Q3, 75th percentile), respectively. The "whiskers" of the boxplot extend to the minimum and maximum values of the data. The box contains the middle 50% of the dataset, and a narrower box indicates that the data distribution is more concentrated. The individual points beyond the "whiskers" represent outliers in the data. Such outliers may be anomalies, but they can also represent important information and should not be dismissed as mere errors or noise.
The statistical analysis of the dataset allows for a more nuanced understanding of the distribution characteristics of each feature variable and their potential impact on CO2 emissions. As demonstrated in Table 2, the proximity of the mean and median values across the features suggests a general symmetry, while the standard deviation provides insights into the variability within the dataset. Notably, features such as “Engine Size (L)” and “Cylinders” exhibit lower standard deviations, indicating a higher concentration of values, whereas “Fuel Consumption Comb (mpg)” and “CO2 Emissions (g/km)” present higher standard deviations, reflecting significant variability. Additionally, the distribution of most features appears to be right-skewed, particularly those associated with fuel consumption and CO2 emissions, which may indicate a tendency toward higher values for these variables.
In the in-depth analysis of the relationships among various feature variables, correlation analysis serves as a fundamental approach. By computing the correlation coefficients between different variables, we can discern which features exert a significant influence on CO2 emissions, thereby informing potential model refinements. To facilitate a more intuitive understanding of these inter-variable relationships, a heatmap was constructed to illustrate the correlations among the numerical features, as depicted in Figure 4. Categorical features were encoded to integrate all the variables into the correlation heatmap. Within this heatmap, the shading of each cell denotes the magnitude of the correlation coefficient between pairs of variables, providing a visual representation of the strength and direction of these associations.
Upon examination, it is apparent that features such as the Engine Size, Cylinders, and Fuel Consumption indicators (City, Highway, Combined) exhibit a pronounced positive correlation with CO2 emissions. In contrast, the Fuel Consumption Comb (mpg) shows a significant negative correlation with CO2 emissions, underscoring that higher fuel efficiency typically results in reduced emissions. Moreover, categorical variables such as the Make, Model, Vehicle Class, Transmission, and Fuel Type also demonstrate a discernible influence on CO2 emissions.
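As a companion to this analysis, the sketch below shows how the boxplots and the correlation heatmap can be reproduced with pandas, Matplotlib, and seaborn. The file name co2_emissions.csv and the plot layout are hypothetical assumptions; the column types follow Table 1.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("co2_emissions.csv")   # hypothetical local copy of the Kaggle dataset

# Boxplots of the numerical features (cf. Figure 3).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols].plot(kind="box", subplots=True, layout=(2, 4),
                      figsize=(14, 6), sharey=False)
plt.tight_layout()
plt.show()

# Encode categorical features so every variable enters the correlation matrix.
encoded = df.copy()
for col in encoded.select_dtypes(include="object").columns:
    encoded[col] = encoded[col].astype("category").cat.codes

# Correlation heatmap (cf. Figure 4).
sns.heatmap(encoded.corr(), cmap="coolwarm", center=0)
plt.show()
```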

2.3. Data Processing

To obtain more precise carbon emission fitting and prediction results, we undertook rigorous data preprocessing, encompassing both outlier management and normalization. The meticulous treatment of outliers is essential for mitigating the impact of extreme values that could potentially distort the model’s accuracy. Normalization was employed to standardize all the feature values within a uniform scale, thereby ensuring dimensional consistency across features and enhancing the stability of the data. Furthermore, to maximize the utility of the categorical information, we encoded the categorical variables “Transmission” and “Fuel Type” into numerical representations. Subsequently, all the numerical features were incorporated into the analysis.

2.3.1. Isolation Forest

Isolation Forest (iForest) [31] is an unsupervised anomaly detection algorithm grounded in tree-based structures. This approach constructs an ensemble of random decision trees to isolate individual data points, utilizing the path length of these points within the trees as a criterion for anomaly detection. In contrast to conventional anomaly detection methodologies that are predicated on density or distance metrics, iForest exploits the principle that anomalies are more readily isolated due to their distinctiveness within the dataset. This method is characterized by its linear time complexity and minimal memory consumption, making it both computationally efficient and scalable [32].
The fundamental concept underlying iForest is that anomalies are more susceptible to isolation compared to normal data points. In the framework of random decision trees, anomalies, due to their relative isolation from the bulk of the data, are more easily segregated during the random selection of features and partitioning points. Consequently, anomalies tend to achieve shallower isolation depths. By constructing an ensemble of such decision trees, iForest evaluates the average isolation depth across these trees as a metric for anomaly detection. Points exhibiting shallower isolation depths are more likely to be classified as anomalies, reflecting their inherent tendency to be more readily isolated from the majority of the data.
The algorithmic procedure of the Isolation Forest can be delineated as follows:
(1)
Construction of the Random Forest. Given a dataset X, initiate the process by generating multiple bootstrap samples. Each bootstrap sample comprises a random subset of the original dataset, typically of equal size. For each bootstrap sample, construct a random decision tree through the subsequent steps.
(a)
Randomly select a feature from the dataset.
(b)
For the chosen feature, randomly determine a split point, thereby partitioning the data into two distinct subsets.
(c)
Recursively apply this procedure to each subset until either each subset is reduced to a single data point or the predefined maximum tree depth is attained.
(2)
Isolation Depth Calculation. For each data point $x$, the algorithm traverses all the decision trees to determine its path length $h(x)$, defined as the number of splits required to isolate $x$, corresponding to the distance from the root node to the leaf node where $x$ resides. The average path length $E(h(x))$ across all the trees is used as a measure of the isolation depth of the data point. Given that anomalies are generally more distinct from other data points, the random splits are more likely to isolate them early in the tree, resulting in relatively shorter path lengths, thereby indicating their anomalous nature.
(3)
Anomaly Score Determination. Leveraging the path lengths, the algorithm computes an anomaly score for each data point. A normalization factor $c(n)$ is introduced to normalize the path lengths across datasets of different sizes, thereby ensuring that the anomaly scores remain consistent and comparable. For a dataset of size $n$, the normalization factor $c(n)$ is mathematically expressed as follows:
$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$ (1)
where $H(i)$ represents the $i$-th harmonic number, which is calculated as follows:
$H(i) = \sum_{k=1}^{i} \frac{1}{k}$ (2)
The anomaly score is derived by integrating the average path length $E(h(x))$ with the normalization factor $c(n)$. This score quantifies the extent to which a data point is isolated within the ensemble of trees, with a higher score indicating a stronger likelihood of the point being anomalous. The anomaly score $s(x, n)$ is formally defined as follows:
$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}}$ (3)
An anomaly score $s(x, n)$ approaching 1 suggests a high probability of the data point $x$ being an anomaly, whereas a score near 0.5 indicates that the data point is more likely to be part of the normal data distribution. Following the application of the Isolation Forest, the processed dataset comprises 6192 instances.
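A minimal sketch of this outlier-removal step using scikit-learn's IsolationForest is shown below; the hyperparameters (number of trees, contamination setting, random seed) are illustrative assumptions rather than the authors' exact configuration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def remove_outliers(X: np.ndarray) -> np.ndarray:
    """Fit an Isolation Forest and keep only the inliers (label +1)."""
    iforest = IsolationForest(n_estimators=100, contamination="auto",
                              random_state=42)
    labels = iforest.fit_predict(X)   # +1 = inlier, -1 = outlier
    return X[labels == 1]

# X: numeric feature matrix (e.g., the encoded DataFrame from Section 2.2).
X_inliers = remove_outliers(X)        # the paper reports 6192 remaining rows
```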

2.3.2. Normalization

To enhance the precision of the model training and prediction, and to mitigate the scale discrepancies among different features, we utilize the min–max normalization technique. Min–max normalization is a data preprocessing approach that systematically scales features to a predefined range, most commonly [0, 1]. The mathematical formulation of this normalization process is as follows:
$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$ (4)
where $x_i$ denotes the original value of the feature, while $x_{\min}$ and $x_{\max}$ represent the minimum and maximum values that the feature assumes within the dataset, respectively. The normalized value $x_i'$ is consequently scaled to fall within the range [0, 1]. This normalization process effectively standardizes the scales of all the features, ensuring that the model's sensitivity to each feature is equitably balanced during both training and prediction. By mitigating biases that may arise from differing feature magnitudes, this method enhances the overall performance and predictive accuracy of the model.
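In code, Equation (4) corresponds directly to scikit-learn's MinMaxScaler, as in the following sketch (continuing from the outlier-removal step above):

```python
from sklearn.preprocessing import MinMaxScaler

# Column-wise application of Equation (4): x' = (x - x_min) / (x_max - x_min).
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X_inliers)   # X_inliers: output of the iForest step
```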

2.4. Prediction Models

In this study, a hybrid network architecture has been developed that integrates a multi-scale CNN with LSTM networks, augmented by a multi-head self-attention mechanism. The following sections will provide a comprehensive analysis of these network components and their synergistic interplay.

2.4.1. CNN

CNNs represent a sophisticated class of deep neural networks distinguished by their convolutional architecture, which significantly minimizes the memory requirements of deep network implementations. In a CNN, local features from the image data are methodically extracted through convolutional layers, while pooling layers are employed to reduce the spatial dimensions of these features, thereby preserving essential information while decreasing the computational complexity. Ultimately, fully connected layers are utilized to execute classification or regression tasks. The architecture of the CNN model applied in this study is depicted in Figure 5.
The convolutional layer employs filters, or convolutional kernels, to extract feature information from specific localized regions within the input data. These filters consist of weight matrices that encode particular features. As these filters traverse the input data, they produce feature maps, where each output neuron is exclusively connected to a defined local region of the preceding layer, referred to as the receptive field. To capture a diverse range of features, convolutional layers deploy multiple filters, with the number of filters in each layer corresponding to the number of output channels. Given that the convolution operation is fundamentally linear, activation functions—such as the rectified linear unit (ReLU)—are typically integrated within the convolutional layer to infuse nonlinearity into the network. The mathematical formulation of the convolution operation is presented as follows:
$Y_{i,j} = \sum_{m} \sum_{n} X_{i+m,\, j+n} \cdot K_{m,n} + b$ (5)
where $Y_{i,j}$ denotes the value of the feature map resulting from the convolution operation, $X_{i+m,\, j+n}$ represents the value within a specific local region of the input feature map, $K_{m,n}$ corresponds to the weights of the convolutional kernel, and $b$ is the bias term.
The pooling layer performs regional downsampling on the input signal, thereby achieving the dimensionality reduction of the extracted features and enabling the capture of broader patterns. This process not only diminishes the spatial dimensions of the feature maps but also augments the model's resilience to minor positional variations of the features, thereby enhancing its robustness. The mathematical expressions for both the max pooling and average pooling operations are provided below:
$Y_{i,j} = \max_{0 \le m < p,\ 0 \le n < p} X(i \cdot s + m,\ j \cdot s + n)$ (6)
$Y_{i,j} = \frac{1}{p^2} \sum_{0 \le m < p,\ 0 \le n < p} X(i \cdot s + m,\ j \cdot s + n)$ (7)
where $X(i \cdot s + m,\ j \cdot s + n)$ represents the values within the pooling window, $p$ denotes the size of the pooling window, $s$ denotes the stride, which specifies the step length by which the window moves across the input feature map, and $Y_{i,j}$ corresponds to the value in the feature map after pooling.
The fully connected layer represents the final critical component of a CNN, responsible for mapping the extracted features to the ultimate output space, wherein every input neuron is connected to every output neuron. This layer commonly employs the softmax function to produce the final output for classification tasks or to derive predictions for regression tasks.
Over the past decade, CNNs have emerged as the standard paradigm for a wide array of computer vision and machine learning tasks. To adapt the CNN to one-dimensional signal processing, the concept of a one-dimensional CNN (1D CNN) was introduced. This innovation has demonstrated state-of-the-art performance across various applications, including personalized biomedical data classification and early diagnosis, structural health monitoring, and the detection and recognition of anomalies in power electronics and electric motor fault detection [33]. Unlike the traditional CNN, the filters in the 1D CNN are designed to move along a single axis, necessitating only straightforward array operations for processing one-dimensional signals [34].
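As a concrete illustration, a 1D convolutional block of the kind described above can be sketched in TensorFlow/Keras as follows; the layer sizes and the treatment of the eight tabular features as a length-8 sequence are assumptions for demonstration, not the paper's exact configuration.

```python
from tensorflow.keras import layers, models

model_1d = models.Sequential([
    # Convolution extracts local patterns along the single feature axis.
    layers.Conv1D(128, kernel_size=3, activation="relu",
                  padding="same", input_shape=(8, 1)),
    layers.MaxPooling1D(pool_size=2),   # downsampling, Equations (6)/(7)
    layers.Flatten(),
    layers.Dense(1),                    # regression output: CO2 emissions (g/km)
])
```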

2.4.2. LSTM

Recurrent neural networks (RNNs) encounter significant challenges, such as short-term memory limitations and gradient explosion, when handling long sequences. The advent of LSTM networks has effectively mitigated these issues. LSTM networks introduce a sophisticated mechanism comprising forget gates, input gates, and output gates to precisely regulate the flow of information. This design empowers the network to retain and utilize critical information across extended temporal spans, thus enhancing its ability to capture long-term dependencies. As a result, LSTM networks exhibit greater stability and efficiency in the processing and prediction of long-sequence data. The comprehensive architecture of the LSTM is illustrated in Figure 6.
The forget gate is responsible for determining which information from the previous cell state $C_{t-1}$ should be discarded or retained. The output of the forget gate is a vector with elements in the range [0, 1], where each element indicates the retention level of the corresponding information. The mathematical formulation is as follows:
$f_t = \sigma(W_f \cdot [h_{t-1}, X_t] + b_f)$ (8)
where $f_t$ represents the forget gate's output, $\sigma$ denotes the sigmoid activation function, $W_f$ is the weight matrix associated with the forget gate, $h_{t-1}$ is the hidden state from the preceding time step, $X_t$ is the input at the current time step, and $b_f$ is the bias term for the forget gate.
The input gate regulates the extent to which the current input $X_t$ influences the update of the cell state. This gate consists of two primary components: one computes the candidate values for the memory cell, and the other determines which of these candidate values will be incorporated into the cell state. The mathematical formulation governing this process is described by Equations (9) and (10):
$i_t = \sigma(W_i \cdot [h_{t-1}, X_t] + b_i)$ (9)
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, X_t] + b_C)$ (10)
where $i_t$ represents the activation of the input gate, and $\tilde{C}_t$ denotes the candidate cell state. $W_i$ and $W_C$ are the weight matrices associated with the input gate and candidate cell state, respectively, while $b_i$ and $b_C$ are their corresponding bias terms. The hyperbolic tangent function $\tanh$ is used to generate the candidate cell state, effectively constraining its values within the range $[-1, 1]$.
The cell state $C_t$ is a central element of the LSTM architecture, encompassing the network's long-term memory across the entire sequence. The update mechanism for the cell state integrates the effects of both the forget gate and the input gate. The formal expression for the cell state update is given by the following equation:
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (11)
where $C_t$ represents the cell state at the current time step, $C_{t-1}$ denotes the cell state from the preceding time step, and $\odot$ signifies element-wise multiplication (the Hadamard product). This formulation ensures that the cell state is updated by combining the retained information from the previous state with new information derived from the current input.
The output gate governs the computation of the hidden state $h_t$ at the current time step and dictates the information transmitted to the subsequent time step. This gate combines the current cell state with the control signals from the output gate to generate the hidden state. The mathematical formulation for this process is expressed in Equation (12), while the hidden state is given by Equation (13):
$o_t = \sigma(W_o \cdot [h_{t-1}, X_t] + b_o)$ (12)
$h_t = o_t \odot \tanh(C_t)$ (13)
where $o_t$ represents the output of the output gate, $h_t$ denotes the hidden state and output at the current time step, $W_o$ is the weight matrix associated with the output gate, and $b_o$ is the corresponding bias term.
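To make the gate interplay concrete, the following NumPy sketch transcribes Equations (8)–(13) as a single LSTM cell step; the weight and bias shapes are assumed to match a concatenated $[h_{t-1}, X_t]$ input vector.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    """One LSTM time step following Equations (8)-(13)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, X_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate, Eq. (8)
    i_t = sigmoid(W_i @ z + b_i)             # input gate, Eq. (9)
    c_tilde = np.tanh(W_C @ z + b_C)         # candidate state, Eq. (10)
    c_t = f_t * c_prev + i_t * c_tilde       # cell state update, Eq. (11)
    o_t = sigmoid(W_o @ z + b_o)             # output gate, Eq. (12)
    h_t = o_t * np.tanh(c_t)                 # hidden state, Eq. (13)
    return h_t, c_t
```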

2.4.3. Multi-Head Self-Attention

The attention mechanism was first introduced by Bahdanau et al. [35], marking a significant advancement in the ability to model long-range dependencies. The mechanism’s success across a variety of tasks highlighted its potential in capturing intricate relationships within sequences. Building on this foundation, Vaswani et al. [36] advanced the field by introducing the self-attention and multi-head self-attention mechanisms, which enhanced the capacity to model complex interactions within sequences more effectively.
In the multi-head self-attention mechanism, the input sequence $X$ undergoes a linear transformation to generate the query matrix $Q$, key matrix $K$, and value matrix $V$. This process is formally expressed as follows:
$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$ (14)
where $W_Q$, $W_K$, and $W_V$ represent the trainable weight matrices. The subsequent step involves calculating the similarity scores between the query and key matrices using the scaled dot-product attention mechanism, thereby producing the attention weight matrix:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ (15)
where the scaling factor $\sqrt{d_k}$ is introduced to mitigate the risk of the dot-product values becoming disproportionately large. Following this, multiple attention heads are configured, allowing for parallel computation of the attention across different subspaces:
$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$ (16)
where $i$ denotes the index of the $i$-th attention head. The outputs of these attention heads are subsequently concatenated and subjected to a linear transformation:
$O = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)W_O$ (17)
where $W_O$ refers to the linear mapping matrix applied after concatenation, resulting in the final output matrix $O$.
The multi-head self-attention mechanism, integral to the Transformer architecture, is fundamentally designed to capture intricate dependencies within sequential data. By conducting multiple self-attention operations in parallel across distinct subspaces, this mechanism empowers the model to attend to a wide array of feature patterns. Consequently, it significantly enhances the model’s ability to discern and interpret positional information embedded within the sequence, thereby improving its overall understanding of the data’s underlying structure.
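The following NumPy sketch illustrates Equations (14)–(17); the randomly initialized projection matrices stand in for trainable weights and are assumptions made purely for demonstration.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention, Equation (15)."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, heads=3, d_k=64, seed=0):
    """Naive multi-head self-attention, Equations (14), (16), and (17)."""
    rng = np.random.default_rng(seed)
    d_model = X.shape[-1]
    outs = []
    for _ in range(heads):
        # Random placeholder projections for Q, K, V (Eq. 14).
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        outs.append(attention(X @ W_Q, X @ W_K, X @ W_V))  # head_i, Eq. (16)
    W_O = rng.normal(size=(heads * d_k, d_model))
    return np.concatenate(outs, axis=-1) @ W_O             # Concat + W_O, Eq. (17)
```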

2.4.4. CNN-LSTM

The integration of CNNs and RNNs has led to substantial research advancements across various fields, with particularly notable success in domains such as speech recognition [37]. Addressing the traditional CNN model’s dependency on large training datasets, this study introduces LSTM units into the CNN framework. Initially, the data undergoes processing through convolutional and pooling layers, which are responsible for extracting critical features. These extracted features are subsequently passed to the LSTM layers, which further analyze and uncover latent temporal dynamics, thereby enabling the more thorough and nuanced capture of contextual information within sequential data. The architecture of the model is illustrated in Figure 7.
The CNN-LSTM model adeptly captures the spatial and temporal characteristics inherent in the input data, thereby enabling precise prediction of CO2 emissions.
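A minimal Keras sketch of such a CNN-LSTM pipeline is shown below; the specific layer sizes are illustrative assumptions.

```python
from tensorflow.keras import layers, models

cnn_lstm = models.Sequential([
    layers.Conv1D(128, 3, activation="relu", padding="same",
                  input_shape=(8, 1)),        # spatial feature extraction
    layers.MaxPooling1D(2),
    layers.LSTM(100, return_sequences=True),  # latent temporal dynamics
    layers.LSTM(100),
    layers.Dense(1),                          # CO2 emission regression
])
```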

2.4.5. MSCL-Attention

The MSCL-Attention model seamlessly integrates the multi-scale CNN, LSTM, and multi-head self-attention mechanisms, enabling the efficient extraction and fusion of features from the input data for enhanced predictive performance. The network architecture is depicted in Figure 8.
In the CNN module, three distinct scales of feature extraction are implemented. The first approach employs direct max pooling to preserve the global features of the data. The second and third approaches utilize convolutional and pooling layers of varying depths to ensure the comprehensive extraction of both global and local features across different levels. These multi-scale feature outputs are concatenated to form a unified representation, which is subsequently refined through a fully connected layer to enhance the feature extraction.
To effectively capture the temporal dependencies inherent in the input sequence, a three-layer stacked LSTM network is employed. The first two LSTM layers retain sequential information, while the final layer produces the definitive temporal feature representation. This feature representation is then integrated with the features extracted from the CNN module, combining spatial and temporal information.
Following this, a multi-head self-attention mechanism is applied to the fused features, enabling the model to capture more complex dependencies. This attention mechanism enhances the model’s capacity to comprehend global information by selectively attending to distinct feature patterns across multiple subspaces, thereby improving its understanding of the input sequence.
The predictive capability of deep learning models is predominantly determined by their hyperparameters and architectural design. These components regulate the model’s complexity, learning efficiency, and capacity for generalization. Table 3 outlines the hyperparameters employed in the MSCL-Attention model, which play a critical role in optimizing its performance.
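Below is a sketch of how the MSCL-Attention architecture could be assembled in TensorFlow/Keras from Figure 8 and Table 3. Wiring details not stated in the paper, such as stride-1 pooling to keep the branch lengths aligned, channel-wise concatenation for fusion, and the global-pooling regression head, are our assumptions.

```python
from tensorflow.keras import layers, models, optimizers

inp = layers.Input(shape=(8, 1))                 # 8 features as a 1D sequence

# Multi-scale CNN module: three branches of increasing depth
# (stride-1 pooling keeps every branch at the same sequence length).
b1 = layers.MaxPooling1D(2, strides=1, padding="same")(inp)            # global features
b2 = layers.Conv1D(128, 3, activation="relu", padding="same")(inp)     # Conv Layer 1
b2 = layers.MaxPooling1D(2, strides=1, padding="same")(b2)
b3 = layers.Conv1D(256, 3, activation="relu", padding="same")(inp)     # Conv Layer 2_1
b3 = layers.Conv1D(512, 3, activation="relu", padding="same")(b3)      # Conv Layer 2_2
b3 = layers.MaxPooling1D(2, strides=1, padding="same")(b3)
cnn = layers.Concatenate()([b1, b2, b3])
cnn = layers.Dense(64, activation="relu")(cnn)   # refine the fused multi-scale features

# Three-layer stacked LSTM for temporal dependencies.
lstm = layers.LSTM(100, return_sequences=True)(inp)
lstm = layers.LSTM(100, return_sequences=True)(lstm)
lstm = layers.LSTM(200, return_sequences=True)(lstm)

# Fuse spatial and temporal features, then apply multi-head self-attention.
fused = layers.Concatenate()([cnn, lstm])
att = layers.MultiHeadAttention(num_heads=3, key_dim=64)(fused, fused)
out = layers.Dense(1)(layers.GlobalAveragePooling1D()(att))

model = models.Model(inp, out)
model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),   # LR from Table 3
              loss="mean_squared_error")
```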

3. Results

3.1. Evaluation Metrics

To rigorously evaluate the performance of our predictive model, we have employed four critical metrics: mean squared error (MSE), mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R2). These metrics collectively offer a robust assessment of the model’s predictive accuracy and its capacity to generalize to unseen data. The formulas for these metrics are as follows:
$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (18)
$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$ (19)
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$ (20)
$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$ (21)
where $y_i$ denotes the actual values, $\hat{y}_i$ represents the predicted carbon dioxide emissions, $n$ signifies the sample size, and $\bar{y}$ indicates the mean of the actual values. Through the analysis of these metrics, a more nuanced understanding of the model's strengths and weaknesses can be achieved, thereby enabling informed decisions regarding model selection and refinement. Lower values of the MAE, MSE, and RMSE correspond to reduced predictive errors, signifying enhanced predictive performance. The MAE quantifies the average absolute value of the prediction errors, while the MSE is particularly sensitive to larger discrepancies, and the RMSE offers an error metric consistent with the scale of the data. An $R^2$ value approaching 1 reflects a superior fit of the model to the data, indicating its capacity to explain a greater proportion of the variability in the target variable.
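For convenience, Equations (18)–(21) can be computed with scikit-learn as in the following sketch.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Compute the four metrics of Equations (18)-(21)."""
    mse = mean_squared_error(y_true, y_pred)
    return {"MSE": mse,
            "MAE": mean_absolute_error(y_true, y_pred),
            "RMSE": float(np.sqrt(mse)),
            "R2": r2_score(y_true, y_pred)}
```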

3.2. Training Result

We have selected the variables Engine Size (L), Cylinders, Transmission, Fuel Type, Fuel Consumption City (L/100 km), Fuel Consumption Hwy (L/100 km), Fuel Consumption Comb (L/100 km), and Fuel Consumption Comb (mpg) as inputs for the model. Notably, the categorical features Transmission and Fuel Type necessitate encoding. The experiments were conducted in a Python 3.8 environment, with the model training performed on an NVIDIA GeForce RTX 3090 GPU.
In the experimental investigation, a comparative analysis was conducted among the CNN, LSTM, CNN-LSTM, and MSCL-Attention models. Figure 9 delineates the progression of the loss metrics for these models across both the training and validation datasets. The dataset was divided into training and testing subsets in a 7:3 ratio, and the “mean_squared_error” was employed as the loss function.
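A corresponding training-setup sketch is given below, matching the 7:3 split and mean-squared-error loss described above; the epoch count and random seed are assumptions.

```python
from sklearn.model_selection import train_test_split

# X_scaled: normalized feature matrix from Section 2.3; y: CO2 emission target;
# model: the compiled MSCL-Attention sketch from Section 2.4.5.
X_seq = X_scaled.reshape(-1, X_scaled.shape[1], 1)   # (samples, features, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X_seq, y, test_size=0.3, random_state=42)        # 7:3 split

history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=100, batch_size=16, verbose=0)
```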
Furthermore, Table 4 presents the evaluation metrics for each model during the training phase, with each metric representing the average of five experimental trials. The comparative analysis indicates that the CNN model outperforms the LSTM model. In addition, the CNN-LSTM model, which integrates CNN and LSTM architectures, demonstrates a significant performance improvement over the individual CNN and LSTM models. This hybrid model exhibits superior error reduction and enhanced fitting capabilities, achieving an R2 value of 0.9946, which reflects a higher degree of predictive accuracy.
The MSCL-Attention model, incorporating multi-scale feature fusion and attention mechanisms, delivers the most outstanding performance. It attains an MAE of 0.000974, with the MSE further reduced to 0.000144 and the RMSE decreasing to 0.01201. These results underscore the model’s advanced capacity for error minimization and precision enhancement. Moreover, the MSCL-Attention model achieves an R2 value of 0.9972, indicating that it captures nearly the entirety of the target variable’s variance, thereby further validating its exceptional efficacy in CO2 emission prediction.

3.3. Testing Result

Figure 10 presents a comparative analysis of the predicted versus actual values for each model on the test set. In the scatter plot, blue points denote the actual values, while red points represent the model predictions. The figure reveals that the MSCL-Attention model achieves superior fitting performance. The predictions of the MSCL-Attention model are notably close to the actual values, underscoring its exceptional accuracy in predicting CO2 emissions when evaluated on the test data. This observation further validates the accuracy and robustness of the MSCL-Attention model in the domain of automotive CO2 emission forecasting.
Table 5 provides additional validation of the performance metrics for each model, with each metric representing the mean of five experimental trials. Although the CNN model exhibits superior performance during the training phase, it shows a slight increase in errors on the test set, suggesting potential limitations in its generalization capabilities when faced with unseen data. The LSTM model demonstrates comparable performance to the CNN model on the test set. The CNN-LSTM model effectively mitigates the prediction errors on the test set, enhancing the model’s generalization capabilities while retaining a high level of predictive accuracy.
The MSCL-Attention model exhibits the most exceptional performance on the test set. It achieves an MAE of 0.008283, an MSE of 0.0001374, and an RMSE of 0.01172, with an R2 value of 0.9971. These metrics indicate that the MSCL-Attention model not only demonstrates outstanding performance during the training phase but also exhibits robust generalization capabilities and high predictive accuracy on the test set. This further substantiates the model’s efficacy in managing complex data tasks.
In this experiment, the MSCL-Attention model exhibited superior predictive performance attributable to its integration of multi-scale feature fusion and an attention mechanism, significantly surpassing the performance of the alternative models. While the CNN demonstrated exceptional capabilities during the training phase, it exhibited limitations in generalization during the testing phase. Conversely, the LSTM network effectively addresses time series data; however, its overall performance was comparatively suboptimal. The CNN-LSTM model, which synthesizes the advantages of both methodologies, ultimately enhanced the predictive performance.

4. Discussion

Automobile exhaust emissions represent a significant threat to both air quality and global climate stability. The pollutants present in these emissions not only contribute to the formation of smog and exacerbate air pollution but also precipitate a range of respiratory ailments. Additionally, the considerable quantities of carbon dioxide emitted from the combustion of fossil fuels in vehicles intensify global warming, leading to adverse effects such as climate change, an increase in extreme weather events, and rising sea levels. These phenomena pose profound risks to ecological systems and human societies. Consequently, accurate prediction of emission levels is imperative for devising effective strategies to mitigate the adverse impacts of automobile exhaust.
The MSCL-Attention model introduced in this study combines a multi-scale CNN, LSTM networks, and multi-head self-attention mechanisms to offer a highly accurate and robust solution for carbon dioxide emission forecasting. By leveraging feature extraction and fusion across multiple scales, the model adeptly captures intricate patterns in vehicular emission behavior. The LSTM component addresses both short-term and long-term dependencies within temporal data sequences, while the multi-head self-attention mechanism enhances the predictive accuracy by selectively attending to critical features. This integrative approach not only improves the model precision but also strengthens its capability to handle complex emission prediction tasks.
In discussing the findings of this study, it is essential to compare them with other recent research of a similar nature to provide a comprehensive assessment of the MSCL-Attention model. We compared the predictive performance of the MSCL-Attention model with that of other models by examining the RMSE values for the carbon dioxide emission predictions. The results are presented in Table 6.
The experimental results demonstrate that the MSCL-Attention model surpasses the traditional LSTM, BiLSTM, and other prevalent machine learning methodologies in the prediction of carbon dioxide emissions, while also exhibiting a reduced parameter count compared to the X-MARL model. The model’s superiority is particularly manifested in the following dimensions:
(1)
Multi-Scale Feature Extraction: In contrast to single-scale convolutional neural networks, the MSCL-Attention model utilizes multi-scale convolutions to extract features across varying levels, thereby enabling the more comprehensive capture of cross-scale information inherent in vehicular emission behaviors and enhancing the model’s generalization capabilities.
(2)
Temporal Dependency Management: The LSTM architecture is proficient in modeling both long-term and short-term dependencies within time series data, demonstrating significantly stronger performance relative to traditional machine learning approaches (such as ICSO-SVM and RF-DPSO-BP) in the context of complex time series datasets.
(3)
Integration of Self-Attention Mechanism: The incorporation of a multi-head self-attention mechanism amplifies the model’s focus on critical information by effectively identifying salient features within the data. This capability allows the model to assign varying weights to different features during prediction, thereby enhancing the predictive accuracy. Such a mechanism is more adept at capturing the dynamic fluctuations in vehicular emissions than conventional time series models (including LSTM and BiLSTM).

5. Conclusions

The prediction of vehicular carbon dioxide emissions not only enables automobile manufacturers to optimize engine design, transmission systems, and fuel efficiency during the vehicle development phase—thereby facilitating the production of more energy-efficient and low-emission automotive products—but also assists nations in fulfilling their emission reduction commitments as stipulated in international climate agreements, thereby fostering global cooperation in combating climate change.
In the realm of automotive design, manufacturers can utilize the comprehensive predictive data provided by the model to refine vehicle specifications, such as modifying engine configurations, fuel systems, and transmission designs, to reduce emissions and enhance fuel efficiency. Moreover, the model supports the development and evaluation of innovative technologies, including advanced combustion techniques and improved energy utilization, enabling manufacturers to introduce high-performance vehicles that comply with environmental standards.
From an environmental policy standpoint, the predictive capabilities of the model can aid policymakers in establishing more stringent emission standards and targets, while concurrently optimizing relevant policies to ensure their efficacy in real-world contexts. Additionally, the model facilitates comprehensive assessments and long-term planning of environmental policies, thereby supporting the monitoring and adjustment of policy measures to advance sustainable development objectives. By contributing to the reduction of greenhouse gas emissions, mitigating climate change, and safeguarding the environment, the model holds substantial significance in achieving carbon reduction goals within the transportation sector.
The MSCL-Attention model introduced in this study presents a novel and efficient approach for the prediction of vehicular carbon dioxide emissions. This model synergistically integrates multi-scale convolutional (MSCL) networks, long short-term memory (LSTM) networks, and self-attention mechanisms, exhibiting superior efficacy in capturing intricate emission characteristics relative to existing models. The experimental findings reveal that the MSCL-Attention model excels in extracting latent spatiotemporal relationships when processing multidimensional data, thereby significantly enhancing the predictive accuracy and underscoring its considerable potential in the realm of automotive emission forecasting.
Nonetheless, the model exhibits certain limitations. First, the training dataset employed in this study originates from publicly accessible sources, with a relatively narrow scope of features and vehicle types, thus failing to capture the full diversity of vehicles across different countries, regions, and models on a global scale. Consequently, the model’s generalization capabilities have not yet been thoroughly validated on broader datasets. Furthermore, the computational efficiency of the model warrants improvement, particularly when applied to large-scale datasets, where the computational overhead may become significant. Future research should prioritize enhancing the model’s computational efficiency while incorporating more diverse and comprehensive datasets to rigorously assess its global applicability. Such advancements would offer more robust support for reducing vehicular emissions, improving air quality, and addressing the challenges of global climate change.

Author Contributions

Conceptualization, Y.X.; validation, Z.H.; writing—original draft preparation, Y.X.; writing—review and editing, L.L. and J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kabir, M.; Habiba, U.E.; Khan, W.; Shah, A.; Rahim, S.; Rios-Escalante, P.R.D.l.; Farooqi, Z.-U.-R.; Ali, L.; Shafiq, M. Climate change due to increasing concentration of carbon dioxide and its impacts on environment in 21st century; a mini review. J. King Saud Univ. Sci. 2023, 35, 102693.
2. Gao, H.; Wang, X.; Wu, K.; Zheng, Y.; Wang, Q.; Shi, W.; He, M. A Review of Building Carbon Emission Accounting and Prediction Models. Buildings 2023, 13, 1617.
3. Liu, W.; Cai, D.; Nkou, J.J.N.; Liu, W.; Huang, Q. A Survey of Carbon Emission Forecasting Methods Based on Neural Networks. In Proceedings of the 5th Asia Energy and Electrical Engineering Symposium (AEEES), Chengdu, China, 23–26 March 2023; pp. 1546–1551.
4. Wang, Q.; Li, S.; Li, R.; Jiang, F. Underestimated impact of the COVID-19 on carbon emission reduction in developing countries-A novel assessment based on scenario analysis. Environ. Res. 2022, 204, 111990.
5. IEA. Energy Statistics Data Browser. 2023. Available online: https://www.iea.org/data-and-statistics/data-tools/energy-statistics-data-browser (accessed on 19 August 2024).
6. Li, W.; Bao, L.; Li, Y.; Si, H.; Li, Y. Assessing the transition to low-carbon urban transport: A global comparison. Resour. Conserv. Recycl. 2022, 180, 106179.
7. Li, J.; Wang, P.; Ma, S. The impact of different transportation infrastructures on urban carbon emissions: Evidence from China. Energy 2024, 295, 131041.
8. Han, Y.; Li, H.; Liu, J.; Xie, N.; Jia, M.; Sun, Y.; Wang, S. Life cycle carbon emissions from road infrastructure in China: A region-level analysis. Transp. Res. Part D Transp. Environ. 2023, 115, 103581.
9. Xu, M.; Weng, Z.; Xie, Y.; Chen, B. Environment and health co-benefits of vehicle emission control policy in Hubei, China. Transp. Res. Part D Transp. Environ. 2023, 120, 103773.
10. Li, T.; Yang, H.-L.; Xu, L.-T.; Zhou, Y.-T.; Min, Y.-J.; Yan, S.-C.; Zhang, Y.-H.; Wang, X.-M. Comprehensive treatment strategy for diesel truck exhaust. Environ. Sci. Pollut. Res. 2023, 30, 54324–54332.
11. Geng, Y.; Cao, Y.; Zhao, Q.; Li, Y.; Tian, S. Potential hazards associated with interactions between diesel exhaust particulate matter and pulmonary surfactant. Sci. Total Environ. 2022, 807, 151031.
12. Harishkumar, K.S.; Yogesh, K.M.; Gad, I. Forecasting Air Pollution Particulate Matter (PM2.5) Using Machine Learning Regression Models. Procedia Comput. Sci. 2020, 171, 2057–2066.
13. Deng, J.L. Control-problems of grey systems. Syst. Control Lett. 1982, 1, 288–294.
14. Liao, J. Prediction method of urban traffic carbon emission reduction rate based on grey relational analysis. In Proceedings of the International Conference on Smart Transportation and City Engineering, Chongqing, China, 26–28 October 2021.
15. Wyatt, D.W.; Li, H.; Tate, J.E. The impact of road grade on carbon dioxide (CO2) emission of a passenger vehicle in real-world driving. Transp. Res. Part D Transp. Environ. 2014, 32, 160–170.
16. Abdul-Wahab, S.A.; Al-Rubiei, R.; Al-Shamsi, A. A statistical model for predicting carbon monoxide levels. Int. J. Environ. Pollut. 2003, 19, 209–224.
17. Singh, S.; Kennedy, C. Estimating future energy use and CO2 emissions of the world's cities. Environ. Pollut. 2015, 203, 271–278.
18. Li, C.; Zhang, Z.; Wang, L. Carbon peak forecast and low carbon policy choice of transportation industry in China: Scenario prediction based on STIRPAT model. Environ. Sci. Pollut. Res. 2023, 30, 63250–63271.
19. Zhao, Y.; Liu, R.; Liu, Z.; Liu, L.; Wang, J.; Liu, W. A Review of Macroscopic Carbon Emission Prediction Model Based on Machine Learning. Sustainability 2023, 15, 6876.
20. Chen, Z.; Liu, L.; Li, C. Prediction and Control of Carbon Emissions of Electric Vehicles Based on BP Neural Network under Carbon Neutral Background. In Proceedings of the 2021 International Conference on Neural Networks, Information and Communication Engineering, Qingdao, China, 15 October 2021; Volume 11933, pp. 15–22.
21. Natarajan, Y.; Wadhwa, G.; Preethaa, K.R.S.; Paul, A. Forecasting Carbon Dioxide Emissions of Light-Duty Vehicles with Different Machine Learning Algorithms. Electronics 2023, 12, 2288.
22. Jiang, Z.; Wu, L.; Niu, H.; Jia, Z.; Qi, Z.; Liu, Y.; Zhang, Q.; Wang, T.; Peng, J.; Mao, H. Investigating the impact of high-altitude on vehicle carbon emissions: A comprehensive on-road driving study. Sci. Total Environ. 2024, 918, 170671.
23. Harishkumar, K.S.; Gad, I.; Yogesh, K.M. Spatio-Temporal Clustering Analysis for Air Pollution Particulate Matter (PM2.5) Using a Deep Learning Model. In Proceedings of the 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), Greater Noida, India, 19–20 February 2021; pp. 529–535.
24. Seo, J.; Park, S. Optimizing model parameters of artificial neural networks to predict vehicle emissions. Atmos. Environ. 2023, 294, 119508.
25. Hien, N.L.H.; Kor, A.-L. Analysis and Prediction Model of Fuel Consumption and Carbon Dioxide Emissions of Light-Duty Vehicles. Appl. Sci. 2022, 12, 803.
26. Al-Nefaie, A.H.H.; Aldhyani, T.H.H. Predicting CO2 Emissions from Traffic Vehicles for Sustainable and Smart Environment Using a Deep Learning Model. Sustainability 2023, 15, 7615.
27. Jin, Y.; Sharifi, A.; Li, Z.; Chen, S.; Zeng, S.; Zhao, S. Carbon emission prediction models: A review. Sci. Total Environ. 2024, 927, 172319.
28. Meng, M.; Niu, D.; Shang, W. A small-sample hybrid model for forecasting energy-related CO2 emissions. Energy 2014, 64, 673–677.
29. Liu, B.; Wang, S.; Liang, X.; Han, Z. Carbon emission reduction prediction of new energy vehicles in China based on GRA-BiLSTM model. Atmos. Pollut. Res. 2023, 14, 101865.
30. Fei, X.; Lai, Z.; Fang, Y.; Ling, Q. A dual attention-based fusion network for long- and short-term multivariate vehicle exhaust emission prediction. Sci. Total Environ. 2023, 860, 160490.
31. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation Forest. In Proceedings of the 8th IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008.
32. Yao, C.F.; Ma, X.Q.; Chen, B.; Zhao, X.S.; Bai, G. Distribution Forest: An Anomaly Detection Method Based on Isolation Forest. In Proceedings of the 13th International Symposium on Advanced Parallel Processing Technologies (APPT), Tianjin, China, 15–16 August 2019; pp. 135–147.
33. Kiranyaz, S.; Avci, O.; Abdeljaber, O.; Ince, T.; Gabbouj, M.; Inman, D.J. 1D convolutional neural networks and applications: A survey. Mech. Syst. Signal Process. 2021, 151, 107398.
34. Liu, L.Y.; Si, Y.W. 1D convolutional neural networks for chart pattern classification in financial time series. J. Supercomput. 2022, 78, 14191–14214.
35. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2014, arXiv:1409.0473.
36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
37. Ren, H.Q.; Wang, W.Q.; Qu, X.W.; Cai, Y.Q. A new hybrid-parameter recurrent neural network for online handwritten Chinese character recognition. Pattern Recognit. Lett. 2019, 128, 400–406.
38. Li, X.Q.; Zhang, X.X. A comparative study of statistical and machine learning models on carbon dioxide emissions prediction of China. Environ. Sci. Pollut. Res. 2023, 30, 117485–117502.
39. Wen, L.; Cao, Y. Influencing factors analysis and forecasting of residential energy-related CO2 emissions utilizing optimized support vector machine. J. Clean. Prod. 2020, 250, 119492.
40. Wen, L.; Yuan, X.Y. Forecasting CO2 emissions in China's commercial department, through BP neural network based on random forest and PSO. Sci. Total Environ. 2020, 718, 137194.
41. Liu, Y.L.; Tang, C.L.; Zhou, A.Y.; Yang, K. A novel ensemble approach for road traffic carbon emission prediction: A case in Canada. In Environment, Development and Sustainability; Springer: Berlin/Heidelberg, Germany, 2024.
Figure 1. CO2 emissions by sector, world, 1990–2021.
Figure 2. The flow chart of this study.
Figure 3. Boxplot analysis.
Figure 4. Correlation matrix heatmap.
Figure 5. CNN structure.
Figure 6. LSTM structure.
Figure 7. CNN-LSTM structure.
Figure 8. MSCL-Attention structure.
Figure 9. Training and validation loss. (a) CNN. (b) LSTM. (c) CNN-LSTM. (d) MSCL-Attention.
Figure 10. Actual vs. predicted values increasing IDs. (a) CNN. (b) LSTM. (c) CNN-LSTM. (d) MSCL-Attention.
Table 1. The specific characteristics of the dataset.
Variable | Type
Make | Object
Model | Object
Vehicle Class | Object
Engine Size (L) | Float64
Cylinders | Int64
Transmission | Object
Fuel Type | Object
Fuel Consumption City (L/100 km) | Float64
Fuel Consumption Hwy (L/100 km) | Float64
Fuel Consumption Comb (L/100 km) | Float64
Fuel Consumption Comb (mpg) | Int64
CO2 Emissions (g/km) | Int64
Table 2. Statistical analysis of the numerical features.
Feature | Count | Mean | Std | Min | 50% | Max
Engine Size (L) | 7385 | 3.160068 | 1.354170 | 0.9 | 3.0 | 8.4
Cylinders | 7385 | 5.615030 | 1.828307 | 3.0 | 6.0 | 16.0
Fuel Consumption City (L/100 km) | 7385 | 12.556534 | 3.500274 | 4.2 | 12.1 | 30.6
Fuel Consumption Hwy (L/100 km) | 7385 | 9.041706 | 2.224456 | 4.0 | 8.7 | 20.6
Fuel Consumption Comb (L/100 km) | 7385 | 10.975071 | 2.892506 | 4.1 | 10.6 | 26.1
Fuel Consumption Comb (mpg) | 7385 | 27.481652 | 7.231879 | 11.0 | 27.0 | 69.0
CO2 Emissions (g/km) | 7385 | 250.584699 | 58.512679 | 96.0 | 246.0 | 522.0
Table 3. MSCL-Attention model parameters.
Layer | Hyperparameters
Conv Layer 1 | Filters: 128, Kernel Size: 3
Conv Layer 2_1 | Filters: 256, Kernel Size: 3
Conv Layer 2_2 | Filters: 512, Kernel Size: 3
Max Pooling Layer | Pool Size: 2
LSTM Layer 1 | Units: 100
LSTM Layer 2 | Units: 100
LSTM Layer 3 | Units: 200
Multi-Head Attention | Heads: 3, Key Dimension: 64
Learning Rate | 0.0001
Batch Size | 16
Table 4. Model evaluation during the training phase.
Models | MAE | MSE | RMSE | R2
CNN | 0.01647 | 0.0004954 | 0.02225 | 0.9902
LSTM | 0.01868 | 0.0005648 | 0.02376 | 0.9888
CNN-LSTM | 0.01270 | 0.0002699 | 0.01642 | 0.9946
MSCL-Attention | 0.000974 | 0.000144 | 0.01201 | 0.9972
Table 5. Model evaluation during the testing phase.
Models | MAE | MSE | RMSE | R2
CNN | 0.02051 | 0.0006793 | 0.02606 | 0.9858
LSTM | 0.02141 | 0.0006618 | 0.02572 | 0.9862
CNN-LSTM | 0.01416 | 0.0003296 | 0.01816 | 0.9931
MSCL-Attention | 0.008283 | 0.0001374 | 0.01172 | 0.9971
Table 6. Comparison between the results of different models.
Model | RMSE
BiLSTM [26] | 0.03560 (R2 = 0.9355)
LSTM [38] | 0.0187 (R2 = 0.9844)
ICSO-SVM [39] | 0.4346
RF-DPSO-BP [40] | 1.66
X-MARL [41] | 0.01178 (R2 = 0.9956)
MSCL-Attention | 0.01172 (R2 = 0.99714)