1. Introduction
The concentration of carbon dioxide in the atmosphere has increased significantly over the past hundred years and is currently rising at a rate of about 2 ppm per year [1]. With the annual increase in atmospheric greenhouse gas emissions, global warming has become the most serious environmental problem faced by humankind [2]. Excessive CO2 emissions not only lead to extreme weather events such as droughts but also continue to impact natural ecosystems and socio-economic conditions [3]. To address the challenge of the greenhouse effect, the international community has established policies such as the Kyoto Protocol and the Paris Agreement, calling on all nations to work together to reduce carbon emissions [4].
According to data released by the International Energy Agency (IEA), the transportation industry has been one of the major sources of energy consumption and CO2 emissions from 1990 to 2021, as shown in Figure 1. In 2021, the global CO2 emissions from the transport sector reached 7631.5 Mt, accounting for 22.7% of global CO2 emissions and ranking second among emitting sectors worldwide [5]. Additionally, over the past few decades, CO2 emissions from the transport sector have grown the fastest, and this share is expected to increase to 41% by 2030 [6]. The construction of infrastructure such as highways, high-speed railways, and urban rail systems has directly increased energy consumption and carbon emissions by boosting transportation activities [7]. Furthermore, the entire lifecycle of transportation infrastructure is associated with the consumption of large amounts of fossil fuels [8]. Transportation produces not only atmospheric pollutants such as NOx, CO, and PM2.5 but also greenhouse gases such as CO2, CH4, and N2O [9]. These emissions can lead to smoggy weather, photochemical smog, and the greenhouse effect [10]. Prolonged exposure to pollutants can induce respiratory system inflammation, fibrosis, and adverse immune responses, posing a threat to human health [11]. The World Health Organization (WHO) estimates that air pollution accounts for approximately 6.9 million deaths globally. The latest update from the Global Burden of Disease (GBD) study identifies particulate matter (PM) as the fourth largest cause of mortality among 85 assessed risk factors, being responsible for over 5 million deaths in 2017 [12]. Therefore, reducing transportation emissions is of great importance for environmental protection and safeguarding human health.
With the rapid development of artificial intelligence technology, research on carbon emission prediction methods has also transitioned from traditional approaches to machine learning and deep learning. Traditional methods primarily involve statistical models, which are the most commonly used in carbon emission prediction. Gray system theory [13], specifically designed for systems with partial unknowns, small samples, and limited information, has been widely applied. As traditional gray prediction models have been continuously used and refined, many improved gray models have been proposed. Liao et al. [14] established a comprehensive prediction model for green traffic using the gray correlation analysis method. In addition, some methods rely on data sampling and GPS technology. Wyatt et al. [15] emphasized that to accurately estimate vehicle emissions in real-world driving, second-by-second road grade quantification is essential; they therefore proposed a simplified road grade estimation method based on Light Detection and Ranging (LiDAR) and a Geographic Information System (GIS) and used the PHEM emission model to estimate CO2 emissions. Classical regression methods have also been applied to carbon emission prediction, including stepwise multiple regression modeling [16] and linear regression [17]. Time series methods have likewise been used to predict future carbon emissions, such as the STIRPAT-based model designed by Li et al. [18] to project the carbon emissions of China's transportation sector over the next few decades. Although traditional methods are intuitive, they may struggle to comprehensively account for the various influencing factors in complex, high-dimensional carbon emission problems.
Machine learning and deep learning have emerged as prominent research areas in the development of carbon emission prediction models, owing to their high efficiency and accuracy [19]. The backpropagation neural network (BPNN) [20], trained using the error backpropagation algorithm, has seen increasing application in carbon emission forecasting and has achieved considerable success. However, despite its simplicity, the BPNN is susceptible to becoming trapped in local optima and exhibits limited capacity for handling large-scale, complex datasets. The support vector machine (SVM) [21], a powerful machine learning technique grounded in statistical learning theory, has been widely adopted for regression estimation. Nevertheless, SVMs are highly sensitive to the size of the dataset and often incur significant computational costs and extended training times when applied to large or high-dimensional datasets. Random forests (RFs) [22] have demonstrated strong performance in selecting influential factors, effectively identifying key determinants of carbon emissions. However, when applied to time series data, RFs are less effective than specialized temporal models. Moreover, Doreswam et al. [23] employed linear allocation and convolutional autoencoders to construct a clustering model for spatiotemporal analysis, enabling the identification of patterns and trends within datasets to investigate the distribution and impacts of air pollutants. Various neural network architectures have also been explored in this domain, including artificial neural networks (ANNs) [24], convolutional neural networks (CNNs) [25], and long short-term memory (LSTM) networks, as well as bidirectional LSTMs (BiLSTMs), which are capable of capturing temporal dependencies [26]. Despite the successes of these approaches, they present notable limitations in the context of carbon emission prediction. ANNs require high-quality data to achieve optimal performance, and while CNNs excel at capturing local features, their effectiveness diminishes when handling long sequences. LSTM and BiLSTM networks, though adept at modeling temporal dependencies, involve prolonged training times and demand substantial computational resources, and they remain prone to issues such as vanishing or exploding gradients.
In addition, combined prediction models are also of great significance in carbon emission prediction. These models integrate multiple individual prediction models into a unified model [27]. For instance, Meng et al. [28] combined the GM(1,1) prediction equation with a linear model to create a hybrid equation for predicting China's carbon emissions, a combination of two statistical methods; other hybrids pair statistical methods with machine learning, or combine multiple machine learning models. Liu et al. [29] developed a combined model based on gray relational analysis and bidirectional long short-term memory (GRA-BiLSTM) to predict carbon emissions from new energy vehicles. Fei et al. [30] proposed a hybrid deep learning framework that combines temporal convolutional networks (TCNs), CNNs, LSTM, and an autoregressive (AR) decomposition model for predicting long-term and short-term multivariate vehicle exhaust emissions. However, machine learning and ensemble methods rely heavily on the quality and quantity of the data; because meteorological factors and traffic flow are inherently unpredictable and imbalanced, vehicle exhaust emissions exhibit significant non-stationary and non-linear characteristics, so a more efficient and reliable approach to predicting vehicle carbon emissions is needed to help prevent air pollution. Moreover, while ensemble prediction models exhibit notable advantages in carbon emission forecasting, they are often characterized by a complex construction process, and their optimization and tuning present considerable challenges.
In summary, while machine learning and ensemble methods exhibit considerable potential in the realm of carbon emission forecasting, their efficacy is largely contingent upon the quality and quantity of the underlying data, which poses challenges when managing complex and imbalanced datasets. The unpredictable and non-stationary characteristics of meteorological variables and traffic flow contribute to significant non-linear dynamics in vehicular exhaust emissions, highlighting the limitations of existing models in addressing these complexities. Therefore, there is an urgent need for the development of a more efficient and robust predictive framework to effectively contend with the uncertainties inherent in automotive carbon emissions, thereby enhancing strategies to mitigate atmospheric pollution.
This study proposes a multi-scale feature fusion network based on a CNN, LSTM, and attention mechanisms (MSCL-Attention) for vehicle carbon emission prediction. The CNN is used for feature extraction, leveraging multiple convolutional kernels to capture vehicle features at different scales, such as the engine displacement and the number of cylinders. The LSTM layer captures the long-term dependencies in the time series data, identifying patterns in how vehicle features change over time. The multi-head attention mechanism further applies weighted processing to the extracted multi-scale features, highlighting key features and suppressing noise information. Experimental results from an actual vehicle carbon emission dataset demonstrate that this multi-scale feature fusion network achieves higher prediction accuracy compared to single models and effectively captures complex relationships between vehicle features.
The rest of this article is organized as follows. The second part mainly preprocesses and analyzes the dataset, and it introduces the methods used. The third part explains and discusses the results obtained from the model. In the fourth part, this method is evaluated and compared with other algorithms. Finally, the fifth part of this paper summarizes the research and looks forward to future research.
2. Materials and Methods
This study is divided into three key parts, as shown in
Figure 2, including data preprocessing, model design, and result analysis. The specific content is presented in the corresponding sections.
2.1. Dataset Description
The data used in this study were obtained from an open-source dataset on the Kaggle website. This dataset comes from the Canadian government's open data portal and provides detailed information on how vehicle CO2 emissions vary with different features. It contains seven years of data, from 2014 to 2020, with a total of 7385 rows and 12 columns. Table 1 shows the specific features of the dataset.
2.2. Data Analysis
First, we check whether there are any missing values in the dataset and whether the records are unique. We find that there are no null values and no duplicate entries. The feature variables are then analyzed visually by drawing boxplots. Boxplots reflect the central location and dispersion of one or more groups of continuous quantitative data. In this study, we plot a boxplot for each numerical feature variable to better understand the distribution characteristics of these variables and their potential relationships with CO2 emissions. The boxplots are shown in Figure 3. Meanwhile, Table 2 presents the statistical analysis of the data features.
The box portion of the boxplot represents the interquartile range (IQR), with the median (the 50th percentile of the data) shown as a line. The lower and upper edges of the box correspond to the lower quartile (Q1, 25th percentile) and the upper quartile (Q3, 75th percentile), respectively. The whiskers extend to the most extreme data points that are not classified as outliers (conventionally, within 1.5 times the IQR beyond the quartiles). The box itself contains the middle 50% of the dataset, and a narrower box indicates that the data distribution is more concentrated. The individual points beyond the whiskers represent outliers. Such outliers may be anomalies, but they can also carry important information and should not be dismissed as mere errors or noise.
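For illustration, a minimal sketch of this exploratory step in Python is given below; the file name co2_emissions.csv is a placeholder, and the pandas/Matplotlib calls are an assumed reconstruction rather than the exact code used in this study.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the Kaggle/Canadian open-data CSV (file name assumed for illustration).
df = pd.read_csv("co2_emissions.csv")

# Basic integrity checks: missing values and duplicate records.
print(df.isnull().sum())        # expect all zeros
print(df.duplicated().sum())    # expect 0 duplicate rows

# One boxplot per numerical feature to inspect spread and outliers.
numeric_cols = df.select_dtypes(include="number").columns
fig, axes = plt.subplots(1, len(numeric_cols), figsize=(4 * len(numeric_cols), 4))
for ax, col in zip(axes, numeric_cols):
    ax.boxplot(df[col].dropna())
    ax.set_title(col)
plt.tight_layout()
plt.show()
```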
The statistical analysis of the dataset allows for a more nuanced understanding of the distribution characteristics of each feature variable and their potential impact on CO2 emissions. As demonstrated in Table 2, the proximity of the mean and median values across the features suggests a general symmetry, while the standard deviation provides insights into the variability within the dataset. Notably, features such as "Engine Size (L)" and "Cylinders" exhibit lower standard deviations, indicating a higher concentration of values, whereas "Fuel Consumption Comb (mpg)" and "CO2 Emissions (g/km)" present higher standard deviations, reflecting significant variability. Additionally, the distribution of most features appears to be right-skewed, particularly those associated with fuel consumption and CO2 emissions, which may indicate a tendency toward higher values for these variables.
In the in-depth analysis of the relationships among the feature variables, correlation analysis serves as a fundamental approach. By computing the correlation coefficients between different variables, we can discern which features exert a significant influence on CO2 emissions, thereby informing potential model refinements. To facilitate a more intuitive understanding of these inter-variable relationships, a heatmap was constructed to illustrate the correlations among the numerical features, as depicted in Figure 4. Categorical features were encoded so that all the variables could be integrated into the correlation heatmap. Within this heatmap, the shading of each cell denotes the magnitude of the correlation coefficient between pairs of variables, providing a visual representation of the strength and direction of these associations.
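A minimal sketch of how such a heatmap could be produced is shown below; the label encoding of categorical columns and the use of the seaborn library are assumptions for illustration, not details confirmed by the original text.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("co2_emissions.csv")  # file name assumed

# Encode categorical columns (e.g., Make, Model, Vehicle Class, Transmission,
# Fuel Type) as integer codes so they can enter the correlation matrix.
encoded = df.copy()
for col in encoded.select_dtypes(include="object").columns:
    encoded[col] = encoded[col].astype("category").cat.codes

# Pearson correlation heatmap across all (now numerical) features.
corr = encoded.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", square=True)
plt.title("Correlation between vehicle features and CO2 emissions")
plt.tight_layout()
plt.show()
```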
Upon examination, it is apparent that features such as the Engine Size, Cylinders, and Fuel Consumption indicators (City, Highway, Combined) exhibit a pronounced positive correlation with CO2 emissions. In contrast, the Fuel Consumption Comb (mpg) shows a significant negative correlation with CO2 emissions, underscoring that higher fuel efficiency typically results in reduced emissions. Moreover, categorical variables such as the Make, Model, Vehicle Class, Transmission, and Fuel Type also demonstrate a discernible influence on CO2 emissions.
2.3. Data Processing
To obtain more precise carbon emission fitting and prediction results, we undertook rigorous data preprocessing, encompassing both outlier management and normalization. The meticulous treatment of outliers is essential for mitigating the impact of extreme values that could potentially distort the model’s accuracy. Normalization was employed to standardize all the feature values within a uniform scale, thereby ensuring dimensional consistency across features and enhancing the stability of the data. Furthermore, to maximize the utility of the categorical information, we encoded the categorical variables “Transmission” and “Fuel Type” into numerical representations. Subsequently, all the numerical features were incorporated into the analysis.
2.3.1. Isolation Forest
Isolation Forest (iForest) [31] is an unsupervised anomaly detection algorithm grounded in tree-based structures. This approach constructs an ensemble of random decision trees to isolate individual data points, utilizing the path length of these points within the trees as a criterion for anomaly detection. In contrast to conventional anomaly detection methodologies that are predicated on density or distance metrics, iForest exploits the principle that anomalies are more readily isolated due to their distinctiveness within the dataset. This method is characterized by its linear time complexity and minimal memory consumption, making it both computationally efficient and scalable [32].
The fundamental concept underlying iForest is that anomalies are more susceptible to isolation compared to normal data points. In the framework of random decision trees, anomalies, due to their relative isolation from the bulk of the data, are more easily segregated during the random selection of features and partitioning points. Consequently, anomalies tend to achieve shallower isolation depths. By constructing an ensemble of such decision trees, iForest evaluates the average isolation depth across these trees as a metric for anomaly detection. Points exhibiting shallower isolation depths are more likely to be classified as anomalies, reflecting their inherent tendency to be more readily isolated from the majority of the data.
The algorithmic procedure of the Isolation Forest can be delineated as follows:
- (1)
Construction of the Random Forest. Given a dataset X, initiate the process by generating multiple bootstrap samples. Each bootstrap sample comprises a random subset of the original dataset, typically of equal size. For each bootstrap sample, construct a random decision tree through the subsequent steps.
- (a)
Randomly select a feature from the dataset.
- (b)
For the chosen feature, randomly determine a split point, thereby partitioning the data into two distinct subsets.
- (c)
Recursively apply this procedure to each subset until either each subset is reduced to a single data point or the predefined maximum tree depth is attained.
- (2)
Isolation Depth Calculation. For each data point $x$, the algorithm traverses all the decision trees to determine its path length $h(x)$, which is defined as the number of splits required to isolate $x$, corresponding to the distance from the root node to the leaf node where $x$ resides. The average path length across all the trees is used as a measure of the isolation depth of the data point. The path length quantifies the number of divisions needed to separate the data point $x$ within a tree. Given that anomalies are generally more distinct from other data points, the random splits are more likely to isolate them early in the tree, resulting in relatively shorter path lengths, thereby indicating their anomalous nature.
- (3)
Anomaly Score Determination. Leveraging the path lengths, the algorithm computes an anomaly score for each data point. A normalization factor $c(n)$ is introduced to normalize the path lengths across datasets of different sizes, thereby ensuring that the anomaly scores remain consistent and comparable. For a dataset of size $n$, the normalization factor $c(n)$ is mathematically expressed as follows:
$$c(n) = 2H(n-1) - \frac{2(n-1)}{n} \quad (1)$$
where $H(i)$ represents the $i$-th harmonic number, which is calculated as follows:
$$H(i) \approx \ln(i) + \gamma, \qquad \gamma \approx 0.5772 \ (\text{Euler's constant}) \quad (2)$$
The anomaly score is derived by integrating the average path length $E(h(x))$ with the normalization factor $c(n)$. This score quantifies the extent to which a data point is isolated within the ensemble of trees, with a higher score indicating a stronger likelihood of the point being anomalous. The anomaly score $s(x, n)$ is formally defined as follows:
$$s(x, n) = 2^{-\frac{E(h(x))}{c(n)}} \quad (3)$$
An anomaly score approaching 1 suggests a high probability of the data point x being an anomaly, whereas a score near 0.5 indicates that the data point is more likely to be part of the normal data distribution. Following the application of the Isolation Forest, the processed dataset comprises 6192 instances.
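As an illustration, this outlier-removal step can be sketched with scikit-learn's IsolationForest; the contamination rate of 0.16 is an assumption chosen only to roughly match the reduction from 7385 to 6192 rows reported above, not a setting taken from the paper.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# `encoded` is assumed to be the fully numerical DataFrame produced in the
# earlier sketch (categorical columns label-encoded); the file is a placeholder.
encoded = pd.read_csv("co2_emissions_encoded.csv")

iforest = IsolationForest(n_estimators=100, contamination=0.16, random_state=42)
labels = iforest.fit_predict(encoded)    # +1 = inlier, -1 = anomaly
scores = iforest.score_samples(encoded)  # lower (more negative) = more anomalous

clean = encoded[labels == 1]
print(f"{len(encoded) - len(clean)} rows flagged as outliers and removed")
```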
2.3.2. Normalization
To enhance the precision of the model training and prediction, and to mitigate the scale discrepancies among different features, we utilize the min-max normalization technique. Min-max normalization is a data preprocessing approach that systematically scales features to a predefined range, most commonly [0, 1]. The mathematical formulation of this normalization process is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (4)$$
where $x$ denotes the original value of the feature, while $x_{\min}$ and $x_{\max}$ represent the minimum and maximum values that the feature assumes within the dataset, respectively. The normalized value $x'$ is consequently scaled to fall within the range [0, 1]. This normalization process effectively standardizes the scales of all the features, ensuring that the model's sensitivity to each feature is equitably balanced during both training and prediction. By mitigating biases that may arise from differing feature magnitudes, this method enhances the overall performance and predictive accuracy of the model.
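A minimal sketch of this step, assuming scikit-learn's MinMaxScaler is used to apply Equation (4) column by column:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale every feature column to [0, 1], mirroring Equation (4).
# `clean` is the outlier-free DataFrame from the previous sketch.
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(clean)

# Equivalent manual form for a single column x:
#   x_scaled = (x - x.min()) / (x.max() - x.min())
# scaler.inverse_transform(X_scaled) recovers the original scale for reporting.
```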
2.4. Prediction Models
In this study, a hybrid network architecture has been developed that integrates a multi-scale CNN with LSTM networks, augmented by a multi-head self-attention mechanism. The following sections will provide a comprehensive analysis of these network components and their synergistic interplay.
2.4.1. CNN
CNNs represent a sophisticated class of deep neural networks distinguished by their convolutional architecture, which significantly minimizes the memory requirements of deep network implementations. In a CNN, local features from the image data are methodically extracted through convolutional layers, while pooling layers are employed to reduce the spatial dimensions of these features, thereby preserving essential information while decreasing the computational complexity. Ultimately, fully connected layers are utilized to execute classification or regression tasks. The architecture of the CNN model applied in this study is depicted in
Figure 5.
The convolutional layer employs filters, or convolutional kernels, to extract feature information from specific localized regions within the input data. These filters consist of weight matrices that encode particular features. As these filters traverse the input data, they produce feature maps, where each output neuron is exclusively connected to a defined local region of the preceding layer, referred to as the receptive field. To capture a diverse range of features, convolutional layers deploy multiple filters, with the number of filters in each layer corresponding to the number of output channels. Given that the convolution operation is fundamentally linear, activation functions, such as the rectified linear unit (ReLU), are typically integrated within the convolutional layer to infuse nonlinearity into the network. The mathematical formulation of the convolution operation is presented as follows:
$$y_{i,j} = \sum_{m}\sum_{n} x_{i+m,\,j+n}\, w_{m,n} + b \quad (5)$$
where $y_{i,j}$ denotes the value of the feature map resulting from the convolution operation, $x_{i+m,\,j+n}$ represents the values within a specific local region of the input feature map, $w_{m,n}$ corresponds to the weights of the convolutional kernel, and $b$ is the bias term.
The pooling layer performs regional downsampling on the input signal, thereby achieving the dimensionality reduction of the extracted features and enabling the capture of broader patterns. This process not only diminishes the spatial dimensions of the feature maps but also augments the model's resilience to minor positional variations of the features, thereby enhancing its robustness. The mathematical expressions for the max pooling and average pooling operations are provided below:
$$y_i = \max_{0 \le j < k} p_{i \cdot s + j} \quad (6)$$
$$y_i = \frac{1}{k} \sum_{j=0}^{k-1} p_{i \cdot s + j} \quad (7)$$
where $p$ represents the values within the pooling window of size $k$, $s$ denotes the stride, which specifies the step length by which the window moves across the input feature map, and $y_i$ corresponds to the value in the feature map after pooling.
The fully connected layer represents the final critical component of a CNN, responsible for mapping the extracted features to the ultimate output space, wherein every input neuron is connected to every output neuron. This layer commonly employs the softmax function to produce the final output for classification tasks or to derive predictions for regression tasks.
Over the past decade, CNNs have emerged as the standard paradigm for a wide array of computer vision and machine learning tasks. To adapt the CNN to one-dimensional signal processing, the one-dimensional CNN (1D CNN) was introduced. This variant has demonstrated state-of-the-art performance across various applications, including personalized biomedical data classification and early diagnosis, structural health monitoring, and anomaly detection in power electronics and electric motor fault detection [33]. Unlike the traditional CNN, the filters in the 1D CNN move along a single axis, requiring only straightforward array operations for processing one-dimensional signals [34].
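The following is a minimal 1D CNN sketch in Keras (the framework is an assumption, as the paper does not state it); the filter counts, kernel size, and input shape are illustrative placeholders rather than the study's actual configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal 1D-CNN block: Conv1D filters slide along a single axis, a pooling
# layer downsamples, and dense layers map to the regression output.
def build_1d_cnn(n_steps: int, n_features: int) -> keras.Model:
    inputs = keras.Input(shape=(n_steps, n_features))
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(1)(x)   # predicted CO2 emissions (g/km)
    return keras.Model(inputs, outputs)

model = build_1d_cnn(n_steps=10, n_features=11)   # shapes assumed
model.compile(optimizer="adam", loss="mse")
```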
2.4.2. LSTM
Recurrent neural networks (RNNs) encounter significant challenges, such as short-term memory limitations and gradient explosion, when handling long sequences. The advent of LSTM networks has effectively mitigated these issues. LSTM networks introduce a sophisticated mechanism comprising forget gates, input gates, and output gates to precisely regulate the flow of information. This design empowers the network to retain and utilize critical information across extended temporal spans, thus enhancing its ability to capture long-term dependencies. As a result, LSTM networks exhibit greater stability and efficiency in the processing and prediction of long-sequence data. The comprehensive architecture of the LSTM is illustrated in
Figure 6.
The forget gate is responsible for determining which information from the previous cell state $C_{t-1}$ should be discarded or retained. The output of the forget gate is a vector with elements in the range [0, 1], where each element indicates the retention level of the corresponding information. The mathematical formulation is as follows:
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \quad (8)$$
where $f_t$ represents the forget gate's output, $\sigma$ denotes the sigmoid activation function, $W_f$ is the weight matrix associated with the forget gate, $h_{t-1}$ is the hidden state from the preceding time step, $x_t$ is the input at the current time step, and $b_f$ is the bias term for the forget gate.
The input gate regulates the extent to which the current input $x_t$ influences the update of the cell state. This gate consists of two primary components: one computes the candidate values for the memory cell, and the other determines which of these candidate values will be incorporated into the cell state. The mathematical formulation governing this process is described by Equations (9) and (10):
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \quad (9)$$
$$\tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \quad (10)$$
where $i_t$ represents the activation of the input gate, and $\tilde{C}_t$ denotes the candidate cell state. $W_i$ and $W_C$ are the weight matrices associated with the input gate and the candidate cell state, respectively, while $b_i$ and $b_C$ are their corresponding bias terms. The hyperbolic tangent function $\tanh$ is used to generate the candidate cell state, effectively constraining its values within the range [−1, 1].
The cell state $C_t$ is a central element of the LSTM architecture, encompassing the network's long-term memory across the entire sequence. The update mechanism for the cell state integrates the effects of both the forget gate and the input gate. The formal expression for the cell state update is given by the following equation:
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \quad (11)$$
where $C_t$ represents the cell state at the current time step, $C_{t-1}$ denotes the cell state from the preceding time step, and $\odot$ signifies element-wise multiplication (the Hadamard product). This formulation ensures that the cell state is updated by combining the retained information from the previous state with new information derived from the current input.
The output gate governs the computation of the hidden state $h_t$ at the current time step and dictates the information transmitted to the subsequent time step. This gate combines the current cell state with the control signals from the output gate to generate the hidden state. The mathematical formulation for this process is expressed in Equation (12), while the hidden state is given by Equation (13):
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \quad (12)$$
$$h_t = o_t \odot \tanh(C_t) \quad (13)$$
where $o_t$ represents the output of the output gate, $h_t$ denotes the hidden state and output at the current time step, $W_o$ is the weight matrix associated with the output gate, and $b_o$ is the corresponding bias term.
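To make the gating arithmetic concrete, the sketch below implements a single LSTM time step in NumPy, mirroring Equations (8)-(13); the weight shapes and random initial values are purely illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step following Equations (8)-(13).

    W and b hold the weight matrices and biases for the forget (f), input (i),
    candidate (c), and output (o) transforms; [h_{t-1}, x_t] is concatenated
    as in the equations above.
    """
    z = np.concatenate([h_prev, x_t])
    f_t   = sigmoid(W["f"] @ z + b["f"])     # Eq. (8)  forget gate
    i_t   = sigmoid(W["i"] @ z + b["i"])     # Eq. (9)  input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])     # Eq. (10) candidate cell state
    c_t   = f_t * c_prev + i_t * c_hat       # Eq. (11) cell state update
    o_t   = sigmoid(W["o"] @ z + b["o"])     # Eq. (12) output gate
    h_t   = o_t * np.tanh(c_t)               # Eq. (13) hidden state
    return h_t, c_t

# Illustrative shapes: 4 input features, 8 hidden units, random weights.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) * 0.1 for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```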
2.4.3. Multi-Head Self-Attention
The attention mechanism was first introduced by Bahdanau et al. [35], marking a significant advancement in the ability to model long-range dependencies. The mechanism's success across a variety of tasks highlighted its potential for capturing intricate relationships within sequences. Building on this foundation, Vaswani et al. [36] advanced the field by introducing the self-attention and multi-head self-attention mechanisms, which enhanced the capacity to model complex interactions within sequences more effectively.
In the multi-head self-attention mechanism, the input sequence $X$ undergoes a linear transformation to generate the query matrix $Q$, key matrix $K$, and value matrix $V$. This process is formally expressed as follows:
$$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V} \quad (14)$$
where $W^{Q}$, $W^{K}$, and $W^{V}$ represent the trainable weight matrices. The subsequent step involves calculating the similarity scores between the query and key matrices using the scaled dot-product attention mechanism, thereby producing the attention weight matrix:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (15)$$
where the scaling factor $\sqrt{d_k}$ is introduced to mitigate the risk of the dot-product values becoming disproportionately large. Following this, multiple attention heads are configured, allowing for parallel computation of the attention across different subspaces:
$$\mathrm{head}_i = \mathrm{Attention}\!\left(XW_i^{Q}, XW_i^{K}, XW_i^{V}\right) \quad (16)$$
where $i$ denotes the index of the $i$-th attention head. The outputs of these attention heads are subsequently concatenated and subjected to a linear transformation:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \quad (17)$$
where $W^{O}$ refers to the linear mapping matrix applied after concatenation, resulting in the final output matrix.
The multi-head self-attention mechanism, integral to the Transformer architecture, is fundamentally designed to capture intricate dependencies within sequential data. By conducting multiple self-attention operations in parallel across distinct subspaces, this mechanism empowers the model to attend to a wide array of feature patterns. Consequently, it significantly enhances the model’s ability to discern and interpret positional information embedded within the sequence, thereby improving its overall understanding of the data’s underlying structure.
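The sketch below traces Equations (14)-(17) in NumPy, using random placeholder matrices for the trainable weights; it is intended only to make the computation explicit, not to reproduce the exact layer used in the model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads, rng):
    """Multi-head self-attention following Equations (14)-(17).

    X has shape (seq_len, d_model); the random matrices stand in for the
    trainable weights W^Q, W^K, W^V, and W^O.
    """
    seq_len, d_model = X.shape
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) * 0.1 for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv              # Eq. (14)
        A = softmax(Q @ K.T / np.sqrt(d_k))           # Eq. (15)
        heads.append(A @ V)                           # Eq. (16) per-head output
    Wo = rng.normal(size=(d_model, d_model)) * 0.1
    return np.concatenate(heads, axis=-1) @ Wo        # Eq. (17)

rng = np.random.default_rng(0)
out = multi_head_self_attention(rng.normal(size=(6, 16)), n_heads=4, rng=rng)
```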
2.4.4. CNN-LSTM
The integration of CNNs and RNNs has led to substantial research advancements across various fields, with particularly notable success in domains such as speech recognition [37]. To address the traditional CNN model's dependency on large training datasets, this study introduces LSTM units into the CNN framework. Initially, the data undergo processing through convolutional and pooling layers, which are responsible for extracting critical features. These extracted features are subsequently passed to the LSTM layers, which further analyze and uncover latent temporal dynamics, thereby enabling the more thorough and nuanced capture of contextual information within sequential data. The architecture of the model is illustrated in Figure 7.
The CNN-LSTM model adeptly captures the spatial and temporal characteristics inherent in the input data, thereby enabling precise prediction of CO2 emissions.
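A minimal CNN-LSTM sketch in Keras is given below; the framework and all layer sizes are assumptions for illustration rather than the study's reported configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

# CNN-LSTM sketch: convolution/pooling extract local feature patterns, the LSTM
# layer models temporal dependencies, and a dense head outputs the CO2 prediction.
def build_cnn_lstm(n_steps: int, n_features: int) -> keras.Model:
    inputs = keras.Input(shape=(n_steps, n_features))
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.LSTM(64)(x)
    outputs = layers.Dense(1)(x)
    return keras.Model(inputs, outputs)

model = build_cnn_lstm(n_steps=10, n_features=11)   # shapes assumed
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```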
2.4.5. MSCL-Attention
The MSCL-Attention model seamlessly integrates the multi-scale CNN, LSTM, and multi-head self-attention mechanisms, enabling the efficient extraction and fusion of features from the input data for enhanced predictive performance. The network architecture is depicted in
Figure 8.
In the CNN module, three distinct scales of feature extraction are implemented. The first approach employs direct max pooling to preserve the global features of the data. The second and third approaches utilize convolutional and pooling layers of varying depths to ensure the comprehensive extraction of both global and local features across different levels. These multi-scale feature outputs are concatenated to form a unified representation, which is subsequently refined through a fully connected layer to enhance the feature extraction.
To effectively capture the temporal dependencies inherent in the input sequence, a three-layer stacked LSTM network is employed. The first two LSTM layers retain sequential information, while the final layer produces the definitive temporal feature representation. This feature representation is then integrated with the features extracted from the CNN module, combining spatial and temporal information.
Following this, a multi-head self-attention mechanism is applied to the fused features, enabling the model to capture more complex dependencies. This attention mechanism enhances the model’s capacity to comprehend global information by selectively attending to distinct feature patterns across multiple subspaces, thereby improving its understanding of the input sequence.
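Under the assumption that the model is built with the Keras functional API, the MSCL-Attention layout described above could be sketched as follows; every layer size, branch depth, and the way attention is applied to the fused feature vector are illustrative guesses, with the hyperparameters actually used listed in Table 3.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the MSCL-Attention layout: three CNN branches at different scales,
# a three-layer stacked LSTM, feature fusion, and multi-head self-attention
# before the regression head. All sizes are assumptions (see Table 3).
def build_mscl_attention(n_steps: int, n_features: int) -> keras.Model:
    inputs = keras.Input(shape=(n_steps, n_features))

    # Branch 1: direct max pooling preserves coarse global features.
    b1 = layers.GlobalMaxPooling1D()(inputs)
    # Branches 2 and 3: convolution/pooling stacks of different depths.
    b2 = layers.Conv1D(32, 3, padding="same", activation="relu")(inputs)
    b2 = layers.GlobalMaxPooling1D()(b2)
    b3 = layers.Conv1D(32, 3, padding="same", activation="relu")(inputs)
    b3 = layers.Conv1D(64, 3, padding="same", activation="relu")(b3)
    b3 = layers.GlobalMaxPooling1D()(b3)
    cnn_feat = layers.Dense(64, activation="relu")(layers.Concatenate()([b1, b2, b3]))

    # Three stacked LSTM layers capture temporal dependencies.
    t = layers.LSTM(64, return_sequences=True)(inputs)
    t = layers.LSTM(64, return_sequences=True)(t)
    t = layers.LSTM(64)(t)

    # Fuse spatial and temporal features, then apply multi-head self-attention
    # (here the fused vector is treated as a length-1 sequence; an assumption).
    fused = layers.Concatenate()([cnn_feat, t])
    seq = layers.Reshape((1, 128))(fused)
    att = layers.MultiHeadAttention(num_heads=4, key_dim=32)(seq, seq)
    out = layers.Dense(1)(layers.Flatten()(att))
    return keras.Model(inputs, out)

model = build_mscl_attention(n_steps=10, n_features=11)
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
```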
The predictive capability of deep learning models is predominantly determined by their hyperparameters and architectural design. These components regulate the model’s complexity, learning efficiency, and capacity for generalization.
Table 3 outlines the hyperparameters employed in the MSCL-Attention model, which play a critical role in optimizing its performance.
4. Discussion
Automobile exhaust emissions represent a significant threat to both air quality and global climate stability. The pollutants present in these emissions not only contribute to the formation of smog and exacerbate air pollution but also precipitate a range of respiratory ailments. Additionally, the considerable quantities of carbon dioxide emitted from the combustion of fossil fuels in vehicles intensify global warming, leading to adverse effects such as climate change, an increase in extreme weather events, and rising sea levels. These phenomena pose profound risks to ecological systems and human societies. Consequently, accurate prediction of emission levels is imperative for devising effective strategies to mitigate the adverse impacts of automobile exhaust.
The MSCL-Attention model introduced in this study combines a multi-scale CNN, LSTM networks, and multi-head self-attention mechanisms to offer a highly accurate and robust solution for carbon dioxide emission forecasting. By leveraging feature extraction and fusion across multiple scales, the model adeptly captures intricate patterns in vehicular emission behavior. The LSTM component addresses both short-term and long-term dependencies within temporal data sequences, while the multi-head self-attention mechanism enhances the predictive accuracy by selectively attending to critical features. This integrative approach not only improves the model precision but also strengthens its capability to handle complex emission prediction tasks.
In discussing the findings of this study, it is essential to compare them with other recent research of a similar nature to provide a comprehensive assessment of the MSCL-Attention model. We compared the predictive performance of the MSCL-Attention model with that of other models by examining the RMSE values for the carbon dioxide emission predictions. The results are presented in
Table 6.
The experimental results demonstrate that the MSCL-Attention model surpasses the traditional LSTM, BiLSTM, and other prevalent machine learning methodologies in the prediction of carbon dioxide emissions, while also exhibiting a reduced parameter count compared to the X-MARL model. The model’s superiority is particularly manifested in the following dimensions:
- (1)
Multi-Scale Feature Extraction: In contrast to single-scale convolutional neural networks, the MSCL-Attention model utilizes multi-scale convolutions to extract features across varying levels, thereby enabling the more comprehensive capture of cross-scale information inherent in vehicular emission behaviors and enhancing the model’s generalization capabilities.
- (2)
Temporal Dependency Management: The LSTM architecture is proficient in modeling both long-term and short-term dependencies within time series data, demonstrating significantly stronger performance relative to traditional machine learning approaches (such as ICSO-SVM and RF-DPSO-BP) in the context of complex time series datasets.
- (3)
Integration of Self-Attention Mechanism: The incorporation of a multi-head self-attention mechanism amplifies the model’s focus on critical information by effectively identifying salient features within the data. This capability allows the model to assign varying weights to different features during prediction, thereby enhancing the predictive accuracy. Such a mechanism is more adept at capturing the dynamic fluctuations in vehicular emissions than conventional time series models (including LSTM and BiLSTM).
5. Conclusions
The prediction of vehicular carbon dioxide emissions not only enables automobile manufacturers to optimize engine design, transmission systems, and fuel efficiency during the vehicle development phase—thereby facilitating the production of more energy-efficient and low-emission automotive products—but also assists nations in fulfilling their emission reduction commitments as stipulated in international climate agreements, thereby fostering global cooperation in combating climate change.
In the realm of automotive design, manufacturers can utilize the comprehensive predictive data provided by the model to refine vehicle specifications, such as modifying engine configurations, fuel systems, and transmission designs, to reduce emissions and enhance fuel efficiency. Moreover, the model supports the development and evaluation of innovative technologies, including advanced combustion techniques and improved energy utilization, enabling manufacturers to introduce high-performance vehicles that comply with environmental standards.
From an environmental policy standpoint, the predictive capabilities of the model can aid policymakers in establishing more stringent emission standards and targets, while concurrently optimizing relevant policies to ensure their efficacy in real-world contexts. Additionally, the model facilitates comprehensive assessments and long-term planning of environmental policies, thereby supporting the monitoring and adjustment of policy measures to advance sustainable development objectives. By contributing to the reduction of greenhouse gas emissions, mitigating climate change, and safeguarding the environment, the model holds substantial significance in achieving carbon reduction goals within the transportation sector.
The MSCL-Attention model introduced in this study presents a novel and efficient approach for the prediction of vehicular carbon dioxide emissions. This model synergistically integrates multi-scale convolutional networks, long short-term memory (LSTM) networks, and self-attention mechanisms, exhibiting superior efficacy in capturing intricate emission characteristics relative to existing models. The experimental findings reveal that the MSCL-Attention model excels in extracting latent spatiotemporal relationships when processing multidimensional data, thereby significantly enhancing the predictive accuracy and underscoring its considerable potential in the realm of automotive emission forecasting.
Nonetheless, the model exhibits certain limitations. First, the training dataset employed in this study originates from publicly accessible sources, with a relatively narrow scope of features and vehicle types, thus failing to capture the full diversity of vehicles across different countries, regions, and models on a global scale. Consequently, the model’s generalization capabilities have not yet been thoroughly validated on broader datasets. Furthermore, the computational efficiency of the model warrants improvement, particularly when applied to large-scale datasets, where the computational overhead may become significant. Future research should prioritize enhancing the model’s computational efficiency while incorporating more diverse and comprehensive datasets to rigorously assess its global applicability. Such advancements would offer more robust support for reducing vehicular emissions, improving air quality, and addressing the challenges of global climate change.