Carbon Emission Forecasting Study Based on Influence Factor Mining and Mini-Batch Stochastic Gradient Optimization

Yang, Wei; Yuan, Qiheng; Wang, Yongli; Zheng, Fei; Shi, Xin; Li, Yi

doi:10.3390/en17010188


by Wei Yang ¹, Qiheng Yuan ¹,*, Yongli Wang ², Fei Zheng ³, Xin Shi ¹ and Yi Li ²

¹ Big Data Center of State Grid Corporation of China, Beijing 100052, China
² School of Economics and Management, North China Electric Power University, Beijing 102206, China
³ Beijing China-Power Information Technology Co., Ltd., Beijing 100089, China
* Author to whom correspondence should be addressed.

Energies 2024, 17(1), 188; https://doi.org/10.3390/en17010188

Submission received: 18 October 2023 / Revised: 21 December 2023 / Accepted: 26 December 2023 / Published: 29 December 2023

(This article belongs to the Section B3: Carbon Emission and Utilization)


Abstract

As global carbon emissions draw growing attention, accurate emission forecasts are increasingly needed. Existing carbon emission prediction methods are slow on large-scale data, insufficiently accurate, and do not mine the influencing factors in enough depth. This study proposes a comprehensive carbon emission prediction method. First, multiple influencing factors, including economic and demographic factors, are considered, and a pathway analysis method is introduced to mine the long-term relationship between these factors and carbon emissions. Then, indirect influence terms are added to the multiple regression equation, with a separate variable representing the indirect influence relationship. Finally, the PCA-PA-MBGD method is proposed: the results of principal component analysis are applied to the pathway analysis, which reduces the data dimensions and extracts the main influencing factors, and the carbon emission prediction model is optimized with a mini-batch stochastic gradient descent algorithm. The results show that this method processes large amounts of data quickly and efficiently and predicts carbon emissions accurately. It thereby supports efforts to address the carbon emission problem and offers new ideas and methods for future related research.

Keywords: pathway analysis; carbon emission projections; principal component analysis; mini-batch gradient descent; PCA-PA-MBGD methodology

1. Introduction

Carbon emissions are a major issue of global concern, and in order to address the problem of global warming caused by greenhouse gas emissions, China has put forward a clear “dual-carbon” target. Accurately grasping the trend of carbon emissions may provide a rational basis for the country to formulate a reasonable emission reduction plan, which will be vital if the country is to achieve its dual-carbon target. Relying on an integrated energy system and aiming to minimize the total amount of carbon emissions to achieve carbon peaking and carbon neutrality is a major strategic decision made by integrating the international and domestic situations, and it is of great significance to the realization of the “dual carbon” goal.

Carbon emission forecasting has been studied in China and abroad using a variety of methods, with notable results. Luo Bixiong et al. [1] used a hybrid measurement model to predict carbon emissions in the energy sector and to optimize the installed power structure and the energy consumption structure, with the goals of making these structures cleaner and minimizing carbon emissions. Zhang Shiqiang et al. [2] used the path analysis method to show that optimizing the energy structure and expanding enterprise scale have inhibiting effects on carbon emissions. Lv Yan et al. [3] combined an LSTM model for carbon emission prediction with scenario analysis to anticipate the future trend of carbon emissions from the construction industry in Xinjiang and to derive recommendations for emission reductions.

Liu et al. [4] studied the factors affecting carbon emissions from transport using data from 30 provinces for 2005 to 2019 and aggregated multiple machine learning algorithms into different prediction models. Chen Chuanmin et al. [5] constructed a framework based on the LEAP model for analyzing the carbon emissions of grid enterprises, in order to quantify the emission reduction contribution rate of each factor during grid operation; to capture the correlation among factors, they combined scenario analysis and comparative analysis in a two-level scenario model consisting of integrated scenarios and sub-scenarios. Zhou Cheng et al. [6] used a decomposition–integration strategy, which improves prediction performance by reducing the complexity and non-linearity of the original EC prediction problem. Combining the advantages of trend decomposition, empirical mode decomposition, and wavelet decomposition, they proposed a new three-layer differential evolutionary prediction method. Yang [7] proposed a machine learning-based urban carbon emission prediction method in the context of big data and, after comparing the advantages and disadvantages of each algorithm, selected Random Forest as the optimal algorithm for predicting urban carbon emissions. Wei et al. [8] combined the Tapio decoupling model with the STIRPAT model and evaluated it on 2000–2020 panel data for Henan Province, examining the relationship between the economic development of agriculture and animal husbandry and carbon emissions, as well as related factors. Yue et al. [9] constructed a GCA-GRNN-DOA carbon emission prediction model in order to find an effective prediction method and establish corresponding emission reduction measures. Yan et al. [10] improved forecast results through the joint use of multiple algorithmic models, proposing an LMDI-based decomposition model for carbon emission influencing factors and a carbon emission prediction model based on EEMD-BSO-GPR. Yu et al. [11] constructed a community carbon emission sample database based on the statistical emission factors of the power system of the North China Power Grid, and designed a community carbon emission early-warning system by training an SVR model to predict electricity carbon emissions and optimizing it using GA. Wei et al. [12] used a Tapio decoupling model to explore the connection between carbon emissions and the economy in Henan Province, used an extended STIRPAT model and ridge regression to identify the factors influencing carbon emissions there, and obtained a carbon emission forecasting model. Wang et al. [13] implemented a Lagrange interpolation algorithm in MATLAB to forecast the carbon emissions associated with the economic growth of Beijing and to provide decision support for the government; they derived an “inverted U” relationship between economic growth and both energy consumption and carbon emissions. Chai Tew Ang et al. [14] used an integrated modeling tool that includes a time series ARIMA model to predict total CO₂ emissions from 2009 to 2020 in a study of Malaysia’s energy consumption and transportation CO₂ emissions. Wang et al. [15] proposed a two-stage prediction method based on Support Vector Regression, Random Forest, Ridge Regression, and Artificial Neural Networks for carbon dioxide forecasting and compared it with single prediction methods. Peng [16] proposed a model-based method for predicting short-term carbon emissions from green buildings. The IPCC method was used to identify the interacting elements of green building carbon emissions, which were then classified by importance. The model was applied to calculate short-term CO₂ emissions during the construction phase and over the whole process of implementing a green building, and the IPAT model was used to decompose CO₂ emissions into a product of distinct factors.

As the carbon emission problem becomes more serious, the demand for accurate emission prediction becomes more urgent. Carbon emissions are closely related to social development, but current research generally focuses on a few aspects, such as the economy and energy, studies the correlation between typical influencing factors and carbon emissions, and rarely mines the influencing factors at a deeper level. Mining these factors also requires handling large-scale data. This paper applies the path analysis method to the carbon emission influencing factors and derives the influence relationships among the factors and the corresponding paths. A carbon emission prediction model for large-scale data is then established by estimating the parameters of a multiple regression carbon emission prediction model with the mini-batch gradient descent algorithm. The main contributions of this paper are as follows:

(1): Establishing the set of factors affecting carbon emission prediction in the context of large-scale data and exploring the long-term relationship between economic factors, demographic factors, energy structure, and carbon emissions;
(2): In contrast to previous studies that considered only direct impact factors, using path analysis to capture the indirect impact of the influencing factors on carbon emissions, adding an indirect impact term to the multiple regression equation, and using a separate independent variable to represent the indirect impact relationship;
(3): Proposing the PCA-PA-MBGD methodology, which applies the results of the principal component analysis in the path analysis and improves the timeliness and accuracy of carbon emission calculation on large-scale datasets.

2. Impact Relationship Study Based on Path Analysis Method

2.1. Aggregation of Indicators of Carbon Emission Impact Factors

Factors affecting carbon emissions are diverse and have complex interrelationships. Past studies have used the STIRPAT model [17] and the LEAP model [18] to decompose them. These decompositions include the most significant macro-level drivers of carbon emissions, such as population size, economic level, and energy intensity, and range from social development to the carbon emissions of energy consumption. However, as the accuracy requirements for carbon emission projections increase and the analysis of emission drivers becomes more in-depth, the existing decompositions are not detailed enough. Therefore, to reflect the influencing factors more comprehensively, an indicator system for carbon emissions is established according to its basic composition.

On the basis of previous research on carbon emission influencing factors, a set of indicators of carbon emission influencing factors considering more dimensions was constructed, as shown in Table 1.

The initial set of influencing factors suffers from serious collinearity and poor data accessibility, so it cannot be used directly in the path analysis. Therefore, principal component analysis was used to reduce the dimensionality of the influencing factor data, which eliminates the poorly accessible factors, resolves the collinearity problem, and determines the influencing factors that enter the path analysis.

2.1.1. KMO and Bartlett’s Test of Sphericity

The Kaiser–Meyer–Olkin (KMO) test is a common method for assessing whether data are suitable for principal component analysis. It measures sampling adequacy by comparing the simple correlations among the observed variables with their partial correlations. If the KMO value is above 0.8, the data are well suited to the analysis; if it is between 0.6 and 0.8, the analysis can be conducted; and if it is below 0.6, the analysis is not recommended.

Bartlett’s test of sphericity is another technique for judging the feasibility of PCA. Based on the correlation coefficient matrix of the data, it tests the null hypothesis of sphericity, i.e., that the variables are uncorrelated. If the result of Bartlett’s test is significant, the data can be considered non-spherical and therefore suitable for factor analysis; otherwise, factor analysis is not suitable.

The test results of KMO and Bartlett’s test of sphericity are shown in Table 2.

As shown in Table 2, the KMO value of 0.704 is greater than the criterion of 0.6, so the prerequisites for principal component analysis are met and the research data can be used for it. The data also passed Bartlett’s test of sphericity (p-value < 0.05), again indicating that the research data selected for this paper are suitable for principal component analysis.
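For reference, the two statistics discussed in this subsection are commonly written as below. The notation is introduced here for illustration and is not taken from the original text: $r_{ij}$ is the simple correlation and $a_{ij}$ the partial correlation between variables $i$ and $j$, $\mathbf{R}$ is the correlation matrix, $n$ is the number of samples, and $p$ is the number of variables.

```latex
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}{\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}},
\qquad
\chi^{2} = -\left( n - 1 - \frac{2p + 5}{6} \right) \ln \lvert \mathbf{R} \rvert,
\quad \mathrm{df} = \frac{p (p - 1)}{2}
```

A KMO value close to 1 means the partial correlations are small relative to the simple correlations, and a large $\chi^{2}$ (small p-value) in Bartlett’s test rejects the hypothesis that $\mathbf{R}$ is an identity matrix.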

2.1.2. Principal Component Analysis (PCA)

Principal component analysis was performed on the carbon emission influencing factor set to determine whether the study factors correlate strongly with the principal components, so that the influencing factors can be interpreted in terms of those components. The loading coefficients are then used to analyze the correspondence between each principal component and the factors being analyzed.

Table 3 reports the information extracted by the principal components for the influencing factors and gives the correspondence between the principal components and the influencing factors. As can be seen from Table 3, all the research items have a communality of 0.4 or more, indicating that there is a strong association between the carbon emission influencing factors and the principal components, and that the principal components extract the information effectively.

As shown in Table 4, “GDP”, “GDP per capita”, “Urbanization rate”, “Ratio of foreign direct investment to GDP”, “R&D expenditures”, “share of value added of primary sector in GDP”, “share of value added of secondary sector in GDP”, “share of value added of tertiary industry in GDP”, and “value added of industry” have higher loadings on the first factor. Principal component one therefore mainly explains these carbon emission influencing factors and is named economic and industrial structure. “Total Electricity Consumption”, “Total Electricity Generation”, and “Total Thermal Power Generation” have higher loadings on the second factor, so principal component two mainly explains these factors and is interpreted as electricity consumption intensity. “Raw coal carbon dioxide emissions”, “crude oil carbon dioxide emissions”, and “natural gas carbon dioxide emissions” have higher loadings on the third factor, so principal component three mainly explains these factors and is named energy carbon emission intensity. “Urbanization rate” also has a higher loading on the second factor, which is interpreted as electricity consumption intensity. “Urbanization rate”, “population size”, and “population per unit area” have high loadings on the fourth factor, so principal component four mainly explains these factors and is interpreted as social development. “Coal usage”, “Oil usage”, “Natural gas usage”, and “Energy usage” have high loadings on the fifth factor, so principal component five mainly explains these factors and is interpreted as the energy structure.
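The way communalities and component assignments are read off a loading matrix can be sketched as follows. This is a minimal illustration, not the SPSS procedure used in the paper: the loading values are hypothetical and are not those of Table 3 or Table 4, and the helper names `communality` and `dominant_component` are invented for this example. Only the 0.4 threshold comes from the text.

```python
# Hypothetical loading matrix: rows are indicators, columns are principal components.
# The numbers are illustrative only and are NOT the values from Table 3 or Table 4.
loadings = {
    "GDP": [0.95, 0.20, 0.10],
    "Total electricity consumption": [0.30, 0.90, 0.05],
    "Population size": [0.25, 0.15, 0.88],
}


def communality(row):
    """Sum of squared loadings: variance share explained by the retained components."""
    return sum(a * a for a in row)


def dominant_component(row):
    """1-based index of the component with the largest absolute loading."""
    return max(range(len(row)), key=lambda k: abs(row[k])) + 1


summary = {
    name: (round(communality(row), 4), dominant_component(row))
    for name, row in loadings.items()
}

for name, (h2, comp) in summary.items():
    flag = "ok" if h2 >= 0.4 else "weak"  # 0.4 is the threshold quoted in the text
    print(f"{name}: communality={h2} ({flag}), principal component {comp}")
```

Each indicator is assigned to the component on which it loads most strongly, which mirrors how the five components above are named after the indicators they mainly explain.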

2.2. Influence Factor Pathway Analysis

As a method of multivariate statistical analysis, pathway analysis can be used to analyze the direct, indirect, and combined contributions of multiple independent variables to the dependent variable. With independent variables $x_{1}, x_{2}, \dots, x_{k}$ and dependent variable $y$, the basic path analysis model is shown in Equation (1):

\left\{ \begin{array}{l} p_{1 y} + r_{12} p_{2 y} + r_{13} p_{3 y} + \dots + r_{1 k} p_{k y} = A_{1 y} \\ r_{21} p_{1 y} + p_{2 y} + r_{23} p_{3 y} + \dots + r_{2 k} p_{k y} = A_{2 y} \\ r_{31} p_{1 y} + r_{32} p_{2 y} + p_{3 y} + \dots + r_{3 k} p_{k y} = A_{3 y} \\ \dots \\ r_{k 1} p_{1 y} + r_{k 2} p_{2 y} + r_{k 3} p_{3 y} + \dots + p_{k y} = A_{k y} \end{array} \right.

(1)

where $p_{i y}$ is the direct path coefficient, indicating the direct influence of $x_{i}$ on the dependent variable $y$; $r_{i j} p_{j y}$ is the indirect path coefficient, indicating the indirect influence of $x_{i}$ on $y$ through $x_{j}$; $r_{i j}$ is the simple correlation coefficient between $x_{i}$ and $x_{j}$; and $A_{i y}$ is the correlation coefficient between $x_{i}$ and $y$. To express the combined effect of each principal component on carbon emissions more intuitively, a decision coefficient is calculated for each variable. This coefficient captures both the direct effect of an independent variable on the dependent variable and its indirect effect through the other independent variables, and is calculated as shown in Equation (2):

R_{(i)}^{2} = 2 p_{i y} \cdot A_{i y} - p_{i y}^{2}

(2)

where $R_{(i)}^{2}$ is the decision coefficient of independent variable $x_{i}$ for dependent variable $y$. If $R_{(i)}^{2} > 0$, $x_{i}$ promotes $y$; if $R_{(i)}^{2} < 0$, $x_{i}$ inhibits $y$. Because the factors affecting carbon emissions are interrelated, this paper investigates the impact of five types of indicators, namely economic and industrial structure, electricity consumption intensity, energy carbon emission intensity, social development, and energy structure, on carbon emissions in Tianjin. In addition to its direct effect, each factor may affect carbon emissions indirectly through the other factors. The direct and indirect influences of each indicator on carbon emissions are analyzed with path analysis in SPSS (IBM SPSS Statistics 26) to derive the coefficient of each principal component in the prediction model.
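The linear system of Equation (1) can be illustrated with a small numerical sketch. It is not the SPSS workflow used in the paper: the correlation values are hypothetical, the helper `solve` is a plain Gaussian elimination written for this example, and the decision coefficient is computed in its commonly used form, $2 p_{iy} A_{iy} - p_{iy}^{2}$, which is an assumption about the intended form of Equation (2).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


# Hypothetical correlations among three factors (r_ij) and with y (A_iy).
r = [[1.0, 0.5, 0.3],
     [0.5, 1.0, 0.2],
     [0.3, 0.2, 1.0]]
A = [0.8, 0.6, 0.4]

p = solve(r, A)  # direct path coefficients p_iy, i.e. the solution of Equation (1)
k = len(p)

# indirect[i][j] = r_ij * p_jy: effect of x_i on y routed through x_j (j != i)
indirect = [[r[i][j] * p[j] if i != j else 0.0 for j in range(k)] for i in range(k)]

# Decision coefficient in its commonly used form (assumed here).
decision = [2 * p[i] * A[i] - p[i] ** 2 for i in range(k)]

for i in range(k):
    print(f"x{i + 1}: direct={p[i]:.4f}, indirect total={sum(indirect[i]):.4f}, "
          f"decision={decision[i]:.4f}")
```

By construction, the direct effect of each factor plus the sum of its indirect effects reproduces its correlation with $y$, which is exactly what each row of Equation (1) states.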

In this paper, carbon emissions were selected as the dependent variable of the path analysis, the secondary indicators in the carbon emission influence factor index system were the independent variables, and the sample period was 1997–2021. Stepwise regression analysis was carried out in SPSS. The results show that the influencing factors of carbon emissions interact and are coupled in complex ways: strongly correlated factors influence the dependent variable through other variables. Such indirect influences have often been neglected in previous studies and deserve attention. This paper therefore uses path analysis to study these influences in depth and to explore the interactions between factors.

The direct path coefficients and corresponding p-values of the principal components, including economic and industrial structure and electricity consumption intensity, were calculated and are shown in Table 5.

Table 5 gives the direct path effect of each factor on carbon emissions. Ranked by the size of this effect, the influencing factors are economic and industrial structure > electricity consumption intensity > energy structure > social development > energy carbon emission intensity.

For the path from economic and industrial structure to carbon emissions, the standardized path coefficient is 0.941 > 0, and the path is significant at the 0.01 level (z = 45.466, p = 0.000 < 0.01). This shows that the economic and industrial structure has a significant positive impact on carbon emissions. The path from energy carbon emission intensity to carbon emissions is not significant at the 0.05 level (p = 0.526 > 0.05), so no significant impact of energy carbon emission intensity on carbon emissions is found.

At the same time, the pathway analysis considered four groups of indirect pathways, for example the indirect impact of electricity consumption intensity on carbon emissions through economic and industrial structure. Among them, economic and industrial structure and social development influence each other, and their indirect influence on carbon emissions is more pronounced. However, the influence of energy carbon emission intensity on energy structure is not significant.

These results show that, at the current stage, the biggest influence on carbon emissions in Tianjin is still the economic and industrial structure. Owing to the characteristics of the industrial structure underlying economic development, the influences of energy carbon emission intensity and energy structure on carbon emissions are comparatively small. Nevertheless, the energy structure constrains carbon emissions to some extent, which indicates that the growth of carbon emissions has slowed as the energy structure has changed and the use of clean and renewable energy has increased. The pathway model is shown in Figure 1.

The coefficient of determination is $R^{2} = 0.97199$, so the residual effect is 1 − 0.97199 = 0.02801. The selected influencing factors therefore explain as much as 97.199% of China’s carbon emissions, and the path analysis captures the main influencing factors.

3. Multiple Regression Prediction Model Based on Mini-Batch Stochastic Gradient Optimization

3.1. Carbon Emission Prediction Model Optimized by Mini-Batch Stochastic Gradient Descent Algorithm

Gradient descent is a common optimization method for finding an optimal solution. Its basic idea is to minimize the loss function by iteratively adjusting the parameters.

In the traditional (batch) algorithm, the entire dataset has to be traversed in each iteration, so computation is slow when the amount of data is large. In addition, when the loss has numerous local minima, the traditional method can become trapped in a local optimum.

To address the inefficiency of large-scale matrix operations, this paper uses a mini-batch stochastic gradient descent algorithm (MBGD), built on the classical gradient descent method, to solve for the influence factor weight matrix $\hat{θ}$ of the multiple linear regression equation. Figure 2 shows the flowchart of this algorithm.

The following are the steps for solving the mini-batch stochastic gradient descent method.

Step 1: Take the partial derivative of the MSE loss function in Equation (9) with respect to each $θ_{i}$ to obtain the gradient vector of the loss, as shown in Equation (3):

\nabla_{θ} MSE(θ) = \begin{bmatrix} \frac{\partial}{\partial θ_{0}} MSE(θ) \\ \frac{\partial}{\partial θ_{1}} MSE(θ) \\ \vdots \\ \frac{\partial}{\partial θ_{n}} MSE(θ) \end{bmatrix} = \frac{2}{m} X^{T} (X θ - y)

(3)

Step 2: To find the $θ$ that minimizes $MSE(θ)$, construct the gradient descent iteration as shown in Equation (4):

θ^{(n + 1)} = θ^{(n)} - η \nabla_{θ} MSE(θ^{(n)})

(4)

where $η$ is the learning rate, which determines the step length of each iteration.

Step 3: Following the iteration in Equation (4), a mini-batch of $k$ samples is selected at random, the gradient is computed on it, and the weights are updated once.

Step 4: Step 3 is repeated until all sample data have been used for training, which yields the influence factor weight matrix $\hat{θ}$.
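Steps 1 to 4 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors’ implementation: the data are synthetic and noise-free, the function names `predict` and `mbgd` and the hyperparameters (`eta`, `batch_size`, `epochs`, `seed`) are chosen for this example, and several passes over the data are made, whereas Step 4 describes a single pass.

```python
import random


def predict(theta, x):
    """Linear model: theta[0] is the intercept, theta[1:] are the factor weights."""
    return theta[0] + sum(t * v for t, v in zip(theta[1:], x))


def mbgd(xs, ys, eta=0.05, batch_size=4, epochs=2000, seed=0):
    """Mini-batch gradient descent on the MSE loss of a linear model."""
    rng = random.Random(seed)
    theta = [0.0] * (len(xs[0]) + 1)
    idx = list(range(len(xs)))
    for _ in range(epochs):                   # assumption: multiple passes over the data
        rng.shuffle(idx)                      # Step 3: batches are drawn at random
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            grad = [0.0] * len(theta)
            for i in batch:
                err = predict(theta, xs[i]) - ys[i]
                grad[0] += err                # derivative with respect to the intercept
                for j, v in enumerate(xs[i]):
                    grad[j + 1] += err * v
            # Gradient of Equation (3) evaluated on the batch, then the update of Equation (4)
            theta = [t - eta * 2.0 / len(batch) * g for t, g in zip(theta, grad)]
    return theta


# Synthetic toy data generated from y = 1 + 2*x1 - 3*x2 (no noise, features in [-1, 1]).
data_rng = random.Random(1)
xs = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(40)]
ys = [1 + 2 * a - 3 * b for a, b in xs]

theta_hat = mbgd(xs, ys)
print([round(t, 4) for t in theta_hat])
```

Each update touches only `batch_size` samples, so the cost per step does not grow with the size of the dataset. Scaling the features to a comparable range, as in this toy data, keeps a single learning rate workable for all weights.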

3.2. Regression Model

The aim of multiple regression analysis is to find the quantitative relationship between the dependent variable $Y$ and multiple independent variables $X = [x_{1}, x_{2}, \dots, x_{n}]$, using the least squares method to estimate the linear relationship between them.

The multiple linear regression model is the sum of the products of each impact factor and its weight, plus a constant bias term, as shown in Equation (5):

\hat{y} = θ_{0} + θ_{1} x_{1} + θ_{2} x_{2} + \dots + θ_{n} x_{n}

(5)

where $\hat{y}$ is the predicted carbon emissions; $n$ is the number of impact factors; $x_{i}$ is the value of the $i$-th impact factor; $θ_{i}$ is the weight given to the $i$-th impact factor; and $θ_{0}$ is the bias term.

In this paper, in addition to the direct influence relationship between carbon emissions and the influencing factors, the indirect influences are also considered, and the original multiple regression model is extended accordingly. First, for each mediating variable and path, the indirect effect is calculated as the product of the independent variable and mediating variable terms on the path. Then, the mediating effects are summed to obtain the overall indirect effect, which enters the multiple regression equation as the coefficient of a separate variable, denoted $x_{j}$. The multiple regression equation considering the indirect effect is shown in Equation (6):

\hat{y} = θ_{0} + θ_{1} x_{1} + θ_{2} x_{2} + \dots + θ_{n} x_{n} + r p^{2} x_{j}

(6)

Now, assume that the effect of $x_{1}$ on $\hat{y}$ is realized through the mediating variable $x_{2}$, and that the direct effect coefficient $p$ and the indirect effect coefficient $rp$ of $x_{1}$ on $\hat{y}$ are obtained through path analysis. To reflect the indirect effect in the multiple regression equation, the coefficient of $x_{1}$ is replaced by $p \times rp$, that is, $rp^{2}$, so that, finally, the multivariate regression model based on the PCA-PA-MBGD method is as shown in Equation (7):

\hat{y} = θ_{0} + θ_{1} x_{1} + θ_{2} x_{2} + \dots + θ_{n} x_{n} + r p^{2} x_{j}

(7)

In vector form, the multiple linear regression model is shown in Equation (8):

\hat{y} = θ^{T} X

(8)

where $θ^{T}$ is the vector of weighting factors, $θ^{T} = [θ_{0}, θ_{1}, θ_{2}, \dots, θ_{n}]$, and $X$ is the vector of influence factors, $X = [1, x_{1}, x_{2}, \dots, x_{n}]$.

The closer the size of predicted carbon emissions is to the size of real carbon emissions, the better; i.e., the smaller the mean square error (MSE) of the multiple linear regression training model, the better. Therefore, the MSE function of the linear regression model is derived as shown in Equation (9):

MSE(θ) = \frac{1}{m} \sum_{i = 1}^{m} {(θ^{T} X^{(i)} - y^{(i)})}^{2}

(9)

To find the value of $θ$ that minimizes the MSE function, the gradient of $MSE(θ)$ is set to 0, as shown in Equation (10):

\nabla_{θ} MSE(θ) = 0

(10)

This leads to the normal equation of the multiple linear regression, as shown in Equation (11):

\hat{θ} = {(X^{T} X)}^{- 1} X^{T} y

(11)
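The step from Equation (10) to Equation (11) can be written out explicitly. The sketch below uses the matrix form of the gradient given in Equation (3) and assumes that $X^{T} X$ is invertible, which the original text does not state:

```latex
\nabla_{θ} MSE(θ) = \frac{2}{m} X^{T} (X θ - y) = 0
\;\Longrightarrow\;
X^{T} X θ = X^{T} y
\;\Longrightarrow\;
\hat{θ} = {(X^{T} X)}^{-1} X^{T} y
```

Here $X$ denotes the matrix whose rows are the training samples and $y$ the vector of observed carbon emissions.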

By inputting the matrix $X$ of influencing factors and the vector $y$ of carbon emissions from the training set into the normal equation, the impact factor weight matrix $\hat{θ}$ can be derived, which gives the carbon emission prediction model. However, the normal equation method is impractical for carbon emission projection on large-scale data, because inverting the matrix $X^{T} X$ consumes huge computational resources.
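For comparison with the iterative approach, the normal equation can be evaluated directly on a small dataset. The sketch below is illustrative only: the data are synthetic, the helper names `solve` and `normal_equation` are invented for this example, and the system $X^{T} X θ = X^{T} y$ is solved by Gaussian elimination rather than by forming the inverse explicitly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def normal_equation(xs, ys):
    """Least-squares weights from X^T X theta = X^T y; a column of ones is prepended."""
    rows = [[1.0] + list(x) for x in xs]
    d = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(d)]
    return solve(xtx, xty)


# Synthetic toy data generated from y = 1 + 2*x1 - 3*x2 (no noise).
xs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [0.0, 2.0]]
ys = [1 + 2 * a - 3 * b for a, b in xs]

theta_hat = normal_equation(xs, ys)
print([round(t, 6) for t in theta_hat])
```

Building $X^{T} X$ requires a full pass over all samples and the solve scales cubically with the number of factors, which is the cost that the mini-batch approach avoids on large datasets.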

3.3. F-Test

The purpose of the F-test in regression analysis is to test whether there is a significant linear relationship between the dependent variable $y$ and the independent variables $x_{1}, x_{2}, \dots, x_{p}$. The test statistic is

F = \frac{SSR / p}{SSE / (n - p - 1)} \overset{H_{0}}{\sim} F (p, n - p - 1)

(12)

SSR = \sum_{i = 1}^{n} {({\hat{y}}_{i} - \bar{y})}^{2}

(13)

SSE = \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}

(14)

where SSR is the regression sum of squares; SSE is the error (residual) sum of squares; ${\hat{y}}_{i}$ is the $i$-th fitted value; $\bar{y}$ is the mean of the observations; $p$ is the number of independent variables; and $n$ is the number of observations.

For a given significance level $α$, the critical value $F_{α} (p, n - p - 1)$ of the rejection region is obtained from the F-distribution table using the first degree of freedom $p$ and the second degree of freedom $(n - p - 1)$. This value is then compared with the calculated $F$ statistic: if $F ⩽ F_{α} (p, n - p - 1)$, there is no significant linear relationship between $y$ and $X$; if $F > F_{α} (p, n - p - 1)$, $y$ can be fitted linearly by $x_{1}, x_{2}, \dots, x_{p}$.
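The test in Equations (12) to (14) can be sketched as follows. The observations are made-up toy values, the fitted values come from a one-variable least-squares line computed inline, and the function name `f_statistic` is chosen for this example. The critical value is the tabulated $F_{0.05}(1, 3) \approx 10.13$ for this toy setting, not a value from the paper.

```python
def f_statistic(y, y_hat, p):
    """F = (SSR / p) / (SSE / (n - p - 1)), following Equations (12)-(14)."""
    n = len(y)
    y_bar = sum(y) / n
    ssr = sum((f - y_bar) ** 2 for f in y_hat)         # regression sum of squares
    sse = sum((a - f) ** 2 for a, f in zip(y, y_hat))  # error sum of squares
    return (ssr / p) / (sse / (n - p - 1))


# Toy observations (hypothetical) and a one-variable least-squares fit, so p = 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
         / sum((a - x_bar) ** 2 for a in x))
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * a for a in x]

F = f_statistic(y, y_hat, p=1)
F_CRIT = 10.13  # tabulated F_0.05(1, 3), i.e. p = 1 and n - p - 1 = 3
reject_h0 = F > F_CRIT
print(f"F = {F:.2f}, reject H0 (no linear relationship): {reject_h0}")
```

When `reject_h0` is true, the hypothesis of no linear relationship is rejected at the chosen significance level, which corresponds to the case $F > F_{α} (p, n - p - 1)$ described above.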

4. Example Analysis

4.1. Calculation Background

This experiment is based on annual data on carbon emissions and their influencing factors in Tianjin from 1997 to 2021 (China Emission Accounts and Datasets (CEADs); Tianjin Statistical Yearbook (1997–2021)). First, the carbon emission data and influence factor data are preprocessed, and abnormal and incomplete records are eliminated to construct the influence factor dataset. Then, path analysis of the carbon emission influencing factors yields the direct and indirect influence relationships between the factors and carbon emissions. A multiple regression model considering the indirect effect is established, its coefficients are calculated using the mini-batch gradient descent algorithm, and finally the carbon emissions of Tianjin are predicted under multiple influencing factors. The flow of the PCA-PA-MBGD method proposed in this paper is shown in Figure 3.

4.2. Indicators for Evaluating the Results of Carbon Emission Projections

The evaluation indexes used in this paper are the relative error, the mean absolute percentage error, and the root mean square error.

(1) The relative error (RE) is the ratio of the prediction error to the actual value, as shown in Equation (15):

RE = \frac{y (i) - \hat{y} (i)}{y (i)}

(15)

(2) The mean absolute percentage error (MAPE) can precisely indicate the size of the errors in prediction, and the calculation formula is shown in Equation (16):

MAPE = \frac{1}{n} \sum_{i = 1}^{n} \left| \frac{y (i) - \hat{y} (i)}{y (i)} \right| \times 100\%

(16)

(3) The root mean square error (RMSE) measures the deviation between the predicted data and the observed data, and is calculated as shown in Equation (17):

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} | y (i) - \hat{y} (i) |^{2}}

(17)
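
As a minimal sketch, Equations (15)–(17) can be written directly in Python. The function names and the three sample values are illustrative assumptions and are not data from the study.

```python
import numpy as np


def relative_error(y, y_hat):
    """Equation (15): signed relative error of each prediction."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return (y - y_hat) / y


def mape(y, y_hat):
    """Equation (16): mean absolute percentage error, in percent."""
    return float(np.mean(np.abs(relative_error(y, y_hat))) * 100.0)


def rmse(y, y_hat):
    """Equation (17): root mean square error, in the units of y."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean(np.abs(y - y_hat) ** 2)))


# Hypothetical actual and predicted values.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(relative_error(actual, predicted), mape(actual, predicted), rmse(actual, predicted))
```

MAPE is scale-free, whereas RMSE is expressed in the units of the data (tons here) and weights large individual errors more heavily. A model can therefore rank well on one measure and poorly on the other.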

4.3. Comparison of Carbon Emission Projection Results

The screened indicator set was recoded as x_{1}, x_{2}, \dots, x_{p}. For the economic and industrial structure and electricity consumption intensity variables, the F-test is significant at the 95% confidence level (F = 13,476 > F_{0.05} = 5.050). This suggests that the regression model has a significant linear relationship with the economic and industrial structure variables.

The first 20 years of the dataset (1997–2016) were used as the training set, and the last 5 years (2017–2021) were used as the test set for carbon emission prediction. The model predicts well on the test set: its maximum relative error is 4.51%, its mean absolute percentage error is 1.7372%, and its root mean square error is 629,082.0674 tons. The relative error curves of the model's prediction results before and after optimization with the MBGD algorithm are shown in Figure 4.
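
The chronological split described here can be sketched as follows. The helper name and the placeholder values are assumptions made for illustration; only the year boundaries come from the text.

```python
def split_by_year(years, values, last_train_year=2016):
    """Split a yearly series chronologically, without shuffling."""
    pairs = list(zip(years, values))
    train = [(yr, v) for yr, v in pairs if yr <= last_train_year]
    test = [(yr, v) for yr, v in pairs if yr > last_train_year]
    return train, test


years = list(range(1997, 2022))                    # 1997 through 2021
emissions = [float(i) for i in range(len(years))]  # placeholder values, not real data
train, test = split_by_year(years, emissions)
print(len(train), len(test))  # 20 5
```

Keeping the split in time order means the model is always evaluated on years later than any it was trained on.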

As can be seen in Figure 4, the MAPE of this method is 1.219% lower and the RMSE is 113,597.08 tons lower than those of the multiple regression equation without MBGD optimization.

In order to further test the validity of the PCA-PA-MBGD model, the prediction method of this paper is compared with three other models. The Long Short-Term Memory (LSTM) model is able to efficiently capture long-range dependencies in sequential data and therefore performs well in time series prediction tasks. The Autoregressive Integrated Moving Average–Support Vector Regression (ARIMA-SVR) model is a combined forecasting model that joins the ARIMA and SVR methods in order to improve forecasting accuracy while taking into account both the linear and non-linear characteristics of time series data. The Grey Model (GM) is a prediction method based on grey system theory, mainly used to deal with systems with uncertainty and incomplete information. The absolute percentage errors of each model are shown in Figure 5, and the MAPE, RMSE, and calculation time used to evaluate the performance of the models are listed in Table 6.
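
To make the grey-model baseline concrete, the following is a minimal sketch of the common GM(1,1) variant. The text does not state which grey model variant was used, so GM(1,1) is an assumption here, and the input series is hypothetical.

```python
import numpy as np


def gm11_forecast(x0, steps=1):
    """Forecast the next `steps` values of a positive series with GM(1,1)."""
    x0 = np.asarray(x0, dtype=float)
    n = x0.size
    x1 = np.cumsum(x0)                # accumulated generating series
    z = 0.5 * (x1[1:] + x1[:-1])      # background values
    B = np.column_stack([-z, np.ones(n - 1)])
    # Least-squares estimate of a and b in x0(k) + a * z(k) = b.
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]
    k = np.arange(n + steps)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a  # time response function
    x0_hat = np.diff(x1_hat, prepend=0.0)              # restore the original series
    return x0_hat[n:]


# Hypothetical series growing by about 10% per step.
series = [100.0, 110.0, 121.0, 133.1, 146.41]
print(gm11_forecast(series, steps=1))
```

GM(1,1) needs only a short series and fits an exponential trend. This makes it a natural baseline for annual data, but it cannot use the influencing factors that the regression model relies on.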

From Figure 5, it can be seen that the PCA-PA-MBGD model proposed in this paper has the highest prediction accuracy under the premise of a large-scale dataset compared with other classical carbon emission prediction models.

As can be seen from Table 6, the PCA-PA-MBGD model has the lowest MAPE and the shortest calculation time of the compared models. It also has a shortcoming: the stability of its prediction results, as measured by RMSE, is lower than that of the ARIMA-SVR model.

5. Conclusions

To address the problems of slow calculation speed, inaccurate prediction, and insufficient depth in mining influencing factors for carbon emission prediction in the context of big data, this article first considers a number of influencing factors, such as economic factors, demographic factors, and energy structure, when building the carbon emission prediction model. Mining the long-term relationship between the set of influencing factors and carbon emissions gives a more comprehensive understanding of how carbon emissions form and lays the groundwork for more accurate prediction results. Secondly, whereas previous studies considered only the impacts of direct influencing factors on carbon emissions, this study introduces the method of pathway analysis. By analyzing the indirect influence paths of the influencing factors on carbon emissions, the mechanism of each factor can be assessed more comprehensively and hidden influence relationships can be revealed. Finally, the PCA-PA-MBGD methodology is proposed, in which the results of the principal component analysis are applied to the pathway analysis. This method improves the timeliness and correctness of carbon emission calculations on large-scale datasets. By reducing the data dimensions and extracting the main influencing factors, the pathway analysis can be performed more efficiently and can provide more reliable parameters for the carbon emission prediction model.

The carbon emission prediction model built with the proposed mini-batch stochastic gradient descent algorithm has an MAPE of 3.132% and an RMSE of 629,082.0674. Relative to the multiple regression model before optimization with MBGD, this method reduces the MAPE by 1.219% and the RMSE by 113,597.08. Moreover, a comparison of the prediction results of the PCA-PA-MBGD model with those of the other carbon emission prediction models investigated shows that, in the context of large-scale data, the PCA-PA-MBGD model's computational speed and MAPE are better than those of the other models. This means that the method can handle a large amount of data quickly and efficiently and can construct an effective prediction model for carbon emissions.

Meanwhile, the results of mining the carbon emission influencing factors provide some suggestions for policy making and future research on carbon emission forecasting:

(a): Compared with the results of existing research, the economic and industrial structure is one of the influencing factors that is more closely related to carbon emissions. Since the secondary industry has high energy consumption and high carbon emissions, it is crucial to optimize the industrial structure without affecting the economic growth rate, and policies need to be introduced to weaken the link between resource consumption and carbon emissions. Support for energy management and technological innovation should be strengthened so as to promote the transition of enterprises from traditional to clean energy.
(b): The intensity of electricity consumption is second only to economic and industrial structure in its influence on carbon emission forecasts. Since electricity is an indispensable energy source for various industries, and the power generation process itself releases a large amount of carbon, subsequent research can focus on exploring the relationship between electricity and carbon emissions in order to accurately predict and measure carbon emissions.

Although this paper has made some progress in the study of carbon emission prediction, the model leaves room for improvement, in particular in its stability (RMSE). Since the mini-batch gradient descent algorithm is better suited to large-scale data, more influencing factors should be explored when calculating carbon emissions, which would highlight the speed and accuracy advantages of the PCA-PA-MBGD model over other traditional models. In the future, the existing methodology needs to be further improved and refined to increase the stability of the model’s predictions. It also remains necessary to look for ways to mine the factors affecting carbon emissions in the context of large-scale data, so as to explain the correlation between these factors and carbon emissions.

Author Contributions

Conceptualization, W.Y.; Methodology, Q.Y.; Formal analysis, F.Z.; Writing—review & editing, Y.L.; Visualization, Y.W.; Supervision, X.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Big Data Center of State Grid Corporation of China Science and Technology Project Grant (contract No. SGSJ0000NYJS2310039).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy or ethical restrictions.

Conflicts of Interest

Authors Wei Yang, Qiheng Yuan and Xin Shi were employed by the company Big Data Center of State Grid Corporation of China. Author Fei Zheng was employed by the company Beijing China-Power Information Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Figure 1. Model diagram of the results of the pathway analysis.

Figure 2. Flow chart of mini-batch stochastic gradient descent.

Figure 3. Flowchart of PCA-PA-MBGD approach.

Figure 4. RE value of prediction results before and after MBGD optimization.

Figure 5. Comparison of the absolute percent error curves of the predicted and true values of each model.

Table 1. Segmentation of the impact of carbon emissions.

Factor	Serial Number
GDP	X1
GDP per capita	X2
Urbanization rate	X3
Foreign direct investment as a percentage of GDP	X4
R&D expenditure	X5
Size of population	X6
Population per unit area	X7
Coal consumption	X8
Oil consumption	X9
Natural gas consumption	X10
Energy consumption	X11
Value added by industry	X12
Total natural thermal power generation	X13
Primary sector as a share of GDP	X14
Secondary sector as a share of GDP	X15
Tertiary sector as a share of GDP	X16
Total amount of electricity generated	X17
Total electricity consumption	X18
Carbon emissions from coal energy	X19
Carbon emissions from oil energy	X20
Natural gas energy carbon emissions	X21

Table 2. KMO and Bartlett’s test.

KMO		0.704
Bartlett’s test of sphericity	Approx. chi-square	908.913
	df	136
	p-value	0.000

Table 3. Load factor table.

Name	Principal Component 1	Principal Component 2	Principal Component 3	Principal Component 4	Principal Component 5	Commonality (Common Factor Variance)
GDP (million CNY)	0.982	0.143	−0.041	0.074	0.014	0.992
GDP per capita	−0.911	0.300	−0.187	0.190	−0.004	0.991
Urbanization rate	0.994	−0.054	0.082	−0.006	−0.029	0.998
Ratio of foreign direct investment to GDP (USD million)	0.911	−0.229	−0.051	0.291	−0.072	0.976
Expenditure on R&D	0.898	0.210	0.025	0.117	0.185	0.900
Population size (10,000 people)	0.988	0.031	0.040	−0.089	0.037	0.988
Population per unit area (10,000 people/km²)	0.664	0.502	0.059	−0.521	0.017	0.969
Coal consumption	0.820	−0.492	−0.030	0.001	0.203	0.957
Natural gas consumption	0.976	0.173	−0.052	0.040	0.044	0.988
Energy consumption	0.976	−0.169	0.110	−0.058	0.009	0.997
Value added of primary industry as % of GDP	−0.934	0.259	−0.150	0.163	−0.004	0.989
Value added of secondary industry as % of GDP	−0.361	−0.903	0.060	−0.149	0.127	0.987
Value added of tertiary industry in GDP	0.638	0.746	−0.006	0.084	−0.115	0.985
Value added of industry (billion CNY)	0.990	−0.005	−0.030	0.103	0.007	0.991
Total electricity consumption	0.995	−0.070	0.023	0.016	−0.003	0.996
Total electricity generation	0.975	−0.101	0.126	0.027	−0.123	0.993
Total thermal power generation	0.972	−0.097	0.116	0.061	−0.122	0.986
Raw coal CO₂ emissions	0.874	−0.423	0.076	0.128	−0.166	0.993
Oil consumption	−0.591	−0.089	0.739	−0.115	−0.212	0.961
Natural gas CO₂ emissions	0.886	0.406	−0.040	−0.095	0.092	0.969
Crude oil CO₂ emissions	−0.288	0.326	0.822	0.233	0.235	0.974

Table 4. Component matrix after rotation.

	Component
	1	2	3	4	5
GDP	0.804
GDP per capita	0.947
Urbanization rate				0.917
Ratio of foreign direct investment to GDP	0.912
Expenditure on R&D	0.713
Population size (10,000 people)				0.860
Population per unit area				0.670
Coal Consumption					0.897
Oil Consumption					0.740
Natural gas consumption					0.780
Energy Consumption					0.946
Value added of primary industry as a percentage of GDP	0.945
Value added of secondary industry as % of GDP	0.992
Value added of tertiary industry as a share of GDP	0.948
Value added of industry (billion CNY)	0.876
Total electricity consumption		0.911
Total electricity generation		0.938
Total thermal power generation		0.934
Carbon dioxide emissions from crude coal			0.976
Carbon dioxide emissions from crude oil			0.955
Natural gas CO₂ emissions			0.693

Table 5. Summary table of model regression coefficients.

X	→	Y	Unstandardized Path Coefficient	SE	z (CR Value)	p	Standardized Path Coefficient
Economic and Industrial Structure	→	Carbon Emissions	9.425	0.207	45.466	0.000	0.941
Electricity Intensity	→	Carbon Emissions	3.527	0.501	7.044	0.000	0.152
Energy Carbon Intensity	→	Carbon Emissions	0.234	0.706	0.332	0.526	−0.014
Social Development	→	Carbon Emissions	3.585	1.070	3.352	0.001	0.072
Energy Structure	→	Carbon Emissions	−0.982	1.548	−0.634	0.000	−0.144
Intensity of Electricity Consumption	→	Economic and Industrial Structure	−0.000	1.127	−0.000	0.007	0.012
Social Development	→	Economic and Industrial Structure	9.748	1.907	5.112	0.000	1.877
Economic and Industrial Structure	→	Social Development	−0.366	0.206	−1.779	0.0075	−1.889
Energy Carbon Emission Intensity	→	Energy Structure	0.000	0.099	0.000	1.000	0.000

→ represents the conduction relationship.
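
As an illustration of how direct and indirect effects can be read from Table 5, the sketch below applies the textbook product-of-coefficients rule to one mediated path, Social Development → Economic and Industrial Structure → Carbon Emissions, using the standardized coefficients in the table. The helper names are hypothetical. The calculation ignores the feedback path from Economic and Industrial Structure back to Social Development, so it may differ from how the paper combines the effects.

```python
def indirect_effect(path_coefficients):
    """Product of the standardized coefficients along one path."""
    effect = 1.0
    for coefficient in path_coefficients:
        effect *= coefficient
    return effect


def total_effect(direct, indirect_paths):
    """Direct effect plus the sum of the indirect effects over the listed paths."""
    return direct + sum(indirect_effect(path) for path in indirect_paths)


# Standardized path coefficients taken from Table 5.
social_to_economic = 1.877
economic_to_carbon = 0.941
social_to_carbon = 0.072

mediated_path = [social_to_economic, economic_to_carbon]
print(indirect_effect(mediated_path))
print(total_effect(social_to_carbon, [mediated_path]))
```

On these figures the mediated contribution of Social Development is much larger than its direct coefficient. Effects of this kind are what a regression on direct terms alone would miss.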

Table 6. Comparison of the performance of the prediction models.

Model	MAPE	RMSE	Calculation Time (s)
GM	3.4478%	718,367.49	17.7535875
LSTM	2.6519%	87,139.28	18.805651
ARIMA-SVR	2.4545%	238,746.13	17.1570467
PCA-PA-MBGD	1.7372%	629,082.0674	14.996245


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
