Article

Predicting Energy-Based CO2 Emissions in the United States Using Machine Learning: A Path Toward Mitigating Climate Change

1  College of Natural Resources and Environment, Joint Institute for Environmental Research & Education, South China Agricultural University, Guangzhou 510642, China
2  School of Engineering and Applied Science, The George Washington University, Washington, DC 20052, USA
3  Department of Civil and Environmental Engineering, University of Maryland, College Park, MD 20742, USA
4  Institute of Agricultural Resources and Environment, Guangdong Academy of Agricultural Sciences, Guangzhou 510640, China
*  Author to whom correspondence should be addressed.
These authors contributed equally to this study.
Sustainability 2025, 17(7), 2843; https://doi.org/10.3390/su17072843
Submission received: 26 January 2025 / Revised: 26 February 2025 / Accepted: 11 March 2025 / Published: 23 March 2025

Abstract

Climate change is one of the most pressing global challenges, with the potential to threaten ecosystems, human populations, and weather patterns over time. Impacts such as rising sea levels and soil salinization are driven primarily by human activities, particularly fossil fuel combustion for energy production. The resulting greenhouse gas (GHG) emissions, particularly carbon dioxide (CO2) emissions, amplify the greenhouse effect and accelerate global warming, underscoring the urgent need for effective mitigation strategies. This study investigates the performance and outcomes of various machine learning regression models for predicting CO2 emissions. A comprehensive evaluation of performance metrics, including R2, mean absolute error, mean squared error, root-mean-squared error, and cross-validation scores, was conducted for decision tree, random forest, multiple linear regression, k-nearest neighbors, gradient boosting, and support vector regression models. The largest source of CO2 emissions was coal (46.11%), followed by electricity (26.70%) and natural gas (25.49%). Random forest and gradient boosting both performed well, but multiple linear regression had the highest prediction accuracy among the machine learning models (R2 = 0.98 training, 0.99 testing). Support vector regression (SVR) and k-nearest neighbors (KNN) demonstrated lower accuracies, whereas the decision tree displayed overfitting. Based on sensitivity analysis, the decision tree, random forest, multiple linear regression, and gradient boosting models were found to be highly sensitive to coal, natural gas, and petroleum (transportation sector). Random forest and gradient boosting were most sensitive to coal usage, whereas KNN and SVR maintained excellent R2 scores (0.94–0.98) but were less susceptible to changes in the variables.
This analysis provides insights into the agreement and discrepancies between predicted and actual CO2 emissions, highlighting the models’ effectiveness and potential limitations.

1. Introduction

Climate change is a pressing issue with far-reaching consequences that potentially threaten billions of lives [1,2]. The consequences of climate change include rising sea levels, flooding, salinization, loss of soil biodiversity, landslides, desertification, and erosion acceleration [3,4]. These changes have profound implications for ecosystems, weather patterns, and human populations across the globe. Scientific evidence supports the assertion that the primary driver of climate change is human activity, especially in the domain of energy production and consumption [5]. One of the major concerns is the impact of human energy use and consumption, which significantly contributes to the emission of greenhouse gases [6]. The combustion of fossil fuels, such as coal, oil, and natural gas, releases substantial amounts of CO2 into the atmosphere, amplifying the greenhouse effect and intensifying global warming [7]. Mitigating the effects of global warming has become an urgent priority in safeguarding the future of our planet. As nations strive to address the challenges posed by global warming, it becomes imperative to accurately quantify and predict greenhouse gas (GHG) emissions, which play a significant role in exacerbating this issue [8].
In the United States, energy consumption represents a critical aspect of human activity, and it is strongly associated with CO2 emissions [9]. It should be noted that the total primary energy consumption in the U.S. witnessed a significant increase from 97 quadrillion British thermal units (BTU) in 2021 to 100 quadrillion BTU in 2022 [10]. This escalation in energy consumption further emphasizes the necessity for accurate methodologies to quantify and predict CO2 emissions to better understand the environmental impact of human energy consumption [11]. The development of a reliable methodology for the calculation and forecasting of CO2 emissions is of paramount importance. Precise calculations lead to essential data for policymakers, scientists, and researchers to make informed decisions and craft effective strategies for mitigating global warming [12]. Additionally, forecasting future CO2 emissions allows for the assessment of potential scenarios and the formulation of targeted measures to minimize the adverse effects of climate change.
To date, different machine learning methods have been successfully applied to estimate the carbon footprint and conduct other environmental impact studies [13,14]. The application of machine learning in these fields has received a lot of attention, including investigating the environmental impact characterization factors, estimating CO2 emissions, and managing energy sustainability. Kumari and Singh [15] highlighted India’s high CO2 emission levels and their potential detrimental effects on the environment. Their study explored six models, including statistical, machine learning, and deep learning-based models, based on time-series data from 1980 to 2019. The performance analysis concludes that the long short-term memory (LSTM) model outperforms others and is recommended as the most suitable for predicting CO2 emissions. Ahmed, Shuai, and Ahmed [16] investigated the increasing greenhouse gas emissions driven by energy consumption, particularly in China, India, the USA, and Russia. The study forecasted greenhouse gas emissions from 2019 to 2023, employing advanced machine learning algorithms. The LSTM model demonstrated promising accuracy in predicting CO2, methane, and nitrous oxide (N2O) emissions for these countries. Yamaka, Phadkantha, and Rakpho [17] proposed using three machine learning models (LASSO regression, Ridge regression, and Elastic net regression) to understand the economic and energy impacts on climate change through greenhouse gas emissions in China and the USA. These models can effectively address the limitations of the ordinary least squares (OLS) model, facilitating a deeper exploration of the economic and energy influences on greenhouse gas emissions. Mardani, Liao, Nilashi, Alrasheedi, and Cavallaro [18] presented a methodology for predicting CO2 emissions, with a specific focus on energy consumption and economic growth as pivotal factors. 
The study employed clustering and machine learning techniques in conjunction with dimensionality reduction for precise predictions. Through the application of the self-organizing map clustering algorithm, data were organized into clusters, and prediction models were constructed using the adaptive neuro-fuzzy inference system and artificial neural networks for each cluster. This approach facilitated accurate predictions for the Group of 20 (G20) countries, taking into account economic growth and energy consumption.
The integration of machine learning methods with life cycle assessment (LCA) enables a more comprehensive assessment of environmental impacts [19]. While LCA provides a structured framework for evaluating the entire life cycle of a product or process, machine learning techniques can enhance this analysis by incorporating complex relationships and patterns that may be difficult to capture using traditional LCA methods alone. In addition, many environmental systems are complex and characterized by uncertainty [20]. Machine learning techniques, with their ability to handle large datasets and capture intricate relationships, can enhance LCA by addressing this complexity and uncertainty [21]. Integrating machine learning can provide more nuanced insights into the variables that influence environmental impacts. Notably, existing studies on CO2 emission prediction using machine learning techniques often lack a comprehensive comparison with CO2 emissions calculated through LCA methods [22]. While numerous studies focus on machine learning-based predictions, few have explored the alignment of these predictions with LCA-calculated CO2 emissions. This research gap highlights the need for further investigation to assess the accuracy and reliability of machine learning models in estimating CO2 emissions when compared to established LCA methodologies [23,24]. Therefore, this study aims to (i) calculate CO2 emissions generated from 50 years of energy consumption in the USA using the LCA approach; (ii) examine six machine learning models for CO2 emission forecasting and evaluate their performance using three metrics: mean absolute error (MAE), mean squared error (MSE), and root-mean-squared error (RMSE); and (iii) conduct a comparative analysis of predicted CO2 emissions from different machine learning models and LCA methodologies.

2. Methodology

2.1. Data Collection

Energy consumption and CO2 data were collected from the U.S. Energy Information Administration (EIA). These datasets encompass historical U.S. energy statistics, including total energy production from petroleum, natural gas, coal, and electricity, as well as carbon dioxide emissions. The data span from 1973 to 2022 and are presented on a monthly basis. In this study, 600 data points for each energy type were extracted and used for machine learning regression and for the CO2 evaluation based on LCA calculation. In addition, 600 CO2 data points were used for machine learning regression. Table 1 shows a descriptive analysis of the energy and CO2 emission dataset, indicating the mean, median, standard deviation, standard error, minimum, and maximum values. Upon examining these values, a clear trend in energy consumption and CO2 emissions can be observed. At 600 data points per category, the sample size is relatively moderate, which could result in overfitting; in that case, the model might not generalize well to unobserved data. In addition, a limited dataset could make it more difficult for the model to identify seasonal fluctuations and long-term patterns in CO2 emissions and energy use.
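The descriptive statistics reported in Table 1 can be reproduced with pandas. The sketch below uses hypothetical synthetic series in place of the EIA data; the column names and distribution parameters are illustrative assumptions, not the published dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly energy-consumption series standing in for the EIA data
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "coal": rng.normal(7.09e4, 1.72e4, 600),         # thousand short tons (illustrative)
    "natural_gas": rng.normal(1.87e3, 4.90e2, 600),  # billion cubic feet (illustrative)
})

# Descriptive statistics analogous to Table 1: mean, median, std, min, max
summary = df.agg(["mean", "median", "std", "min", "max"]).T
# Standard error of the mean: SE = std / sqrt(n)
summary["std_error"] = df.std() / np.sqrt(len(df))
print(summary)
```

Each row of `summary` then corresponds to one energy type, mirroring the layout of Table 1.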

2.2. Principal Component Analysis

Principal component analysis (PCA) was conducted in Python version 3.8.12 in JupyterLab using the scikit-learn package, version 0.24.2 [25]. The energy consumption dataset was loaded and centralized using the Pandas library (version 1.1.3) and was further analyzed by the PCA algorithm from the scikit-learn library. The explained variance ratio (ratio of the eigenvalue of eigenvectors to total eigenvalues) attribute of the PCA object was calculated to evaluate the variance explained by each principal component. This provided the explained variance for each principal component. Additionally, the cumulative variance was obtained by computing the cumulative sum of the explained variance ratios using the NumPy library (version 1.21.2). To visualize the variance explained by different numbers of principal components, an elbow plot was generated. The plot was created using a bar chart, where the x-axis represented the number of principal components, and the y-axis represented the explained variance. Additionally, a line plot was overlaid on the bar chart to display the cumulative explained variance. Furthermore, the variance covered by the first and second principal components was computed. This was achieved by extracting the explained variance ratios for the respective components from the PCA object.
To perform the PCA, a PCA object was initialized with two components. First, the PCA model was fitted to the input data to capture the underlying covariance structure. Subsequently, the data were transformed into a reduced-dimensional space to enable more efficient analysis. The fit_transform function from the scikit-learn library was used by combining the fitting and transformation steps into a single operation. By utilizing this method, the PCA model can determine the principal components by analyzing the covariance structure of the input data. These principal components were then employed to project the original data onto a new space, known as the principal component space. After that, a scatter plot was created to visualize the projected data in the two-dimensional principal component space. The x-axis represented the first principal component, and the y-axis represented the second principal component. Finally, each data point was colored based on the corresponding target variable, creating a visual representation of the relationship between the principal components and the CO2 emissions.
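The PCA workflow described above can be sketched as follows; the array `X` is a synthetic stand-in for the eight-feature energy dataset, so the variance figures it yields are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))  # stand-in for the 8 energy-consumption features

# Full PCA to inspect the explained variance (input for the elbow plot)
pca_full = PCA().fit(X)
explained = pca_full.explained_variance_ratio_   # per-component variance share
cumulative = np.cumsum(explained)                # cumulative variance share

# Two-component projection, combining fit and transform in one call
pca2 = PCA(n_components=2)
X_proj = pca2.fit_transform(X)
print(X_proj.shape)  # (600, 2)
```

The `inverse_transform` method of `pca2` can then map the projected points back to the original eight-dimensional scale, as done for the superimposed scatter plots.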

2.3. Methods for Calculating CO2 Emissions from Life Cycle Assessments (LCAs)

The monthly energy consumption data from 1973 to 2022 and the emission factors for LCA calculations are presented in Table 1 and Table 2, respectively. The emission factors for coal and natural gas were collected from the U.S. Environmental Protection Agency (EPA) GHG Emission Factors Hub [26]. The emission factor for electricity in this study was obtained from the U.S. EIA. In 2021, U.S. utility-scale power plants generated 4.11 trillion kWh of electricity, resulting in 1.65 billion metric tons (1.82 billion short tons) of CO2 emissions. Hence, the CO2 emission factor for electricity was determined to be 0.855 pounds/kWh (0.388 kg CO2/kWh) [27]. The emission factors for total petroleum consumption across various sectors (residential, commercial, industrial, transportation, and electric power) were computed based on a series of energy sources, such as motor gasoline, distillate fuel, aviation gasoline, jet fuel, butane, ethane, propane, pentanes, and other oils. This calculation also considered the distribution of petroleum product consumption by sources and sectors [28]. Figure S1 in Supplemental Data S2 shows the percentage share of petroleum product consumption (motor gasoline, distillate fuel, hydrocarbon gas liquids, aviation gasoline, jet fuel) and their usage distribution across different sectors [27]. The emission factors were calculated as the weighted average of the percentage contribution of each energy source and their consumption data. The emission factor for hydrocarbon gas liquids varies based on the source and specific mix, and the percentage of individual components is also dependent on the local atmospheric pressure. Hydrocarbon gas liquids usually contain four components. In this study, the percentage of each individual component is assumed to be 25%. The emission factor for hydrocarbon gas liquids was determined by multiplying the emission factors of its four major components (butane, ethane, propane, and pentanes) by 25%. 
The CO2 emissions from the LCA of each energy consumption from different sectors in the USA were calculated by multiplying the material consumption data with the corresponding emission factors (Equation (1)). The total CO2 emissions (Gtotal, CO2) were the sum of the CO2 emissions from each energy consumption (Gi, CO2) from different energy sectors (Equation (2)).
G_{i,CO2} = E_i × EF_{i,CO2},  (1)
G_{total,CO2} = Σ_{i=1}^{8} G_{i,CO2},  (2)
where i (i = 1, 2, 3, 4, 5, 6, 7, 8) denotes the energy types for the different sectors; Ei denotes the energy consumption data for the different sectors which are summarized in Table 1; and EFi, CO2 are the CO2 emission factors (EFs) for the different energy sources i, which are summarized in Table 2.
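Equations (1) and (2) amount to an element-wise product followed by a sum. A minimal sketch, with hypothetical consumption values and emission factors standing in for Tables 1 and 2 (the numbers below are illustrative assumptions, not the published factors):

```python
import numpy as np

# Hypothetical E_i (consumption) and EF_i (emission factors) for 8 energy types
E = np.array([7.09e4, 1.87e3, 2.75e5, 4.79e2, 8.12e2, 4.67e3, 1.18e4, 1.0e3])
EF = np.array([2.6, 5.5e-2, 3.88e-1, 0.43, 0.43, 0.43, 0.43, 0.43])

G_i = E * EF          # Equation (1): per-source CO2 emissions
G_total = G_i.sum()   # Equation (2): total CO2 emissions across the 8 sources
```

In the study itself, this product is taken month by month over the 1973–2022 series to produce the 600 LCA-based CO2 values.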

2.4. Machine Learning Models

In this study, six machine learning models, decision tree, random forest, multiple linear regression, k-nearest neighbors, gradient boosting, and support vector regression, were used to predict the CO2 emissions generated in the USA. Decision trees are easy to comprehend but prone to overfitting, whereas multiple linear regression is straightforward and interpretable. Random forest and gradient boosting can handle nonlinearities and effectively capture feature interactions. Finally, k-nearest neighbors and support vector regression can handle both linear and nonlinear relationships, allowing a comprehensive comparison of model performance. To improve the generalizability of the training model, the input data for the regression models were shuffled using a random seed of 42 and then divided into two subsets: a training set and a testing set [29]. The testing set accounted for 30% of the total data, with the remaining 70% used for training. To ensure reproducibility, a random state of 42 was set for the data splitting process. The regression models were implemented using the scikit-learn library by creating instances of the respective classes and fitting the models to the training data, which included input features (training independent variables) and target variables (training dependent variables). After that, the predicted target variables were derived from the test input features. For the decision tree model, a decision tree regressor was initialized with a maximum depth of 6 (max_depth = 6) to prevent excessive growth and model overfitting. To optimize the random forest model, a tuning process exploring different numbers of trees (50, 250, 500, 750, and 1000) was conducted to identify the optimal configuration.
By systematically exploring these parameters, the best random forest model that would yield accurate results for the dataset could be identified. For k-nearest neighbors regression, a range of k values from 1 to 21 was defined to find the best k value. Hyperparameter tuning was performed using cross-validation: different values of k within the specified range were tried, and the optimal value (k = 3) was selected as that which resulted in the best R-squared score on the validation set. To address sensitivity to the feature scale, feature scaling was applied using the StandardScaler from sklearn.preprocessing, ensuring all features had zero mean and unit variance. The target variable was predicted for the validation set, and the mean squared error (MSE) and R-squared score were calculated to evaluate the performance of the model. The support vector regression model is also sensitive to the scale of the features; in this study, the features of the training data were likewise scaled using the StandardScaler. The original model used all eight features for prediction. However, not all features may be equally important for predicting the target variable (CO2 emissions). By performing feature selection using recursive feature elimination with a linear kernel, the top four most important features were selected, which can potentially improve the model's generalization and prevent overfitting. In addition, fivefold cross-validation was conducted to obtain a more robust estimation of the models' performance, with the cross_val_score function from scikit-learn used to calculate the R2 and MSE scores for each fold. To evaluate the performance of the models, several metrics were calculated, including the R-squared (R2) score, mean absolute error (MAE), MSE, and root-mean-squared error (RMSE). The mean R2 and MSE were determined across all folds. The R2 score was computed to assess the accuracy and predictive power of the models.
The MAE, MSE, and RMSE were computed to quantify the level of deviation between the predicted CO2 and the testing CO2 from model training and 5-fold cross-validation.
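The training pipeline described above can be sketched as follows. The synthetic features and target are illustrative assumptions, and only three of the six models are shown for brevity; the remaining models are fitted and scored the same way.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

rng = np.random.default_rng(42)
X = rng.normal(size=(600, 8))                                  # stand-in features
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=600)   # synthetic target

# Shuffled 70/30 split with random_state=42, as described in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "decision_tree": DecisionTreeRegressor(max_depth=6, random_state=42),
    "random_forest": RandomForestRegressor(n_estimators=250, random_state=42),
    "linear": LinearRegression(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    r2 = r2_score(y_te, pred)
    mae = mean_absolute_error(y_te, pred)
    rmse = np.sqrt(mean_squared_error(y_te, pred))
    # Fivefold cross-validation for a more robust performance estimate
    cv_r2 = cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2").mean()
```

On this synthetic linear target, multiple linear regression naturally dominates; on the real emissions data the relative ranking is an empirical question, which is what Table 3 reports.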

2.5. Performance Metrics

In this study, three performance metrics (i.e., MAE, MSE, and RMSE) were used to determine the usability and effectiveness of these models [30]. MAE is a widely utilized evaluation metric in machine learning and regression analysis. It quantifies the average magnitude of errors between predicted and actual values, providing a measure of the average error. MAE is typically less sensitive to outliers, making it particularly valuable in scenarios where minimizing the average error is the primary objective. MSE is another widely employed metric to evaluate the effectiveness of regression. It quantifies the average of squared differences between predicted and actual values, which emphasizes larger errors. RMSE is more easily interpretable compared to MSE and strikes a balance between highlighting larger errors and maintaining interpretability due to it possessing the same unit of measurement as the target variable. The formulas for MAE, MSE, and RMSE are indicated in Equations (3)–(5), respectively.
MAE = (1/n) Σ_{i=1}^{600} |ŷ_i − y_i|,  (3)
MSE = (1/n) Σ_{i=1}^{600} (ŷ_i − y_i)^2,  (4)
RMSE = √[(1/n) Σ_{i=1}^{600} (ŷ_i − y_i)^2],  (5)
where i denotes the i-th data point; ŷ_i and y_i denote the testing and predicted CO2 emissions, respectively; and n denotes the total number of data points.
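Equations (3)–(5) can be checked with a small worked example; the three-point arrays below are illustrative, not study data.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0])   # testing values
y_pred = np.array([11.0, 11.0, 15.0])   # predicted values

mae = np.mean(np.abs(y_pred - y_true))   # Equation (3): mean absolute error
mse = np.mean((y_pred - y_true) ** 2)    # Equation (4): mean squared error
rmse = np.sqrt(mse)                      # Equation (5): root of the MSE
print(mae, mse, rmse)  # 1.0 1.0 1.0
```

Because every error here has magnitude 1, all three metrics coincide; with mixed error magnitudes, MSE and RMSE weight the larger deviations more heavily than MAE.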

2.6. Sensitivity Analysis

To perform the sensitivity analysis, six energy types from different sectors (as listed in Table 1) were selected for evaluation. The analysis was carried out by iteratively perturbing the selected variables at different sensitivities. For each sensitivity level, the original values of the variable were modified accordingly, and the predictions were made using the modified dataset. The R-squared (R2) score was then calculated to assess the impact of the perturbation on the model’s performance. The results of the sensitivity analysis were collected to construct a sensitivity matrix, where each row represented a variable and each column corresponded to a specific sensitivity level. This matrix provided a comprehensive overview of the model’s performance under different perturbation scenarios. To visually present the sensitivity analysis results, a heatmap was generated. The x-axis of the heatmap represents the sensitivity factor, ranging from 0.9 to 1.1, while the y-axis displays the variables under analysis. The resulting heatmap provided an intuitive and informative visualization of the sensitivity analysis outcomes.
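The perturbation loop described above can be sketched as follows, using a synthetic linear dataset and model as stand-ins; the six features and their coefficients are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6)) + 5                      # six energy-type features
y = X @ np.array([3.0, 2.0, 1.5, 0.5, 0.2, 0.1])       # synthetic target
model = LinearRegression().fit(X, y)

# Perturb each feature by factors 0.9..1.1 and record the resulting R2 score
factors = np.linspace(0.9, 1.1, 5)
sensitivity = np.zeros((X.shape[1], len(factors)))
for i in range(X.shape[1]):
    for j, f in enumerate(factors):
        X_pert = X.copy()
        X_pert[:, i] *= f                              # scale one variable at a time
        sensitivity[i, j] = r2_score(y, model.predict(X_pert))
# Rows = variables, columns = sensitivity levels: the matrix behind the heatmap
```

Features with large coefficients (here, the first) lose the most R2 under perturbation, which is exactly the pattern the heatmap in the study is designed to expose.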

2.7. Statistical Analysis

The methodology utilized in this study involves conducting various statistical analyses to examine the relationships and differences among the CO2 emissions from LCA and the testing and predicted CO2 emissions. Specifically, functions from the scipy.stats library (version 1.7.1) were employed to perform the Mann–Whitney U test and to calculate Pearson correlation coefficients and p-values. The Mann–Whitney U test was conducted to assess the significance of differences between variables: the CO2 emissions from LCA were compared to the testing CO2 emissions, and the predicted emissions were compared to the testing CO2 emissions. This yielded the respective Mann–Whitney U statistic and p-value for each comparison. The Mann–Whitney U statistic reflects the degree of separation between the rank distributions of the compared variables, while the p-value indicates the probability of observing such differences by chance. Additionally, Pearson correlation coefficients and corresponding p-values were calculated to explore the linear relationships between pairs of variables. The correlation coefficient reflects the strength and direction of the linear relationship, while the p-value indicates the significance of the observed correlation.
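The tests described above map directly onto scipy.stats. In the sketch below, the two emission series are synthetic stand-ins for the LCA-calculated and testing CO2 values.

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr

rng = np.random.default_rng(1)
lca_co2 = rng.normal(100, 10, 180)              # stand-in LCA emissions
testing_co2 = lca_co2 + rng.normal(0, 2, 180)   # stand-in testing emissions

# Mann–Whitney U test for differences between the two samples
u_stat, u_p = mannwhitneyu(lca_co2, testing_co2)

# Pearson correlation for the linear relationship between them
r, r_p = pearsonr(lca_co2, testing_co2)
```

A large Mann–Whitney p-value indicates no detectable shift between the distributions, while a correlation coefficient near 1 with a small p-value indicates close linear agreement, the combination the study reports when predictions track the LCA values well.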

3. Results and Discussion

3.1. Statistical Description of Energy Consumption

Table 1 presents a comprehensive summary of key statistical measures, including the count, mean, median, standard deviation, standard error, minimum, and maximum values of the energy consumption data from 1973 to 2022 in this study. The mean value of coal consumption is 7.09 × 10^4 thousand short tons, and the median value is slightly higher at 7.26 × 10^4 thousand short tons, suggesting a relatively symmetric distribution. The standard deviation is 1.72 × 10^4 thousand short tons, indicating variability or dispersion around the mean value. The standard error, which measures the precision of the mean estimate, is 7.01 × 10^2 thousand short tons. The minimum and maximum values in the dataset are 2.68 × 10^4 and 1.06 × 10^5 thousand short tons, respectively. For natural gas consumption, the mean value is 1.87 × 10^3 billion cubic feet, reflecting the average volume of natural gas. The median value is 1.81 × 10^3 billion cubic feet, suggesting a relatively symmetrical distribution. The standard deviation is 4.90 × 10^2 billion cubic feet, indicating variability in the data. The minimum and maximum values are 9.40 × 10^2 and 3.59 × 10^3 billion cubic feet, respectively.
Regarding electricity consumption, the mean value over 50 years is 2.75 × 10^5 million kilowatt-hours, and the median value is 2.90 × 10^5 million kilowatt-hours, suggesting a symmetric distribution. The standard deviation is 7.17 × 10^4 million kilowatt-hours, indicating variability in the data. The minimum and maximum values are 1.39 × 10^5 and 4.24 × 10^5 million kilowatt-hours, respectively. For petroleum consumption, in the commercial sector, the mean daily consumption is 4.79 × 10^2 thousand barrels, with a median of 4.45 × 10^2 thousand barrels. The standard deviation is 1.90 × 10^2 thousand barrels, reflecting variability. The range spans from 1.73 × 10^2 to 1.45 × 10^3 thousand barrels. For the residential sector, the mean daily consumption is 8.12 × 10^2 thousand barrels, with a median of 7.23 × 10^2 thousand barrels. The standard deviation is 4.26 × 10^2 thousand barrels, showing variability. The consumption range extends from 2.07 × 10^2 to 2.94 × 10^3 thousand barrels. In the industrial sector, the mean daily consumption is 4.67 × 10^3 thousand barrels, with a median of 4.67 × 10^3 thousand barrels. The standard deviation is 4.67 × 10^2 thousand barrels, indicating variability. The consumption range lies between 3.50 × 10^3 and 6.06 × 10^3 thousand barrels. Finally, for the transportation sector, the mean daily consumption is 1.18 × 10^4 thousand barrels, with a median of 1.21 × 10^4 thousand barrels.

3.2. Results of Principal Component Analysis

Figure 1a illustrates the elbow plot for the principal component analysis (PCA) conducted in this study. The explained variance section displays the percentage of variance explained by each individual component. Component 1 accounted for the highest proportion of variance, contributing approximately 48.56% to the overall variability in the dataset. Component 2 followed closely, explaining around 24.49% of the variance. Subsequent components, such as Components 3 and 4, contributed gradually decreasing proportions of variance, indicating diminishing returns in terms of explanatory power. The cumulative percentage of variance explained up to a specific number of components revealed the accumulated contribution of the principal components towards capturing the total variability in the dataset. Components 1–2 together accounted for approximately 73.05% of the variance, Components 1–3 explained around 86.19%, and Components 1–4 captured about 95.30% of the total variance. The cumulative explained variance continued to increase as additional components were included until reaching Components 1–9, which covered 100.00% of the variance. This can also be observed from the scatter plot generated using the inverse_transform method of the PCA model, which reverts each component back to the original data scale. The entire range of components, from 1 to 8, was superimposed onto the original data. In Figure 1b, the green data points are the original data and the red data points are the principal components. When only one component is visualized, it materializes as a collection of points projected onto an axis that aligns with the highest variance in the original dataset. As additional components are incorporated, the resemblance to the original data becomes progressively evident, despite the presentation of reduced information.
The data progressively converge with the original dataset, culminating in an alignment at eight components, ultimately coinciding with the original data’s characteristics. According to the results, the optimal number of components to retain for analysis is two. Scatter plots were created to visualize the projected data in two-dimensional (Figure S2) and three-dimensional principal component space (Figure S3). Each data point was colored based on the corresponding target variable, creating a visual representation of the relationship between the principal components and the CO2 emissions.

3.3. CO2 Emissions from Life Cycle Assessment (LCA)

The CO2 emissions on a monthly basis from 1973 to 2022 are presented in Supplemental Data S1 and were used for the machine learning regression models. Table S1 summarizes the contribution of the total energy consumption and associated CO2 emissions across different sectors, indicating the carbon footprint resulting from energy usage over the study period. Coal was a substantial contributor, with a total consumption of 4.25 × 10^7 thousand short tons; it contributed 1.11 × 10^11 tons of CO2, accounting for 46.11% of the total CO2 emissions (Figure 2). Electricity also contributed considerably to the overall energy consumption, at 1.65 × 10^8 million kilowatt-hours, yet its 6.41 × 10^10 tons of CO2 emissions accounted for only 26.70% of the total. This is mainly attributed to the fact that electricity generation from sources other than coal can result in lower CO2 emissions per unit of electricity produced. Natural gas followed, with a consumption of 1.12 × 10^6 billion cubic feet and 6.12 × 10^10 tons of CO2 emissions, representing a 25.49% share of the energy mix. Petroleum across the different sectors contributed 1.70% of the total CO2 emissions, ranging from 9.71 × 10^7 to 2.75 × 10^9 tons.

3.4. Performance and Results of the Machine Learning Regression Models

Table 3 summarizes the performance of various machine learning models in predicting CO2 emissions. The evaluation metrics used to assess the models include R2 (coefficient of determination), mean absolute error (MAE), mean squared error (MSE), root-mean-squared error (RMSE), and the mean R2 and MSE for fivefold cross-validation. The performance and results for the machine learning regression models are summarized in Figure 3. The R2 value of the decision tree model was 0.98 on the training set, indicating an excellent fit to the training data. However, it showed a lower R2 value of 0.91 on the testing dataset, suggesting a minor drop in generalization ability when applied to the test data. The MSE (229.09), MAE (11.98), and RMSE (15.14) were the highest among these six machine learning models, indicating a poorer performance in regression tasks. The R2 of 0.88 and MSE of 284.33 for fivefold cross-validation further support this. The random forest model exhibited a high accuracy, with an R2 of 0.99 on the training dataset and 0.96 on the testing dataset. The MSE, MAE, and RMSE were 102.27, 7.78, and 10.11, respectively. The R2 of 0.93 and MSE of 151.23 for fivefold cross-validation reflect the model's consistent performance across different cross-validation folds. The multiple linear regression model was the best regression model, demonstrating a high degree of explanatory power, as evidenced by an R2 of 0.98 on the training dataset and an R2 of 0.99 on the testing dataset. The similarity of these two R2 values suggests that the model's performance on the test data is similar to its performance on the training data, indicating a good generalization ability. The low MAE of 4.64 and MSE of 38.14 highlight the model's ability to make precise predictions. The RMSE of 6.18 further confirms the model's accuracy. The R2 of 0.98 and MSE of 39.07 for fivefold cross-validation demonstrate the model's robustness across multiple folds.
The k-nearest neighbors model showed a relatively large gap between the training (0.97) and testing (0.94) R² values, indicating weaker generalization than the previous models. The higher MAE of 8.89 and MSE of 132.92 point to larger deviations in its predictions, and the RMSE of 11.53 confirms the higher level of error. The fivefold cross-validation R² of 0.93 and MSE of 172.73 indicate moderate performance across folds. The gradient boosting model exhibited a high degree of accuracy, with an R² of 1.00 on the training dataset and 0.97 on the testing dataset. The MAE of 6.79 and RMSE of 8.85 reflect the model's ability to make precise predictions, and the MSE of 78.31 further supports its performance. The cross-validation R² of 0.95 and MSE of 124.9 demonstrate the model's consistency in capturing the underlying patterns in the data. The support vector regression model showed the second-best performance, with an R² of 0.98 on both the training and testing datasets; the MSE, MAE, and RMSE were 56.48, 5.38, and 7.52, respectively. The cross-validation R² of 0.97 and MSE of 58.52 reflect consistent performance across folds.
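The article does not publish its code, but the metrics reported above (R², MAE, MSE, RMSE, and the mean fivefold cross-validation score) can be reproduced with scikit-learn along the following lines. This is a minimal sketch on synthetic stand-in data; the linear model and the variable shapes are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the 600 monthly observations and 8 energy features
X = rng.normal(size=(600, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=600)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

r2 = r2_score(y, pred)               # coefficient of determination
mae = mean_absolute_error(y, pred)   # mean absolute error
mse = mean_squared_error(y, pred)    # mean squared error
rmse = np.sqrt(mse)                  # root-mean-squared error

# Mean fivefold cross-validation R2, as reported in Table 3
cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
```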
Several factors may contribute to the suboptimal generalization ability of the k-nearest neighbors model in this study. K-nearest neighbors is often used for classification when data cannot be effectively modeled using linear approaches, as it identifies and employs a nonlinear decision boundary instead [31]. The method operates on distances between data points: given a dataset X with m data points and n numerical features, the Euclidean distance is computed between a query point and every training point. As the number of features n increases, the data points become more dispersed in the high-dimensional space. Consequently, k-nearest neighbors suffers from the curse of dimensionality, necessitating a broader search of the feature space [31]. Another possible reason for the underperformance of k-nearest neighbors and support vector regression is insufficient data. K-nearest neighbors generally benefits from larger datasets, as it relies on the proximity of data points [32]. In this study, 30% of the 600 data points were allocated to the testing set, while the remaining 70% were used for training. With a small dataset, it is difficult for the models to capture the underlying patterns, and they exhibit limited predictive power.
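The distance computation described above can be made concrete in a few lines of NumPy. The `knn_predict` helper below is a hypothetical illustration of distance-based regression, not code from the study; scikit-learn's `KNeighborsRegressor` implements the same idea in production form.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Average the targets of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(1)
X_train = rng.normal(size=(420, 8))   # 70% of 600 points, 8 features
y_train = X_train.sum(axis=1)
prediction = knn_predict(X_train, y_train, np.zeros(8))
```

As n grows, the distances between a query point and its "nearest" neighbors concentrate around the distances to all other points, which is the curse-of-dimensionality effect noted above.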
For the decision tree, the R-squared score on the training data is noticeably higher than that on the testing data, indicating that the model may be overfitting, even though the maximum depth of the tree was limited to 6. Overfitting occurs when a model learns the noise and random fluctuations in the training data rather than the underlying patterns that generalize to unseen data. Insufficient data is another likely cause: when the amount of training data is limited, a decision tree may struggle to generalize, effectively memorizing the training examples instead.
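The overfitting signature described here, a training R² well above the testing R², can be observed directly by comparing the two scores. The sketch below uses scikit-learn on synthetic stand-in data; the 70/30 split and the depth limit of 6 follow the study, but the data do not.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=2.0, size=600)  # noisy target

# 70/30 train/test split, as used in this study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_tr, y_tr)
r2_train = r2_score(y_tr, tree.predict(X_tr))
r2_test = r2_score(y_te, tree.predict(X_te))
# r2_train exceeding r2_test is the overfitting gap discussed above
```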
It should be noted that if the EIA's methodology for collecting energy consumption data were changed, the feature distributions could shift, introducing inconsistencies. This could lead to biased model training, in which the model learns patterns that are no longer valid in more recent data. In addition, the relationships between predictors and CO2 emissions can change over time if the definitions of variables such as coal, natural gas, or petroleum usage change. Moreover, complex models such as random forests and k-nearest neighbors can be less transparent in their decision-making processes. In public policy settings, model transparency may be addressed with methods such as feature importance, SHapley Additive exPlanations (SHAP) values, and Local Interpretable Model-agnostic Explanations (LIME), which describe how models such as random forests arrive at their predictions. By identifying the main variables influencing the model, these techniques improve interpretability and help ensure that policymakers can rely on the model's forecasts and reach well-informed conclusions.
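As one example of the interpretability techniques mentioned above, scikit-learn's permutation importance scores each input by how much randomly shuffling it degrades model performance; SHAP and LIME provide finer-grained, per-prediction attributions. A sketch on synthetic data in which the first feature (a stand-in for coal) deliberately dominates the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 3))
# Hypothetical target dominated by the first feature
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=600)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
# Features ranked from most to least influential
ranking = np.argsort(result.importances_mean)[::-1]
```

Here the dominant predictor receives the largest importance score, which is exactly the kind of explanation policymakers could use to identify coal as the key driver.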

3.5. Comparing the Results of Sensitivity Analysis

A sensitivity analysis was performed on each variable to assess its impact on the R-squared score of the machine learning regression models, where the R-squared score indicates the goodness of fit of the regression model. The decision tree model showed relatively high sensitivity to coal, natural gas, and electricity consumption, with R-squared scores ranging from 0.80 to 0.91 (Figure 4a). Petroleum from sectors other than transportation (R-squared scores: 0.88 to 0.92) consistently had a weak influence on the model's predictive power. Figure 4b shows that the random forest model was most sensitive to coal consumption, with R-squared scores ranging from 0.84 to 0.96; natural gas and petroleum from the transportation sector showed moderate sensitivity, while the other variables had a limited impact on the model's performance. The multiple linear regression model achieved high overall R-squared scores (0.87 to 0.99), with Figure 4c showing strong sensitivity to coal, natural gas, and petroleum from the transportation sector. By contrast, the k-nearest neighbors model maintained relatively high R-squared scores (0.94 to 0.95) for all variables (Figure 4d). The gradient boosting model also showed substantial sensitivity to coal, with R-squared scores ranging from 0.84 to 0.96 (Figure 4e). For the support vector regression model, the R-squared scores across the different energy consumption variables remained high (0.97 to 0.98) and nearly constant across sensitivity factors (Figure 4f), indicating that variations in these variables have little influence on its predictive performance.
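The paper does not spell out the exact perturbation scheme behind Figure 4; a common way to run such a sensitivity analysis is to scale one input variable at a time and record the resulting R². The sketch below works under that assumption, on synthetic data, with the `sensitivity_r2` helper being a hypothetical illustration rather than the authors' code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(4)
X = rng.normal(loc=10.0, scale=2.0, size=(600, 4))
y = X @ np.array([3.0, 1.0, 0.5, 0.1]) + rng.normal(scale=0.5, size=600)
model = LinearRegression().fit(X, y)

def sensitivity_r2(model, X, y, feature, factor):
    # R2 after scaling one input variable, leaving the others unchanged
    X_perturbed = X.copy()
    X_perturbed[:, feature] *= factor
    return r2_score(y, model.predict(X_perturbed))

# Scale each variable by -10% and +10% and record the resulting R2
scores = {f: [sensitivity_r2(model, X, y, f, s) for s in (0.9, 1.1)]
          for f in range(X.shape[1])}
```

A large drop in R² when a variable is perturbed (here, the heavily weighted first feature) marks a sensitive input, while near-constant scores mark an insensitive one, mirroring the behavior reported for SVR.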

3.6. Statistical Analysis of LCA CO2 Emissions, Test CO2 Emissions, and Predicted CO2 Emissions

Statistical analysis was conducted on the six machine learning models to assess their performance in predicting CO2 emissions. Table 4 summarizes the Mann–Whitney U test results. For all models (decision tree, random forest, multiple linear regression, k-nearest neighbors, gradient boosting, and support vector regression), the test indicates a statistically significant difference between the CO2 emissions calculated through LCA and those in the training data (p = 3.06 × 10⁻²⁰). By contrast, most models exhibited consistent performance on new, unseen data: comparing the CO2 emissions predicted on the testing data with those on the training data yielded non-significant p-values (0.75, 0.96, 0.99, 0.94, and 0.99). The k-nearest neighbors model, however, behaved differently. Its extremely low p-value of 6.08 × 10⁻⁴⁷ indicates a highly significant difference between the CO2 emissions predicted on the testing data and those observed during training. This divergence could be attributed to several factors, such as overfitting or sensitivity to outliers in the training data.
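The Mann–Whitney U comparisons in Table 4 can be run with SciPy, alongside a Pearson correlation for point-by-point agreement. The arrays below are illustrative synthetic series, not the study's data.

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr

rng = np.random.default_rng(5)
observed = rng.normal(loc=100.0, scale=10.0, size=180)   # e.g. test-set CO2
predicted = observed + rng.normal(scale=2.0, size=180)   # close predictions

# Two-sided test of whether the two samples come from the same distribution
u_stat, p_value = mannwhitneyu(observed, predicted, alternative="two-sided")
# Pearson correlation measures point-by-point agreement
r, _ = pearsonr(observed, predicted)
```

A large p-value means no detectable distributional difference between the two samples, while a high Pearson r indicates that the predictions track the observations closely.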

4. Conclusions

The dataset, spanning from 1973 to 2022, revealed various statistical measures for the different energy sources. Notably, coal consumption averaged 7.09 × 10⁴ thousand short tons, the average volume of natural gas consumption was 1.87 × 10³ billion cubic feet, electricity consumption averaged 2.75 × 10⁵ million kilowatt-hours, and petroleum consumption varied across sectors. Coal emerged as the largest contributor, accounting for 46.11% of the total CO2 emissions. Electricity and natural gas also made significant contributions, representing 26.70% and 25.49% of the total CO2 emissions, respectively, while petroleum consumption across the different sectors accounted for 1.70%.

The performance of the machine learning regression models in predicting CO2 emissions was evaluated using multiple metrics. The decision tree, random forest, and multiple linear regression models demonstrated high accuracy and precision in their predictions. The k-nearest neighbors and support vector regression models exhibited lower performance, potentially due to outliers or insufficient data; augmenting the dataset or employing cross-validation techniques could enhance their performance. Sensitivity analysis revealed that the decision tree, random forest, multiple linear regression, k-nearest neighbors, and gradient boosting models showed significant sensitivity to coal consumption, whereas the support vector regression model displayed relatively low sensitivity to all energy consumption variables. These findings suggest that the machine learning models employed in this study, with the exception of the support vector regression model, were effective in predicting CO2 emissions. The strong correlations between the different datasets indicate that the models were capable of generalizing well beyond the training data.
This has important implications for future research and decision-making processes related to CO2 emissions, as these models can serve as valuable tools for accurate estimation and forecasting.
The results indicate that multiple linear regression outperformed the other models, showing both high accuracy and good generalization capacity. However, the decrease in R² from the training to the testing dataset indicates that the decision tree tended to overfit. This implies that decision trees can have trouble with unseen data, even when they successfully identify patterns in the training data. To enhance decision tree generalization, future studies should investigate ensemble methods such as bagging and boosting. The significant sensitivity of the random forest and gradient boosting models to coal usage highlights the role played by fossil fuels in GHG emissions. Policymakers could use this information to prioritize policies that target coal use and the transition to renewable energy sources.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/su17072843/s1.

Author Contributions

L.T.: Visualization, Writing—original Draft Preparation, Writing—review and Editing; Z.Z.: Funding Acquisition, Project Administration, Writing—review and Editing; Z.H. and C.Y.: Data Curation, Resources; Y.X.: Validation; K.Z.: Project Administration, Resources; R.J.: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Writing—original Draft Preparation, Writing—review and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangzhou Agricultural Environment and Plant Protection Station (ZBZ-(NH)2024(001)), the National Key R&D Program of China (Grant No.2023YFD1301805), Guangdong Basic and Applied Basic Research Foundation (2023A1515110310), Institute of Agricultural Resources and Environment of GDAAS (ZHS2023-06), and Funding by Science and Technology Projects in Guangzhou (2025A04J4309).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Wang, B.; Liu, J. Impact of climate change on green technology innovation—An examination based on microfirm data. Sustainability 2024, 16, 11206. [Google Scholar] [CrossRef]
  2. Arora, N.K. Impact of climate change on agriculture production and its sustainable solutions. Environ. Sustain. 2019, 2, 95–96. [Google Scholar] [CrossRef]
  3. Hasnat, G.N.T.; Kabir, M.A.; Hossain, M.A. Major environmental issues and problems of south Asia, particularly Bangladesh. In Handbook of Environmental Materials Management; Springer: Berlin/Heidelberg, Germany, 2019; pp. 1–40. [Google Scholar] [CrossRef]
  4. Nyong, A. Climate change impacts in the developing world: Implications for sustainable development. In Climate Change and Global Poverty: A Billion Lives in the Balance?; Brookings Institution Press: Washington, DC, USA, 2009; pp. 43–64. [Google Scholar]
  5. Wolf, J.; Moser, S.C. Individual understandings, perceptions, and engagement with climate change: Insights from in-depth studies across the world. WIREs Clim. Change 2011, 2, 547–569. [Google Scholar] [CrossRef]
  6. Akhmat, G.; Zaman, K.; Shukui, T.; Sajjad, F.; Khan, M.A.; Khan, M.Z. The challenges of reducing greenhouse gas emissions and air pollution through energy sources: Evidence from a panel of developed countries. Environ. Sci. Pollut. Res. Int. 2014, 21, 7425–7435. [Google Scholar] [CrossRef]
  7. Olabemiwo, F.A.; Danmaliki, G.I.; Oyehan, T.A.; Tawabini, B.S. Forecasting CO2 emissions in the Persian gulf states. Glob. J. Environ. Sci. Manag. 2017, 3, 1–10. [Google Scholar] [CrossRef]
  8. Wróbel-Jędrzejewska, M.; Włodarczyk, E.; Przybysz, Ł. Analysis of greenhouse gas emissions of a mill according to the greenhouse gas protocol. Sustainability 2024, 16, 11214. [Google Scholar] [CrossRef]
  9. Khan, I.; Hou, F.; Le, H.P. The impact of natural resources, energy consumption, and population growth on environmental quality: Fresh evidence from the united states of America. Sci. Total Environ. 2021, 754, 142222. [Google Scholar] [CrossRef]
  10. Ashraf, M.; Ayaz, M.; Khan, M.; Adil, S.F.; Farooq, W.; Ullah, N.; Nawaz Tahir, M. Recent trends in sustainable solar energy conversion technologies: Mechanisms, prospects, and challenges. Energy Fuels 2023, 37, 6283–6301. [Google Scholar] [CrossRef]
  11. Zhang, D.; Chen, X.H.; Lau, C.K.M.; Xu, B. Implications of cryptocurrency energy usage on climate change. Technol. Forecast. Soc. Change 2023, 187, 122219. [Google Scholar] [CrossRef]
  12. Shi, X.; Wang, K.; Cheong, T.S.; Zhang, H. Prioritizing driving factors of household carbon emissions: An application of the lasso model with survey data. Energy Econ. 2020, 92, 104942. [Google Scholar] [CrossRef]
  13. Christin, S.; Hervet, É.; Lecomte, N. Applications for deep learning in ecology. Methods Ecol. Evol. 2019, 10, 1632–1644. [Google Scholar] [CrossRef]
  14. Zheng, L.; Lin, R.; Wang, X.; Chen, W. The development and application of machine learning in atmospheric environment studies. Remote Sens. 2021, 13, 4839. [Google Scholar] [CrossRef]
  15. Kumari, S.; Singh, S.K. Machine learning-based time series models for effective CO2 emission prediction in India. Environ. Sci. Pollut. Res. Int. 2023, 30, 116601–116616. [Google Scholar] [CrossRef]
  16. Ahmed, M.; Shuai, C.; Ahmed, M. Analysis of energy consumption and greenhouse gas emissions trend in China, India, the USA, and Russia. Int. J. Environ. Sci. Technol. 2023, 20, 2683–2698. [Google Scholar] [CrossRef]
  17. Yamaka, W.; Phadkantha, R.; Rakpho, P. Economic and energy impacts on greenhouse gas emissions: A case study of China and the USA. Energy Rep. 2021, 7, 240–247. [Google Scholar] [CrossRef]
  18. Mardani, A.; Liao, H.; Nilashi, M.; Alrasheedi, M.; Cavallaro, F. A multi-stage method to predict carbon dioxide emissions using dimensionality reduction, clustering, and machine learning techniques. J. Clean. Prod. 2020, 275, 122942. [Google Scholar] [CrossRef]
  19. Arshad, M.Y.; Saeed, S.; Raza, A.; Ahmad, A.S.; Urbanowska, A.; Jackowski, M.; Niedzwiecki, L. Integrating life cycle assessment and machine learning to enhance black soldier fly larvae-based composting of kitchen waste. Sustainability 2023, 15, 12475. [Google Scholar] [CrossRef]
  20. Ghoroghi, A.; Rezgui, Y.; Petri, I.; Beach, T. Advances in application of machine learning to life cycle assessment: A literature review. Int. J. Life Cycle Assess. 2022, 27, 433–456. [Google Scholar] [CrossRef]
  21. Zhong, S.; Zhang, K.; Bagheri, M.; Burken, J.G.; Gu, A.; Li, B.; Ma, X.; Marrone, B.L.; Ren, Z.J.; Schrier, J.; et al. Machine learning: New ideas and tools in environmental science and engineering. Environ. Sci. Technol. 2021, 55, 12741–12754. [Google Scholar] [CrossRef]
  22. Bormpoudakis, D.; Sueur, J.; Pantis, J.D. Spatial heterogeneity of ambient sound at the habitat type level: Ecological implications and applications. Landsc. Ecol. 2013, 28, 495–506. [Google Scholar] [CrossRef]
  23. Ajala, A.A.; Adeoye, O.L.; Salami, O.M.; Jimoh, A.Y. An examination of daily CO2 emissions prediction through a comparative analysis of machine learning, deep learning, and statistical models. Environ. Sci. Pollut. Res. Int. 2025, 32, 2510–2535. [Google Scholar] [CrossRef] [PubMed]
  24. Zhang, T.; Zhong, W.; Liu, Y.; Lu, R.; Peng, X. Incorporating Large-Scale Economic-Environmental-Energy Coupling Assessment and Collaborative Optimization into Sustainable Product Footprint Management: A Graph-Assisted Life Cycle Energy Efficiency Enhancement Approach. Energy Convers. Manag. 2025, 329, 119616. [Google Scholar] [CrossRef]
  25. Kluyver, T.; Ragan-Kelley, B.; Pérez, F.; Granger, B.; Bussonnier, M.; Frederic, J.; Kelley, K.; Hamrick, J.; Grout, J.; Corlay, S.; et al. Jupyter notebooks—A publishing format for reproducible computational workflows. Position. Power Acad. Publ. Play. Agents Agendas 2016, 87–90. [Google Scholar] [CrossRef]
  26. EPA. Emission Factors for Greenhouse Gas Inventories. 2023. Available online: https://www.epa.gov/climateleadership/ghg-emission-factors-hub (accessed on 10 March 2025).
  27. EIA. How Much Carbon Dioxide Is Produced Per Kilowatthour of U.S. Electricity Generation? 2022. Available online: https://www.eia.gov/tools/faqs/faq.php?id=74&t=11 (accessed on 10 March 2025).
  28. EIA. Monthly Energy Review. 2016. Available online: https://www.eia.gov/totalenergy/data/monthly/ (accessed on 10 March 2025).
  29. Allgaier, J.; Neff, P.; Schlee, W.; Schoisswohl, S.; Pryss, R. Deep learning end-to-end approach for the prediction of tinnitus based on eeg data. In Proceeding of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society, Online Conference, 1–5 November 2021; pp. 816–819. [Google Scholar] [CrossRef]
  30. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef]
  31. Dhriti, D.; Kaur, M. K-nearest neighbor classification approach for face and fingerprint at feature level fusion. Int. J. Comput. Appl. 2012, 60, 13–17. [Google Scholar] [CrossRef]
  32. Song, G.; Rochas, J.; Beze, L.E.; Huet, F.; Magoulès, F. K nearest neighbour joins for big data on mapreduce: A theoretical and experimental analysis. IEEE Trans. Knowl. Data Eng. 2016, 28, 2376–2392. [Google Scholar] [CrossRef]
Figure 1. (a) Elbow plot for the principal component analysis and (b) scatter plots for each component data vs. original data.
Figure 2. Contribution of each energy source to total GHG emissions from 1973 to 2022 in the USA.
Figure 3. Summary of performance and results for different machine learning regression models.
Figure 4. Results of sensitivity analysis for different machine learning regression models.
Table 1. Descriptive analysis of the energy consumption and CO2 emissions in the USA over 50 years.

| Statistic | Coal (Thousand Short Tons) | Natural Gas (Billion Cubic Feet) | Electricity (Million Kilowatt-Hours) | Petroleum, Commercial (Thousand Barrels/Day) | Petroleum, Residential (Thousand Barrels/Day) | Petroleum, Industrial (Thousand Barrels/Day) | Petroleum, Transportation (Thousand Barrels/Day) | Petroleum, Electric Power (Thousand Barrels/Day) |
|---|---|---|---|---|---|---|---|---|
| Count | 6.00 × 10² | 6.00 × 10² | 6.00 × 10² | 6.00 × 10² | 6.00 × 10² | 6.00 × 10² | 6.00 × 10² | 6.00 × 10² |
| Mean | 7.09 × 10⁴ | 1.87 × 10³ | 2.75 × 10⁵ | 4.79 × 10² | 8.12 × 10² | 4.67 × 10³ | 1.18 × 10⁴ | 5.65 × 10² |
| Median | 7.26 × 10⁴ | 1.81 × 10³ | 2.90 × 10⁵ | 4.45 × 10² | 7.23 × 10² | 4.67 × 10³ | 1.21 × 10⁴ | 4.49 × 10² |
| Standard deviation | 1.72 × 10⁴ | 4.90 × 10² | 7.17 × 10⁴ | 1.90 × 10² | 4.26 × 10² | 4.67 × 10² | 1.81 × 10³ | 4.83 × 10² |
| Standard error | 7.01 × 10² | 0.20 × 10² | 2.93 × 10³ | 0.08 × 10² | 0.17 × 10² | 0.19 × 10² | 0.74 × 10² | 0.20 × 10² |
| Minimum | 2.68 × 10⁴ | 9.40 × 10² | 1.39 × 10⁵ | 1.73 × 10² | 2.07 × 10² | 3.50 × 10³ | 8.14 × 10³ | 0.60 × 10² |
| Maximum | 1.06 × 10⁵ | 3.59 × 10³ | 4.24 × 10⁵ | 1.45 × 10³ | 2.95 × 10³ | 6.06 × 10³ | 1.50 × 10⁴ | 2.45 × 10³ |
Table 2. Emission factors for the LCA approach in this study.

| Energy | Emission Factor | Unit |
|---|---|---|
| Anthracite Coal | 2602 | kg CO2 per short ton |
| Natural Gas | 0.054 | kg CO2 per standard cubic foot (scf) |
| Motor Gasoline | 8.78 | kg CO2 per gallon |
| Distillate Fuel Oil No. 1 | 10.18 | kg CO2 per gallon |
| Aviation Gasoline | 8.31 | kg CO2 per gallon |
| Jet Fuel (kerosene type) | 9.75 | kg CO2 per gallon |
| Butane | 6.67 | kg CO2 per gallon |
| Ethane | 4.05 | kg CO2 per gallon |
| Propane | 5.72 | kg CO2 per gallon |
| Pentanes | 7.7 | kg CO2 per gallon |
| Other Oil (>401 deg F) | 10.59 | kg CO2 per gallon |
| Electricity | 0.388 | kg CO2 per kWh |
| Petroleum Consumed by the Residential Sector | 8.054 | kg CO2 per gallon |
| Petroleum Consumed by the Commercial Sector | 8.054 | kg CO2 per gallon |
| Petroleum Consumed by the Industrial Sector | 8.095 | kg CO2 per gallon |
| Petroleum Consumed by the Transportation Sector | 9.226 | kg CO2 per gallon |
| Petroleum Consumed by the Electric Power Sector | 8.054 | kg CO2 per gallon |
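In an LCA of this kind, each source's CO2 emissions are its consumption multiplied by the corresponding emission factor from Table 2, summed over sources. A minimal sketch follows; the dictionary keys and the `lca_co2` helper are illustrative, and consumption must first be converted to each factor's native unit (short tons, scf, gallons, or kWh).

```python
# Emission factors drawn from Table 2 (kg CO2 per native unit)
EMISSION_FACTORS = {
    "coal": 2602.0,        # kg CO2 per short ton (anthracite)
    "natural_gas": 0.054,  # kg CO2 per standard cubic foot
    "electricity": 0.388,  # kg CO2 per kWh
}

def lca_co2(consumption):
    """Total CO2 (kg) given per-source consumption in each factor's unit."""
    return sum(EMISSION_FACTORS[source] * amount
               for source, amount in consumption.items())

# e.g. 1000 short tons of coal plus 1.0e6 scf of natural gas
total_kg = lca_co2({"coal": 1000.0, "natural_gas": 1.0e6})  # 2,656,000 kg
```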
Table 3. Summary of the performance of the different machine learning models in predicting CO2 emissions.

| Model | R² (Training) | R² (Testing) | Mean Absolute Error | Mean Squared Error | Root-Mean-Squared Error | Mean 5-Fold Cross-Validation R² | Mean 5-Fold Cross-Validation MSE |
|---|---|---|---|---|---|---|---|
| Decision tree | 0.96 | 0.91 | 12.35 | 232.86 | 15.26 | 0.88 | 284.33 |
| Random forest | 0.99 | 0.96 | 7.78 | 102.27 | 10.11 | 0.93 | 151.23 |
| Multiple linear regression | 0.98 | 0.99 | 4.64 | 38.14 | 6.18 | 0.98 | 39.07 |
| K-nearest neighbors | 0.97 | 0.94 | 8.89 | 132.92 | 11.53 | 0.93 | 172.73 |
| Gradient boosting | 1.00 | 0.96 | 7.78 | 102.03 | 10.10 | 0.95 | 124.9 |
| Support vector regression | 0.98 | 0.98 | 5.38 | 56.48 | 7.52 | 0.97 | 58.52 |
Table 4. Summary of the Mann–Whitney U test results for the different machine learning models.

| Model | U (CO2 from LCA vs. CO2 from Training Data) | p-Value | U (CO2 from Testing Data vs. CO2 from Training Data) | p-Value |
|---|---|---|---|---|
| Decision tree | 7100 | 3.06 × 10⁻²⁰ | 15,887 | 0.75 |
| Random forest | 7100 | 3.06 × 10⁻²⁰ | 16,144 | 0.96 |
| Multiple linear regression | 7100 | 3.06 × 10⁻²⁰ | 16,211 | 0.99 |
| K-nearest neighbors | 7100 | 3.06 × 10⁻²⁰ | 2288 | 6.08 × 10⁻⁴⁷ |
| Gradient boosting | 7100 | 3.06 × 10⁻²⁰ | 16,280 | 0.94 |
| Support vector regression | 7100 | 3.06 × 10⁻²⁰ | 16,185 | 0.99 |

Share and Cite

MDPI and ACS Style

Tian, L.; Zhang, Z.; He, Z.; Yuan, C.; Xie, Y.; Zhang, K.; Jing, R. Predicting Energy-Based CO2 Emissions in the United States Using Machine Learning: A Path Toward Mitigating Climate Change. Sustainability 2025, 17, 2843. https://doi.org/10.3390/su17072843



