Aggregated Catalyst Physicochemical Descriptor-Driven Machine Learning for Catalyst Optimization: Insights into Oxidative-Coupling-of-Methane Dynamics and C2 Yields

Ezz, Mohamed; Mostafa, Ayman Mohamed; S. Alaerjan, Alaa; Allahem, Hisham; Aldughayfiq, Bader; M. A. Hassan, Hassan; M. K. Mohamed, Rasha

doi:10.3390/catal15040378

by Mohamed Ezz ¹, Ayman Mohamed Mostafa ²,*, Alaa S. Alaerjan ¹, Hisham Allahem ², Bader Aldughayfiq ², Hassan M. A. Hassan ³ and Rasha M. K. Mohamed ³

¹ Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakaka 72388, Saudi Arabia

² Information Systems Department, College of Computer and Information Sciences, Jouf University, Sakaka 72388, Saudi Arabia

³ Department of Chemistry, College of Science, Jouf University, Sakaka 72388, Saudi Arabia

* Author to whom correspondence should be addressed.

Catalysts 2025, 15(4), 378; https://doi.org/10.3390/catal15040378

Submission received: 11 February 2025 / Revised: 4 April 2025 / Accepted: 11 April 2025 / Published: 13 April 2025

(This article belongs to the Section Computational Catalysis)

Abstract:

This study focuses on optimizing C2 yields in the oxidative coupling of methane (OCM), a pivotal process for sustainable chemical production. By harnessing advanced machine learning (ML) techniques, this research aimed to predict C2 yields and identify the factors that drive catalytic performance. The Extra Trees Regressor emerged as the most effective model after a comprehensive evaluation across multiple datasets and methodologies. Key to the method was the use of an innovative Aggregated Catalyst Physicochemical Descriptor (ACPD) and stratified cross-validation, which effectively addressed feature complexity and target skewness. Hyperparameter optimization using Modified Sequential Model-Based Optimization (SMBO) further enhanced the model’s performance, achieving optimized R² values of 61.7%, 75.9%, and 92.0% for datasets A, B, and C, respectively, with corresponding reductions in the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE). Additionally, SHAP (SHapley Additive exPlanations) analysis provided a detailed understanding of the model’s decision-making process, revealing the relative importance of individual features and their contributions to the predictive outcomes. This research not only achieved state-of-the-art predictive accuracy, but also deepened our understanding of the underlying chemical dynamics, offering practical guidance for catalyst design and operational optimization. These findings mark a significant advancement in catalysis, paving the way for future innovations in sustainable chemical manufacturing.

Keywords:

C2 yields; oxidative coupling of methane; machine learning; catalyst performance; SHAP analysis

1. Introduction

The exploration of innovative methods to convert methane—the most fundamental alkane and a key component of natural gas—into higher-value chemicals remains a critical area in contemporary chemical research. Among these methods, the oxidative coupling of methane (OCM) emerges as a promising strategy, offering a direct route to ethylene, a vital precursor in the petrochemical industry. However, the OCM process is not without its challenges, such as limited selectivity, modest conversion rates, and the requirement for elevated operational temperatures. These hurdles underscore the pressing need for the development of superior catalysts and optimized reaction conditions to improve the efficiency and commercial viability of OCM processes.

In this study, we aim to address these complexities by leveraging machine learning (ML) to gain deeper insights into catalytic systems and to optimize C2 yields. The following critical research questions guide our investigation:

Model Selection and Stratified Validation: How does the selection of an optimal machine learning model, combined with Cross-Validation with Stratified Regression (CVSR), enhance our ability to predict C2 yields? Additionally, what insights can this provide into the target variable’s behavior across diverse catalyst compositions and conditions?
Feature Aggregation and Hyperparameter Optimization: How does the Aggregated Catalyst Physicochemical Descriptor (ACPD) streamline feature representation, and how does hyperparameter tuning with frameworks such as Modified Sequential Model-Based Optimization (SMBO) improve the predictive model’s performance? Furthermore, which features and hyperparameters are the most influential, and how do they affect the model’s interpretability and accuracy?
SHAP Analysis for Interpretability: How does SHAP (SHapley Additive exPlanations) analysis provide comprehensive insights into the contributions of individual features to the model’s predictions? What can this analysis reveal about the complex interactions and nonlinear relationships among features, and how can these findings inform catalyst design and operational optimization?

These research questions form the foundation of a systematic approach that spans from model selection and validation to feature engineering and in-depth interpretative analysis. By integrating state-of-the-art machine learning techniques with rigorous evaluation and interpretability frameworks, this study aims not only to enhance the predictive accuracy of C2 yield models, but also to provide actionable insights into the rational design of catalysts and the optimization of OCM processes. In seeking answers to these questions, this research builds upon the transformative advancements in machine learning (ML) applications for catalysis research, which have unlocked novel insights into data patterns often inaccessible through traditional experimentation. Inspired by the predictive power demonstrated in [1,2], this study aims to elucidate the intricate interplays within catalyst systems to identify groundbreaking catalysts that can revolutionize the OCM process. The insights drawn from these works underscore the ability of ML to address the inherent complexities of catalytic systems, paving the way for enhanced performance and efficiency.

Additionally, the methodologies in this research are informed by contemporary studies that highlight ML’s role in heterogeneous catalyst design, such as the approaches detailed in [3], and the integration of ML with Density Functional Theory (DFT) calculations to predict catalytic performance, as explored in [4]. These examples demonstrate the capability of ML to unravel complex interactions and provide actionable insights that are integral to advancing the OCM process. Furthermore, kinetic modeling and high-throughput experimental techniques have significantly contributed to the understanding of OCM, as evidenced by recent research efforts. For example, the optimization of process variables and their effects on OCM efficiency, as discussed in [5], offer a foundational perspective for refining operational conditions. Similarly, the work presented in [6], which explores the impact of Mg, Ca, Sr, and Ba dopants on La₂O₃ catalysts, provides valuable insights into how compositional changes influence catalytic performance. These advancements underscore the critical role of empirical evaluations in complementing data-driven predictions to refine the discovery of catalysts and optimize processes.

The oxidative coupling of methane (OCM) is fundamentally complex due to the delicate balance required to selectively activate methane molecules for the formation of valuable ethylene (C2) while minimizing over-oxidation to undesired byproducts, such as carbon dioxide (CO₂) and carbon monoxide (CO). Catalyst effectiveness relies heavily on the interplay between catalyst composition, surface properties, and reaction conditions. The elevated temperatures typically required for OCM accelerate both desired and undesired pathways, which makes precise control over reaction parameters important. Moreover, achieving high selectivity toward ethylene remains a challenge due to competing side reactions, catalyst deactivation, and thermal instability. Thus, advanced catalytic systems informed by detailed insights into catalyst structure–performance relationships and reaction dynamics are crucial for the practical and commercially viable operation of OCM.

By synthesizing these diverse research strands, this study aims to develop a comprehensive framework that bridges machine learning predictions, kinetic modeling, and experimental data. This interdisciplinary approach not only enhances the predictive power of ML models, but also provides actionable insights to inform the design and empirical validation of novel catalysts. In doing so, it aspires to transform the OCM process from a promising theoretical concept into a practical, commercially viable technology.

2. Related Work

This paper applies machine learning (ML) to the oxidative-coupling-of-methane (OCM) process with the goal of increasing the production of ethylene, a valuable chemical. It focuses on developing a predictive model to optimize catalyst composition and reaction conditions by analyzing the factors that influence C2 yields. By combining ML with kinetic modeling and experimental data, this study aims to enhance the overall efficiency and commercial viability of the OCM process. The authors of [1] evaluated ML predictions for OCM under experimental conditions, discovering previously unreported catalyst combinations and assessing the challenges in achieving higher C2 yields through a literature-data-driven ML approach. That study emphasized the need for improved ML prediction accuracy through better classification and trend analysis of the experimental data. In [2], an updated dataset of 4759 experimental data points was constructed and analyzed using several ML methods; that study explored Mn/Na₂WO₄/SiO₂ catalyst systems and identified promising catalyst candidates through ML models. As presented in [3], the authors assessed ML-assisted catalyst investigations for OCM using published datasets and systematic high-throughput screening to uncover insights into heterogeneous catalyst design. Furthermore, the authors of [4] utilized ML models and DFT calculations to predict catalytic performance and propose novel bimetallic combinations for effective methane conversion at low temperatures.

As presented in [5], the authors employed high-throughput screening (HTS) and statistical design of experiments (DoE) to investigate the impact of operating parameters on OCM, aiming to optimize the process and enhance the yield and selectivity of C2 products. The authors of [6] investigated lanthanum oxide catalysts enhanced with alkaline–earth metal oxides for OCM, identifying La-Sr and La-Ba as promising catalysts. In addition, the authors of [7] applied multi-component La₂O₃-based catalysts for OCM designed using HTS and literature datasets with ML approaches. This study aimed at improving catalyst design through data-driven methods. As presented in [8], catalyst informatics are combined with high-throughput experimental data to understand the OCM reaction. Machine learning was employed to bridge the gaps between the experimental data points, facilitating a detailed understanding of the reaction. The authors of [9] combined deep learning with high-throughput experimental methods to identify and evaluate active catalysts for OCM, discovering highly active, previously unreported catalysts that mark a milestone in the application of AI in catalysis.

In addition, the authors of [10] reported the use of Pt nanoparticles and CuOx clusters on TiO₂ for OCM in a flow reactor at room temperature, achieving a high yield and selectivity of C2 products, while the authors of [11] utilized experimental data alongside data science techniques to map out the OCM reaction network, establishing a connection between experimental conditions and selectivity, thereby enriching the strategic approach to catalyst and process optimization. The impact of mass and heat transfer was addressed in a reactor design for OCM, emphasizing the importance of understanding these effects for the development of efficient catalytic processes as proposed in [12]. The authors of [13] also reported the high catalytic activity of Li₂CaSiO₄ for OCM, comparing its performance with traditional catalysts and suggesting a crystallographic design concept for OCM catalysts. Furthermore, the authors of [14] analyzed a diverse catalyst dataset from high-throughput experiments, extracting valuable design heuristics for developing catalysts with balanced activity and selectivity, illustrating the power of data analysis in guiding catalyst development strategies, especially highlighting a mixed support approach between La₂O₃ and BaO as a fruitful direction for future catalyst development. Delving into the limitations of the MnO_x-Na₂WO₄/SiO₂ system for OCM at low temperatures, Si et al. [15] not only identified the challenges but also proposed a variant of the catalyst that is active at lower temperatures. This study sheds light on the intricate relationship between catalyst composition and operational temperature, providing valuable insights into overcoming performance bottlenecks.

The field of oxidative coupling of methane (OCM) has seen significant advancements through the integration of machine learning (ML) techniques for catalyst discovery, optimization, and the understanding of reaction mechanisms. The contributions and findings of recent pivotal studies are presented, providing a comprehensive overview that illustrates the progression and current state of machine learning (ML) applications in the OCM research. As presented in [1], the authors undertook a literature-data-driven machine learning (ML) exploration to predict OCM outcomes under various experimental conditions. Their analysis uncovered novel catalyst combinations and pinpointed obstacles to achieving higher C2 yields. A crucial takeaway was the imperative to enhance ML prediction capabilities through more refined data classification and trend analysis, aiming for a more precise identification of promising catalysts. Furthermore, the authors of [2] constructed an extensive dataset comprising 4759 experimental data points, scrutinized through various ML methodologies. It underscored the potential of Mn/Na₂WO₄/SiO₂ catalyst systems, revealing new catalyst candidates for OCM, thereby demonstrating the power of extrapolative ML methods in catalyst exploration. By employing published datasets and systematic high-throughput screening, Nishimura et al. [16] advanced the understanding of a heterogeneous catalyst design for OCM. This approach facilitated the discovery of insightful correlations and design principles, demonstrating the utility of machine learning in streamlining catalyst evaluation processes. The collaborative effort led by Nishimura et al. [7] focused on the data-driven design of La₂O₃-based catalysts, utilizing HTS and ML to navigate the complex landscape of catalyst optimization, highlighting the efficacy of integrating computational and experimental strategies. 
The authors of [8] combined catalyst informatics with high-throughput experimental data to elucidate the gas-phase OCM mechanism. ML filled the gaps in the experimental datasets, providing a detailed reaction landscape and fostering a deeper understanding of OCM dynamics. In addition, the authors of [17] integrated ML models and DFT calculations to predict catalytic performance, proposing novel bimetallic catalysts for efficient methane conversion at lower temperatures. This study highlighted the synergistic potential of computational chemistry and ML in catalysis.

As presented in [10], the authors reported on Pt nanoparticles and CuOx clusters on TiO₂ for OCM, achieving high yields and selectivities for C2 products at room temperature in a flow reactor, thereby showcasing the potential of photocatalysis in OCM. In contrast, the authors of [18] explored the effects of Mg, Ca, Sr, and Ba dopants on La₂O₃ catalysts, identifying La-Sr and La-Ba combinations as particularly promising, offering new avenues for catalyst enhancement. The authors of [19] employed data science and machine learning (ML) to identify suitable catalysts for OCM. They reviewed 1868 OCM catalysts from the literature to identify the essential descriptors that affect C2 yield during the OCM reaction. Their trained ML model predicted 56 new catalysts that first-principles calculations showed to activate CH₄, CH₃, and O₂ while achieving C2 yields exceeding 30%. That research demonstrates that ML has the potential to enhance the discovery and design of catalysts for OCM and similar chemical processes. The authors of [1] evaluated ML predictions for OCM by testing 96 catalysts obtained from the literature and conducting verification trials. That research identified new catalyst pairs but had difficulty achieving C2 yields beyond 30% due to changes in the reactor system. It also demonstrated that data classification techniques in high dimensions enable ML predictions to outperform basic interpolation methods in accuracy.

As presented in [20], a high-throughput screening system was developed for oxidative coupling of methane (OCM), generating a dataset of 12,708 data points for 59 catalysts. It highlights the importance of optimizing both catalyst and reactor design to improve C2 yield and demonstrates that a consistent dataset enables accurate yield predictions using ML. As presented in [21], the authors explored the limitations of a combinatorial catalyst design through the random sampling of 300 quaternary solid catalysts from 36,540 possible candidates to evaluate their performance in the oxidative coupling of methane (OCM) using high-throughput screening. The research demonstrates that synergistic element pairings are crucial to catalyst design, and these pairings can be grouped according to the periodic table’s organization. Decision tree classification serves as a method to enhance the selection of catalysts that produce higher C2 yields. In addition, the authors of [22] evaluated the importance of isothermal data when selecting oxidative-coupling-of-methane (OCM) catalysts that operate in adiabatic reactors. Under oxygen-lean conditions, where CH₄/O₂ ratios exceed 7, the catalyst performance can be more accurately estimated because lower surface oxygen levels result in higher C2+ selectivity. The research provides experimental recommendations and data analysis findings that will aid future benchmarking efforts in industrial settings. The authors of [23] introduced a meta-analysis procedure to discover catalyst performance–property relationships through data synthesis from the literature and textbooks combined with statistical methods. The approach demonstrates that optimal catalysts for the oxidative coupling of methane require stable carbonate structures in conjunction with stable oxide support materials for successful operation. 
The authors of [24] introduced an optimization method that combines kinetic simulations with statistical analysis to develop heterogeneous catalysts. Their research demonstrates that the electronic properties of surface oxygen species play a crucial role in determining catalyst performance during OCM. Furthermore, the authors of [25] developed a unified method for interpreting the predictions of complex machine learning models. SHAP assigns an importance value to each feature for a specific prediction and establishes a class of additive feature importance measures. SHAP-based techniques improve both computational performance and consistency with human intuition, enabling a better balance of accuracy and interpretability.

Studies have consistently shown that OCM performance is sensitive to several critical chemical parameters, including catalyst elemental composition, oxidation states, catalyst-support interactions, and structural stability under reaction conditions. Catalyst promoters, such as alkali metals (Li, Na) and alkaline earth metals (Ba, Sr), have been frequently utilized to enhance ethylene selectivity by modifying the catalyst surface properties and reaction pathways. However, their precise roles vary significantly depending on catalyst compositions, highlighting the complex chemical environment within OCM systems. Additionally, recent experimental and kinetic modeling studies have highlighted the impact of heat and mass transfer limitations on catalyst performance, noting that these factors can significantly influence reaction rates, selectivity, and catalyst lifetime. Therefore, coupling detailed chemical experimentation with computational predictive models is essential to advance catalyst discovery and optimize operational parameters in the OCM research.

These studies collectively underscore the transformative impact of machine learning and computational approaches on the field of OCM, enabling the discovery of novel catalysts, optimization of reaction conditions, and a deeper understanding of the underlying reaction mechanisms. By leveraging big data and high-throughput experimentation, as well as integrating advanced computational models, the research community is progressively unveiling the complexities of OCM, pushing the boundaries of what is achievable in catalytic methane conversion. This body of work not only highlights the importance of interdisciplinary approaches in catalysis research, but also sets a foundation for future explorations aimed at sustainable and efficient methane utilization strategies.

3. Results

Our comprehensive methodology, which combines innovative feature engineering, stratification, and optimization, yielded insightful findings in predictive modeling for oxidative-coupling-of-methane (OCM) reactions. This section details the outcomes and implications of our experiments.

3.1. Experiment 1: Baseline Model Evaluation

In this experiment, we evaluated multiple regression models to establish a baseline for predictive performance, as summarized in Table 1. The models were applied to datasets A, B, and C, and their performance metrics (R², MSE, and RMSE) were recorded.

Based on the results in Table 1, the best performance and winning models are as follows:

Dataset A: LightGBM Regressor achieved the best R² of 0.5594, MSE of 19.45, and RMSE of 4.41.
Dataset B: Extra Trees Regressor performed the best, with an R² of 0.7064, MSE of 13.85, and RMSE of 3.72.
Dataset C: Extra Trees Regressor excelled, achieving an R² of 0.9129, MSE of 1.09, and RMSE of 1.05.

The winning models for each dataset provided a robust starting point for subsequent experiments. These models demonstrated the ability to capture complex patterns, setting the stage for feature engineering and optimization in later experiments.
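For reference, the three metrics reported above are related as follows. The sketch below is a minimal, dependency-free illustration on made-up numbers; the function name and the sample values are assumptions, and it is not the evaluation code used in this study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (R2, MSE, RMSE) for two equal-length sequences of numbers."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    # Residual sum of squares: squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return 1.0 - ss_res / ss_tot, mse, math.sqrt(mse)

# Hypothetical C2 yields (%) and predictions, for illustration only.
observed = [5.0, 10.0, 15.0, 20.0]
predicted = [6.0, 9.0, 16.0, 19.0]
r2, mse, rmse = regression_metrics(observed, predicted)
print(r2, mse, rmse)
```

Because RMSE is the square root of MSE, the reported pairs can be cross-checked against each other; for dataset A, √19.45 ≈ 4.41.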

3.2. Experiment 2: Incorporating Stratification with CVSR

As presented in Table 2, this experiment introduced Cross-Validation with Stratified Regression (CVSR), ensuring a balanced representation of target values across cross-validation folds. Stratification was particularly beneficial for datasets A and B, where the target variable exhibited skewness.

Stratification improved performance for datasets A and B by addressing target skewness and ensuring balanced data distributions during training and validation.
For dataset C, stratification maintained a good performance due to its already well-structured target distribution.
The models showed improved generalizability and robustness, validating the importance of balanced cross-validation for skewed datasets.
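The details of CVSR are not given in this section. One common way to stratify a continuous target, sketched below as an assumption rather than as the procedure used here, is to sort the samples by target value, cut them into quantile-like bins, and deal each bin's members across the folds. The function name, bin count, and synthetic skewed target are all hypothetical.

```python
import random

def stratified_regression_folds(y, n_splits=5, n_bins=10, seed=0):
    """Assign sample indices to folds so every fold spans the target range."""
    rng = random.Random(seed)
    # Indices ordered from the smallest to the largest target value.
    order = sorted(range(len(y)), key=lambda i: y[i])
    bin_size = -(-len(y) // n_bins)  # ceiling division
    folds = [[] for _ in range(n_splits)]
    next_fold = 0
    for start in range(0, len(order), bin_size):
        # One contiguous slice of the sorted indices acts as a quantile-like bin.
        members = order[start:start + bin_size]
        rng.shuffle(members)
        # Deal the bin's members round-robin so each fold gets a share of it.
        for idx in members:
            folds[next_fold % n_splits].append(idx)
            next_fold += 1
    return folds

# Synthetic, right-skewed target for illustration only.
skewed_target = [i ** 2 for i in range(100)]
folds = stratified_regression_folds(skewed_target)
print([len(f) for f in folds])
```

With a plain random split, a fold can end up with few or none of the rare high-yield samples; dealing each bin across folds avoids that by construction.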

3.3. Experiment 3: Feature Aggregation with ACPDs

To address scalability and reduce feature dimensionality, we introduced the Aggregated Catalyst Physicochemical Descriptor (ACPD) as presented in Table 3. This approach aggregates elemental descriptors into a single feature set, ensuring the effective representation of catalyst compositions.

ACPDs significantly improved the model’s performance for all datasets by consolidating the relevant features and reducing noise.
The Extra Trees Regressor achieved the highest R² values, demonstrating its compatibility with the ACPD approach in leveraging the aggregated information effectively.
This experiment validated the power of innovative feature engineering in improving scalability, interpretability, and predictive accuracy.
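The exact aggregation behind the ACPD is not spelled out in this section. A plausible reading, shown below purely as an assumption, is that per-element properties are collapsed into composition-weighted statistics so that the feature vector has the same length regardless of how many elements a catalyst contains. The choice of properties (atomic number and Pauling electronegativity), the statistics (mean, min, max), and the example compositions are all hypothetical.

```python
# Elemental descriptors: atomic number and Pauling electronegativity.
# Only two properties and four elements are listed to keep the sketch short.
ELEMENT_PROPERTIES = {
    "Li": {"atomic_number": 3, "electronegativity": 0.98},
    "Na": {"atomic_number": 11, "electronegativity": 0.93},
    "Mn": {"atomic_number": 25, "electronegativity": 1.55},
    "W": {"atomic_number": 74, "electronegativity": 2.36},
}

def aggregate_descriptors(composition, properties=ELEMENT_PROPERTIES):
    """Collapse an {element: amount} composition into fixed-length features.

    ASSUMPTION: aggregation = composition-weighted mean, plus min and max,
    of each elemental property. The source does not specify the statistics.
    """
    total = sum(composition.values())
    weights = [amount / total for amount in composition.values()]
    property_names = sorted(next(iter(properties.values())))
    features = {}
    for name in property_names:
        values = [properties[element][name] for element in composition]
        features[f"{name}_mean"] = sum(w * v for w, v in zip(weights, values))
        features[f"{name}_min"] = min(values)
        features[f"{name}_max"] = max(values)
    return features

# Hypothetical catalyst compositions (relative molar amounts).
binary = aggregate_descriptors({"Mn": 1.0, "Na": 1.0})
ternary = aggregate_descriptors({"Mn": 2.0, "Na": 1.0, "W": 1.0})
print(binary)
print(ternary)
```

Both catalysts map to the same six features, which is the property that lets one model handle compositions with different numbers of elements without adding a column per element.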

3.4. Experiment 4: Optimization with SMBO

In this experiment, we applied Modified Sequential Model-Based Optimization (SMBO) to fine-tune hyperparameters for the best-performing models. The results are summarized in Table 4.

The SMBO process improved model performance for datasets A, B, and C by systematically fine-tuning critical hyperparameters. For dataset A (LightGBM Regressor; Table 5), the optimal values identified were num_leaves = 50, learning_rate = 0.1, subsample = 1.0, colsample_bytree = 0.6, and n_estimators = 200. These adjustments balanced model complexity and efficiency, achieving a peak R² of 0.6168, up from the Experiment 1 baseline of 0.5594.

For datasets B and C (Extra Trees Regressor; Table 6), the tuned parameters were max_depth (None for B, 30 for C), max_features (1), min_samples_split (5 for B, 2 for C), min_samples_leaf (1), and n_estimators (500 for B, 200 for C). These refinements allowed the models to generalize effectively, achieving R² scores of 0.7588 for dataset B and 0.9204 for dataset C.

Table 7 summarizes the hyperparameters explored and their optimal values determined by SMBO for the LightGBM (dataset A) and Extra Trees (datasets B and C) models. It includes key parameters, such as num_leaves, learning_rate, max_depth, max_features, and n_estimators, clearly presenting the optimal settings for each dataset to ensure transparency and reproducibility.
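For reproducibility, the optimal settings quoted in this subsection can be written down as plain configuration dictionaries. The snippet below is only a transcription of those values; the variable and function names are hypothetical, and the small helper merely highlights where the dataset B and dataset C configurations differ.

```python
# Optimal settings as quoted in the text for each dataset.
LIGHTGBM_PARAMS_A = {
    "num_leaves": 50,
    "learning_rate": 0.1,
    "subsample": 1.0,
    "colsample_bytree": 0.6,
    "n_estimators": 200,
}
EXTRA_TREES_PARAMS = {
    "B": {"max_depth": None, "max_features": 1, "min_samples_split": 5,
          "min_samples_leaf": 1, "n_estimators": 500},
    "C": {"max_depth": 30, "max_features": 1, "min_samples_split": 2,
          "min_samples_leaf": 1, "n_estimators": 200},
}

def differing_settings(a, b):
    """Return the hyperparameters whose values differ between two configs."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}

diff_b_vs_c = differing_settings(EXTRA_TREES_PARAMS["B"], EXTRA_TREES_PARAMS["C"])
print(diff_b_vs_c)
```

Assuming the models are the LightGBM and scikit-learn implementations, whose parameter names these match, the dictionaries can be unpacked directly, for example `LGBMRegressor(**LIGHTGBM_PARAMS_A)` or `ExtraTreesRegressor(**EXTRA_TREES_PARAMS["B"])`. One caveat: in scikit-learn an integer max_features of 1 means one feature per split, whereas the float 1.0 means all features. The text does not say which is intended, so the value is kept as written.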

Overall, SMBO enhanced model complexity, generalization, and stability, ensuring reliable and accurate predictions. These results underscore the importance of hyperparameter tuning for achieving a peak performance tailored to each dataset’s unique characteristics.
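To make the idea of sequential model-based optimization concrete, the toy sketch below tunes a single hypothetical hyperparameter. It is not the modified SMBO used in this study, whose modifications are not described in this section: the surrogate (an inverse-distance-weighted mean), the exploration bonus, the candidate grid, and the quadratic loss are all simplifying assumptions.

```python
import random

def surrogate_score(x, history, kappa):
    """Lower is better: predicted loss minus an exploration bonus."""
    # Surrogate model: inverse-distance-weighted mean of the observed losses.
    weights = [1.0 / (abs(x - x_seen) + 1e-9) for x_seen, _ in history]
    predicted = sum(w * loss for w, (_, loss) in zip(weights, history)) / sum(weights)
    # Exploration bonus: grows with distance to the nearest evaluated point.
    nearest = min(abs(x - x_seen) for x_seen, _ in history)
    return predicted - kappa * nearest

def smbo_minimize(objective, candidates, n_init=3, n_iter=10, kappa=1.0, seed=0):
    """Evaluate n_init random candidates, then n_iter surrogate-chosen ones."""
    rng = random.Random(seed)
    pool = list(candidates)
    rng.shuffle(pool)
    history = []  # list of (candidate, observed loss)
    # 1) Seed the surrogate with a few randomly chosen evaluations.
    for _ in range(n_init):
        x = pool.pop()
        history.append((x, objective(x)))
    # 2) Repeatedly evaluate the candidate the surrogate finds most promising.
    for _ in range(n_iter):
        if not pool:
            break
        x = min(pool, key=lambda c: surrogate_score(c, history, kappa))
        pool.remove(x)
        history.append((x, objective(x)))
    best = min(history, key=lambda item: item[1])
    return best, history

# Toy objective standing in for a cross-validated loss (illustration only).
candidates = [i / 20 for i in range(21)]
best, history = smbo_minimize(lambda lr: (lr - 0.3) ** 2, candidates)
print(best, len(history))
```

The point of the sequential loop is that each expensive model fit informs where to evaluate next, so fewer fits are needed than with an exhaustive grid over the same candidates.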

3.5. Summary of Achievements

This multi-stage experimental approach highlights the significant impact of combining stratification, feature engineering, and hyperparameter optimization in advancing predictive modeling for oxidative-coupling-of-methane (OCM) processes. A key achievement was enhanced generalizability, as stratification effectively addressed skewness in the datasets, enabling the models to generalize better across varying data distributions. Furthermore, the introduction of the Aggregated Catalyst Physicochemical Descriptor (ACPD) approach streamlined data representation while preserving critical information, resulting in substantial performance improvements. The targeted optimization through Modified Sequential Model-Based Optimization (SMBO) played a pivotal role in fine-tuning essential parameters of the model, achieving the highest R² scores and ensuring model stability. Additionally, the integration of these advanced techniques into a comprehensive framework enabled the development of reliable and interpretable predictions, establishing a new benchmark for the research in catalysis. Overall, this study underscores the transformative potential of machine learning in catalysis, offering a robust and scalable framework for the rational design and optimization of novel catalysts.

The plots in Figure 1 visually emphasize the progressive improvements achieved across datasets and models, showcasing the success of the applied methodologies in enhancing model reliability and prediction accuracy.

3.6. Comparison with Related Work

To contextualize our findings, we compared our results with prior studies that used datasets A, B, and C, as presented in Table 8. For dataset A, the authors of [26] employed an exploitative machine learning approach, achieving an R² of 0.583. In contrast, our optimized LightGBM Regressor achieved a higher R² of 0.617, demonstrating improved scalability and accuracy through the use of Aggregated Catalyst Physicochemical Descriptors (ACPDs) and stratification. Similarly, for dataset B, the authors of [2] achieved an R² of 0.736, while our methodology, incorporating ACPDs and SMBO, outperformed it with an R² of 0.759, highlighting enhanced generalizability and robustness. Lastly, the authors of [20] achieved an R² of approximately 0.91 for dataset C using high-throughput experiments with machine learning. Our study achieved a slightly higher R² of 0.92 while offering flexibility to integrate diverse datasets, showcasing the versatility of our approach.

This analysis demonstrates the superiority of our methodology, achieving higher or comparable R² values across all datasets. The introduction of the ACPD, stratification, and SMBO enhanced scalability, generalizability, and adaptability, marking a significant advancement in predictive modeling for catalytic systems.
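The size of these margins is easier to judge when the R² values quoted in Sections 3.1, 3.4, and 3.6 are placed side by side. The snippet below only performs that arithmetic on the figures given in the text; the dictionary layout and function name are assumptions made for illustration.

```python
# R² values quoted in Sections 3.1 (baseline), 3.4 (optimized) and 3.6 (prior work).
R2 = {
    "A": {"baseline": 0.5594, "optimized": 0.6168, "prior_work": 0.583},
    "B": {"baseline": 0.7064, "optimized": 0.7588, "prior_work": 0.736},
    "C": {"baseline": 0.9129, "optimized": 0.9204, "prior_work": 0.91},
}

def gains(scores):
    """Absolute R² gain of the optimized model over two reference points."""
    return {
        "vs_baseline": round(scores["optimized"] - scores["baseline"], 4),
        "vs_prior_work": round(scores["optimized"] - scores["prior_work"], 4),
    }

summary = {name: gains(scores) for name, scores in R2.items()}
for name, row in summary.items():
    print(name, row)
```

The prior-work value for dataset C is quoted only as approximately 0.91, so the corresponding margin should be read as indicative rather than exact.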

4. Discussion

In this section, we delve into the interpretability of our predictive model using the SHAP (SHapley Additive exPlanations) analysis, focusing on the final model that combines stratification, ACPD feature aggregation, and SMBO. This final model was trained on dataset B, which was chosen for its rich diversity in elemental compositions compared to other datasets, providing a broader scope for exploration and analysis. The application of SHAP analysis provides a powerful lens for understanding the impact of individual features on the model’s predictions, enabling us to uncover the intricate relationships between variables and their contributions to catalytic performance.

4.1. Detailed Catalyst Feature Influence Unraveled by SHAP Analysis

The elucidation of our model’s decision-making process, achieved through SHAP Explainer, revealed the comprehensive anatomy of the catalyst’s influential features. The SHAP analysis meticulously dissected the influence hierarchy of the catalyst features within the model, revealing a layered structure of impact and significance. The integration of Figure 2 into our discussion offers a compelling visual narrative that pairs the average impact of features with their distribution and variability across predictions.

Figure 2 concisely conveys the average magnitude of each feature’s impact on the model output. The methane-to-oxygen partial pressure ratio, p(CH₄)/p(O₂), emerges as the most influential factor, corroborating the critical role of this ratio in the catalytic conversion processes. The pre-eminence of this ratio highlights its dual function: serving as a proxy for reactant availability and an indicator of reaction dynamics.

Adding to the catalytic narrative, the bar plot spotlights the importance of elemental additives, such as lithium (+Li) and sodium (+Na). Lithium’s substantial impact suggests its significant role in altering catalyst properties and, consequently, the reaction pathway. Similarly, sodium’s prominence indicates its potential to modulate the catalyst surface interactions, resonating with the nuanced balance of catalytic elements within the reaction milieu. Figure 3 complements this by offering a granular depiction of each feature’s contribution across individual predictions. The plot’s color coding and swarm distribution show how each feature’s value shifts individual predictions. For instance, higher temperatures consistently drive higher SHAP values, underscoring the temperature’s critical role in facilitating reaction kinetics. Furthermore, the beeswarm plot reveals the nonlinear and complex interactions between features. The plot shows how certain features, such as contact time and preparation method (Preparation_n.a.), exhibit varied impacts across different observations, hinting at underlying complexities in the catalysis process not captured by average effects alone.

The aggregation of minor feature impacts, depicted as the “Sum of 71 other features” presented in Figure 2, indicates a collective significance that rivals the top features. This aggregation underscores the multifaceted nature of the catalytic process, where numerous subtler factors collectively shape the predictive model’s output. Together, these figures not only illuminate the paramount features driving the predictive model, but also expose the breadth and depth of the feature interactions. This enhanced understanding directs us to refine the catalyst system, focusing on the optimization of key influential parameters, and offers insights into the potential synergy of combined feature effects.
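The ranking and the aggregated tail bar described above can be sketched in a few lines. This is a minimal illustration, assuming that feature importance is the mean absolute SHAP value per feature and that the tail bar is the sum of the remaining features’ mean absolute values; the function name `summarize_importance` and the small SHAP matrix are hypothetical and are not taken from the study’s data.

```python
def summarize_importance(shap_rows, feature_names, top_k):
    """Rank features by mean |SHAP| and lump the remaining ones into one entry."""
    n_samples = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_rows) / n_samples
        for j, name in enumerate(feature_names)
    }
    ranked = sorted(mean_abs.items(), key=lambda kv: kv[1], reverse=True)
    top, rest = ranked[:top_k], ranked[top_k:]
    if rest:
        label = f"Sum of {len(rest)} other features"
        top.append((label, sum(value for _, value in rest)))
    return top


# Hypothetical SHAP matrix: 3 samples x 4 features (made-up values).
feature_names = ["p(CH4)/p(O2)", "+Li", "+Na", "Temperature"]
shap_rows = [
    [2.0, -1.0, 0.5, 0.25],
    [-3.0, 1.0, -0.5, 0.25],
    [1.0, 1.0, 0.5, -0.25],
]
summary = summarize_importance(shap_rows, feature_names, top_k=2)
for label, value in summary:
    print(f"{label}: {value:.3f}")
```

Because the tail entry adds up many small contributions, it can rival the leading features even when no single tail feature is important on its own.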

This detailed analysis lays a foundation for advancing the interpretability of complex catalytic systems and demonstrates the value of machine learning as a tool for hypothesis generation in catalysis research. The understanding gained from the SHAP analysis supports further work on catalyst design, in which materials are tailored to specific reactions and operational conditions.

4.2. Stratified Insight Through Decision Plot Analysis

In this subsection, we dive deeper into the nuanced and collective influences of catalyst features across different performance strata by utilizing decision plots generated from SHAP values. Recognizing the complexity of OCM catalytic systems, these decision plots illustrate clearly how multiple catalyst parameters interact differently across high-yield (>25%), mid-yield (20–25%), and low-yield (15–20%) groups. It is essential to emphasize that the features shown in these decision plots, such as the methane-to-oxygen ratio and the presence of Li, Na, Ba, and others, are analyzed in the context of their collective contribution rather than as isolated elements. Different performance strata indeed highlight distinct combinations of influential parameters, revealing how specific catalyst formulations and reaction conditions synergistically define catalytic performance. This stratified analysis acknowledges the complex, context-dependent nature of catalytic effectiveness. It provides practical insights into how varying compositions and operational parameters must be collectively optimized within distinct yield ranges.

Specifically, yields exceeding 25% are often considered excellent and indicative of highly efficient catalyst formulations, reflecting the upper performance range frequently reported as a goal in the OCM literature [2,7,14]. Yields in the range of 20–25% represent a moderate-to-good performance, typical of many catalysts reported in experimental studies, and thus reflect a critical target range for optimization efforts. Meanwhile, yields in the range of 15–20% are indicative of a modest catalyst performance, still within the realm of practical experimental relevance, but generally signaling opportunities for further optimization or improvement. These thresholds thus provide a meaningful segmentation of catalytic performance, allowing for a more targeted analysis of feature influences and interactions across distinct performance levels.
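The segmentation above can be written as a small helper. This is only a sketch: the treatment of the exact boundaries (a yield of exactly 25% is assigned to the mid group, exactly 20% to the mid group, exactly 15% to the low group) is an assumption, as is applying the requirement that both the actual and the predicted yield fall in the same band to all three groups. The sample records are hypothetical.

```python
def yield_group(y_percent):
    """Map a C2 yield (%) to the performance stratum it belongs to."""
    if y_percent > 25:
        return "high"
    if y_percent >= 20:
        return "mid"
    if y_percent >= 15:
        return "low"
    return None  # below the ranges analysed here


def select_group(samples, group):
    """Keep samples whose actual AND predicted yields fall in the same group."""
    return [
        s for s in samples
        if yield_group(s["actual"]) == group == yield_group(s["predicted"])
    ]


# Hypothetical samples (made-up yields, in %).
samples = [
    {"id": "cat-1", "actual": 27.1, "predicted": 26.0},
    {"id": "cat-2", "actual": 25.4, "predicted": 24.2},  # straddles two groups
    {"id": "cat-3", "actual": 22.0, "predicted": 21.3},
    {"id": "cat-4", "actual": 16.5, "predicted": 17.9},
]
for name in ("high", "mid", "low"):
    print(name, [s["id"] for s in select_group(samples, name)])
```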

1. High-Yield-Group (>25%) Decision Plot: Figure 4 shows the decision plot for samples with both predicted and actual yields exceeding 25%; it highlights how multiple catalyst parameters collectively result in a superior catalytic performance. Notably, the methane-to-oxygen ratio (p(CH₄)/p(O₂)), sodium (+Na), lithium (+Li), methane partial pressure, and reaction contact time emerge as critical parameters that synergistically influence the model’s output. Rather than acting in isolation, these features function interdependently, indicating their joint roles in establishing conditions that significantly enhance catalytic efficiency. This collective feature interplay underscores the importance of carefully balanced catalyst formulations and reaction parameters tailored specifically to achieve high yields.

2. Mid-Yield-Group (20–25%) Decision Plot: Within the mid-yield group, shown in Figure 5, a complex web of interdependent features is revealed, emphasizing the nuanced interplay necessary for a moderate catalytic performance. The methane-to-oxygen ratio, lithium (+Li), sodium (+Na), methane partial pressure, temperature, and contact time notably combine to guide the model’s predictions. Unlike the straightforward interactions observed in the high-yield group, this range indicates a delicate balance, where slight variations in catalyst preparation and operational conditions markedly influence the outcomes. This highlights the need for the careful optimization of catalyst composition and process conditions to sustain a consistent performance within this intermediate yield range.

3. Lower-Yield-Group (15–20%) Decision Plot: The decision plot for the lower-yield group, presented in Figure 6, indicates a notable shift in the feature influence landscape, and the complexity of catalyst performance becomes increasingly apparent. Here, subtle shifts in multiple features (the methane-to-oxygen ratio, lithium (+Li), sodium (+Na), barium (+Ba), temperature, catalyst preparation methods, and lanthanum (+La)) collectively exert a considerable influence on yields. The decision plot shows that achieving even a moderate performance requires precise coordination among numerous parameters. This underscores the critical nature of fine-tuning catalyst composition and operational conditions, especially at lower yield levels, to enhance catalytic effectiveness.

Across all groups, the trajectory lines in the decision plots weave a rich narrative of how cumulative feature effects lead to the final prediction, with each step representing a feature’s incremental impact. The ‘High-Yield-Group’ plot’s dense and steep trajectory underscores the potent combination of high reactant availability and favorable catalytic conditions. In contrast, the more variable and less steep trajectories observed in the ‘Mid-’ and ‘Lower-Yield Groups’ reflect a more delicate balance of conditions necessary to achieve moderate yields. The vertical dispersion of the lines within the decision plots illustrates the diversity of feature interactions within similar yield ranges. It denotes that while certain features have a consistent directional influence across all samples, their magnitudes can vary, indicating differential sensitivities within the operational range. The decision plot analysis, tailored to yield-specific sample groupings, unveils a stratified model, showcasing how different catalyst features drive the model’s output at varying levels of C2 yield performance.
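The trajectory of a single sample in a decision plot is a running sum: it starts at the model’s base (expected) value and adds one feature’s SHAP value per step until it reaches the prediction. The sketch below illustrates this for one sample, assuming features are ordered by ascending absolute SHAP value so that the most influential step comes last; the base value and SHAP values are made up for illustration.

```python
from itertools import accumulate


def decision_path(base_value, shap_values, feature_names):
    """Return the ordered feature names and the cumulative path for one sample."""
    # Weakest contributions first, strongest last (assumed ordering).
    order = sorted(range(len(shap_values)), key=lambda j: abs(shap_values[j]))
    steps = [shap_values[j] for j in order]
    # The path starts at the base value and adds one SHAP value per step.
    path = list(accumulate(steps, initial=base_value))
    return [feature_names[j] for j in order], path


# Hypothetical sample: base value 12.0 (% yield) and four SHAP contributions.
names, path = decision_path(
    base_value=12.0,
    shap_values=[4.0, -1.0, 2.5, 0.5],
    feature_names=["p(CH4)/p(O2)", "+Na", "+Li", "Temperature"],
)
print(names)
print(path)
```

The last point of the path equals the base value plus the sum of all SHAP values, which is the model’s prediction for that sample; a steep trajectory therefore corresponds to a few large steps in the same direction.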

4.3. Deciphering Feature Impacts on Catalytic Efficiency Through SHAP Values

In this subsection, we delve into the interaction between key catalyst features and their SHAP values to elucidate their influence on the prediction of C2 yields. By examining scatter plots, which juxtapose feature values against SHAP values, the relationships that define the model’s predictive behavior are explained.
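Reading an optimum off such a scatter plot amounts to asking where the SHAP values are highest along the feature axis. One simple way to do this, shown here as an assumption rather than as the procedure used in this study, is to bin the feature, average the SHAP values in each bin, and report the bin with the highest mean. The function name `shap_peak`, the bin width, and the data points are hypothetical.

```python
from collections import defaultdict


def shap_peak(feature_values, shap_values, bin_width):
    """Return (bin_center, mean_shap) for the bin with the highest mean SHAP."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, s in zip(feature_values, shap_values):
        b = int(x // bin_width)  # index of the bin that contains x
        sums[b] += s
        counts[b] += 1
    means = {b: sums[b] / counts[b] for b in sums}
    best = max(means, key=means.get)
    return ((best + 0.5) * bin_width, means[best])


# Hypothetical temperature (K) versus SHAP value pairs.
temperatures = [900, 950, 1000, 1040, 1060, 1100]
shap_for_temperature = [-1.0, -0.5, 0.5, 1.5, 0.8, 0.2]
peak = shap_peak(temperatures, shap_for_temperature, bin_width=50)
print(peak)
```

The result depends on the chosen bin width and on how densely each region is sampled, so a peak found this way is a starting point for experimental checks rather than a precise optimum.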

4.3.1. Impact of Methane-to-Oxygen Ratio on Predictive Accuracy

As presented in Figure 7, the scatter plot analysis of the methane-to-oxygen partial pressure ratio p(CH₄)/p(O₂) reveals a compelling, inverse relationship with SHAP values, painting a complex picture of its role in catalytic efficiency. As the p(CH₄)/p(O₂) ratio increases, we initially observe a significant decrease in SHAP values, indicating a robust negative impact on the predicted yields. Notably, this downward trend in SHAP values sharply reverses at a ratio close to 5, suggesting a pivotal threshold for catalytic behavior. The SHAP value reaches its zenith at a ratio of approximately 1.3, which could represent an optimal ratio for catalytic efficiency, before declining as the ratio continues to rise. Such a finding is instrumental, as it suggests that there exists a specific operational window wherein the reactant balance is most conducive to yield optimization. These insights could lead to critical operational adjustments in OCM processes, particularly in fine-tuning the gas feed composition. Pinpointing this nonlinear optimal region (~1.3–5) and the clear threshold at a ratio of ~5, where catalytic contributions shift significantly, enables more precise resource utilization, potentially leading to significant enhancements in catalytic performance and yield.

4.3.2. Lithium’s Role in Catalyst Performance

As presented in Figure 8, the scatter plot for lithium’s presence in the catalyst matrix indicates a mostly positive correlation with SHAP values, suggesting an increase in the predicted C2 yield up to a lithium concentration of about 17%. Beyond this point, further increases in lithium concentration appear to diminish its positive impact, as evidenced by a decrease in SHAP values. Lithium’s influence on the predicted catalytic performance thus peaks at a concentration of around 17%, after which its efficacy in enhancing the yield predictions declines. This nonlinear behavior points to an optimal lithium content of about 17%, beyond which the performance gains diminish, and so provides a precise target for catalyst formulation and yield optimization strategies.

4.3.3. Temperature as a Determinant of Catalytic Activity

Figure 9 presents a scatter plot for temperature, which indicates that as the temperature rises, there is a corresponding increase in SHAP values, suggesting a favorable impact on the model’s yield predictions up to approximately 990 K. Past this temperature, the SHAP values show increased variability, which may reflect a complex interplay of thermal effects on the catalytic process. The highest SHAP value is observed around 1050 K, pinpointing a specific temperature at which the catalyst’s predicted performance is maximized. This temperature dependence highlights the importance of maintaining an optimal temperature range for achieving the best possible yield in catalytic reactions. Recognizing this nonlinear relationship, with a distinct operational threshold at about 990 K beyond which catalytic predictions strongly improve, underscores the critical importance of precise thermal control for optimizing catalytic efficiency.

4.3.4. Sodium’s Contribution to Yield Predictions

As shown in Figure 10, the +Na scatter plot reveals that increases in sodium content are initially associated with negative impacts on the model’s yield predictions. This pattern of declining SHAP values suggests that higher concentrations of sodium may impede the catalytic efficiency. However, a noticeable plateau in the SHAP values occurs beyond a certain level of sodium presence, indicating a saturation point where additional sodium no longer significantly affects yields. Intriguingly, the highest SHAP value for sodium is reached at a concentration of around 5%, highlighting this as the point where sodium’s contribution to catalytic activity is maximized. These data illustrate the nuanced role of sodium in catalysis, where it may have beneficial effects at lower concentrations but deleterious effects when overrepresented, emphasizing the importance of balancing sodium levels within the catalyst formulation. Identifying this nonlinear trend and the optimal threshold (about 5% sodium), beyond which additional sodium yields diminishing returns, facilitates targeted adjustments in catalyst composition to maximize efficiency.

4.3.5. Influence of Lanthanum on Model Predictions

The scatter plot for lanthanum’s impact on catalytic performance, as shown in Figure 11, indicates that moderate levels of lanthanum correspond with positive SHAP values, suggesting an enhancement in yield predictions up to a certain point. As the concentration of lanthanum increases, the magnitude of its beneficial impact begins to wane, implying that there is a threshold of effectiveness. This diminishing return on SHAP values at higher lanthanum concentrations implies that while lanthanum can act as a catalyst activator, its effectiveness plateaus at a concentration beyond which it may become detrimental to the yield. The highest SHAP value is observed at a lanthanum concentration of 10%, pinpointing this level as the potential sweet spot for its catalytic contribution. Recognizing this nonlinear behavior, with an optimal threshold near 10% lanthanum beyond which additional lanthanum does not substantially enhance the performance, provides valuable insights into strategic catalyst optimization.

To address Research Question 3, we consider how SHAP value analysis elucidates the impact of catalyst components and operational parameters on C2 yield predictions within oxidative-coupling-of-methane (OCM) reactions. This comprehensive SHAP analysis decodes the contributions and interactions of crucial features, enhancing model interpretability and guiding catalytic optimization.

Our consolidated SHAP value analysis delves into the significant roles of specific features, like the methane-to-oxygen ratio p(CH₄)/p(O₂), lithium (+Li), and reaction temperature, among others, in influencing the model’s predictive outcomes. This analysis uncovered several critical insights:

The methane-to-oxygen ratio emerges as a pivotal factor, where an optimal range is essential for maximizing catalytic efficiency. An increase in this ratio boosts the model’s yield predictions up to a certain threshold, beyond which further increases lower the predicted yield, marking a delicate balance in operational parameters.
Lithium content within the catalyst matrix demonstrates a positive correlation with SHAP values up to a specific concentration, suggesting an optimal lithium level that facilitates catalytic activity without leading to diminishing returns.
The influence of reaction temperature on SHAP values reveals an optimal operational temperature that significantly enhances yield predictions, highlighting temperature’s critical role in catalysis.
Through the lens of SHAP scatter plots and decision plots, a nuanced understanding of nonlinear relationships and interaction effects among features is achieved. For example:
Sodium (Na) and lanthanum (La) content exhibit complex, nonlinear relationships with yield predictions, with their positive impacts plateauing at certain concentrations. This indicates that both too little and too much of either element can lower the predicted yield.
The decision plots provide a stratified view of how different feature levels impact yield predictions across various ranges, offering a granular perspective that aids in identifying precise operational and compositional optimizations for enhanced catalytic performance.

These insights from the SHAP analysis significantly contribute to our understanding of the multifaceted nature of catalyst behavior. They reveal not only the key drivers of yield, but also the intricate interactions between various catalyst components and operational conditions. This enhanced model transparency, facilitated by SHAP analysis, opens new avenues for data-driven catalyst design and operational optimizations, crucial for the progression of OCM technologies. In summary, the SHAP analyses offer a profound validation of our model’s predictive capacity and furnish a detailed map of feature influences. This not only bridges the gap between theoretical predictions and practical catalysis applications, but also sets a benchmark for future catalyst development and reaction condition optimization efforts.

4.4. Practical Catalyst Design Recommendations and Experimental Optimization

Based on the detailed SHAP analyses presented, explicit recommendations for catalyst optimization and experimental guidance are clearly identified:

Methane-to-Oxygen Partial Pressure Ratio (p(CH₄)/p(O₂)): Optimal catalyst performance is achieved within a methane-to-oxygen ratio window of approximately 1.3–5. Operationally, keeping the ratio below the threshold of ~5 is critical, as higher ratios significantly reduce catalytic efficiency. Experimental optimization should therefore prioritize this operational window to enhance the yield.
Lithium Concentration (+Li): An optimal lithium concentration is identified at approximately 17%, beyond which catalytic benefits diminish substantially. Catalyst designs should target this concentration experimentally, with precise compositional adjustments to maximize efficiency and resource use.
Reaction Temperature: Optimal catalytic conditions are identified at reaction temperatures in the range of 990–1050 K. Experimental strategies should focus on maintaining this thermal window, as deviations from it significantly reduce the performance.
Sodium Concentration (+Na): The optimal sodium concentration is identified at around 5%, with diminishing returns beyond this threshold. Experimental catalyst design should therefore optimize the sodium content carefully around this value.
Lanthanum Concentration (+La): The optimal lanthanum content occurs at around 10%, beyond which additional lanthanum contributes minimal or negative returns. Catalyst formulations should therefore target this lanthanum content for an enhanced performance.

These explicitly identified optimal conditions and thresholds derived directly from SHAP insights offer valuable practical guidance. Future experimental studies should leverage these specific recommendations to strategically design catalysts and optimize operational conditions, thereby clearly maximizing the catalytic performance and resource efficiency.
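As an illustration of how these recommendations could be used when planning experiments, the sketch below encodes them as numeric windows and flags any planned condition that falls outside its window. The ratio and temperature windows are the ranges quoted above; for lithium, sodium, and lanthanum only a point optimum is given, so the ±2 percentage-point tolerance around each is purely an assumption, as are the dictionary keys and the function name.

```python
# Hypothetical half-width (percentage points) around the point optima.
POINT_TOLERANCE = 2.0

RECOMMENDED = {
    "p(CH4)/p(O2)": (1.3, 5.0),          # ratio window quoted above
    "temperature_K": (990.0, 1050.0),    # temperature window quoted above
    "+Li": (17.0 - POINT_TOLERANCE, 17.0 + POINT_TOLERANCE),
    "+Na": (5.0 - POINT_TOLERANCE, 5.0 + POINT_TOLERANCE),
    "+La": (10.0 - POINT_TOLERANCE, 10.0 + POINT_TOLERANCE),
}


def out_of_window(conditions, windows=RECOMMENDED):
    """Return {name: (value, low, high)} for each condition outside its window."""
    report = {}
    for name, (low, high) in windows.items():
        if name in conditions and not (low <= conditions[name] <= high):
            report[name] = (conditions[name], low, high)
    return report


# Hypothetical planned experiment.
planned = {"p(CH4)/p(O2)": 6.0, "temperature_K": 1023.0, "+Li": 17.0}
print(out_of_window(planned))
```

Such a check treats each parameter independently; it does not capture the interaction effects discussed in this section, which still require experimental validation.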

Overall, the SHAP-based feature interpretations reinforce the complexity inherent in OCM catalysis, clearly illustrating how catalytic efficiency emerges from the synergistic interactions among multiple chemical elements and operational conditions. These interactions extend beyond simple additive effects, often involving nonlinear and context-dependent relationships that necessitate meticulous experimental validation. The insights obtained here provide critical guidance for future studies aimed at experimentally verifying and refining these predictive insights, ultimately enhancing the rational design of advanced catalyst materials optimized for industrial OCM processes.

This research marks significant progress in predictive modeling of catalyst performance in the oxidative-coupling-of-methane (OCM) reaction, but two main limitations should be noted. First, we did not validate the approach on reactions that compete with OCM or on other types of catalytic reactions and systems. The ACPD shows promise for scalability, but it will require additional adjustment for processes with different types of data or different reaction chemistries. Second, interpretability comes at a computational cost. SHAP analysis provides excellent transparency into the model’s predictions, but it demands substantial computing power; for extensive datasets with numerous features it becomes resource-intensive and slow. This poses a substantial difficulty for applying these models on an industrial scale.

Data quality remains the main limitation on the model’s accuracy, although careful optimization and well-planned feature selection improve its performance. The datasets, drawn from published research studies, contain ample information but may introduce inconsistencies because experimental methods differ between papers, and publication bias may also be present. To reach industrial application levels, the framework needs to be extended to more reaction types, validated against additional experimental data, and equipped with fast, scalable methods for model interpretation.

5. Materials and Methods

The pursuit of optimizing catalytic processes, particularly in enhancing C2 yields through the oxidative coupling of methane (OCM), requires an integrated and scalable approach that combines advanced machine learning techniques, innovative feature engineering, and robust interpretability frameworks. Our methodology was structured to address this challenge holistically, encompassing data preprocessing, creative representation of catalyst compositions, rigorous model evaluation, hyperparameter tuning, and interpretability analysis.

As illustrated in Figure 12, the foundation of our approach is based on the development of the Aggregated Catalyst Physicochemical Descriptor (ACPD). This scalable and robust feature engineering technique captures the nuanced contributions of catalyst components. This innovation ensures that the representation of catalyst compositions remains comprehensive yet computationally efficient, enabling effective downstream modeling.

Building on this foundation, we evaluated a diverse array of regression models to identify the most predictive and reliable approaches. The evaluation was coupled with Modified Sequential Model-Based Optimization (SMBO), an advanced hyperparameter tuning strategy, to ensure optimal model performance. To provide a reliable and unbiased assessment, we employed the Cross-Validation with Stratified Regression (CVSR) technique, which accounts for variations in the dataset while maintaining representative splits. Model interpretability is a cornerstone of our methodology. Using frameworks such as SHAP, we delved into the Catalyst Anatomy, uncovering the intricate interplay between catalyst components and operational conditions. This interpretability analysis not only validated model predictions, but also provided deep insights into the underlying mechanisms influencing C2 yields. This comprehensive methodology bridges the gap between predictive accuracy and mechanistic understanding, paving the way for the rational design and optimization of catalysts in OCM processes. By integrating data-driven techniques with interpretability frameworks, we aimed to elucidate the complex dynamics of catalytic systems and advance the science of catalyst innovation.
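The idea of keeping splits representative for a continuous, skewed target can be sketched as follows. This is not the CVSR procedure itself, whose details are not given in this section; it is one simple stratification scheme, assumed here for illustration, in which samples are sorted by yield and dealt round-robin into the folds so that every fold spans the full range of the target. The function name and the yield values are hypothetical.

```python
def stratified_regression_folds(y, k):
    """Split sample indices into k folds that each span the range of y."""
    # Sort indices by target value, then deal them out like cards.
    order = sorted(range(len(y)), key=lambda i: y[i])
    folds = [[] for _ in range(k)]
    for rank, idx in enumerate(order):
        folds[rank % k].append(idx)
    return folds


# Hypothetical, right-skewed C2 yields (%): most are low, a few are high.
yields = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 8.0, 15.0, 30.0]
folds = stratified_regression_folds(yields, k=2)
for test_idx in folds:
    train_idx = [i for i in range(len(yields)) if i not in test_idx]
    print("test:", [yields[i] for i in test_idx], "train size:", len(train_idx))
```

With a plain random split, the few high-yield samples could all land in one fold; dealing the sorted samples across folds avoids this and gives each fold a similar target distribution.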

5.1. Dataset

Our study was based on a large reference source [26] that first compiled more than 1870 papers published over three decades of OCM research. This database contains a wide variety of catalyst compositions comprising 68 different catalytic components (61 cations and 7 anions), which serve as active-site promoters and support materials and provide a range of catalytic capabilities. To minimize data variation caused by the variable stoichiometry of elements in the OCM reaction, oxygen was excluded from these elements. Additional structuring of the dataset was carried out by Schmack et al. [23], who subdivided the elements based on their functions in the catalyst compositions. This provided a systematic approach to analyzing the effect of the elemental composition of catalysts on the yield and selectivity of the OCM process, establishing a 68-dimensional feature space for predictive purposes. The primary focus of this dataset was the C2 yield (YC2), a crucial measure of the catalyst’s efficiency in OCM reactions. YC2 was recorded alongside the critical reaction parameters, namely catalyst composition and reaction conditions (temperature, CH₄ and O₂ partial pressures, total pressure, and contact time), and additional performance metrics: O₂ conversion, CH₄ conversion, and COx, ethane, and ethylene selectivity. This allowed a rich analysis of the relationship between the catalytic properties and the resulting C2 yield. This study was based on an extensive dataset derived from the pivotal research on the oxidative coupling of methane (OCM), as well as three additional datasets. These datasets collectively enhanced the scope and robustness of the analysis.

Catalyst Composition Insight Dataset (Dataset A)

The dataset presented in [26] is the largest and forms the foundational basis of our study. It comprises over 1870 papers spanning three decades, documenting 68 different catalytic components, as shown in Figure 13a (61 cations and 7 anions), utilized as active site promoters and supports. The dataset focuses on critical reaction parameters, including methane and oxygen partial pressures, reaction temperature, total pressure, and selectivity metrics for C2 yield. While comprehensive, this dataset is literature-dependent, emphasizing the interplay between catalyst composition and performance.

Literature-Enriched Machine Learning Dataset (Dataset B)

Dataset B [2] expands upon the literature-based findings in dataset A. It incorporates 4759 experimental data points and emphasizes 74 catalytic elements, as shown in Figure 13b, using machine learning methods to extrapolate trends and identify potential novel catalysts. This dataset is partially dependent on dataset A, as it draws upon overlapping literature sources but also integrates additional experimental insights up to 2019. Its unique value lies in its ability to leverage machine learning to predict trends in catalytic performance.

High-Throughput Experimentation Dataset (Dataset C)

This dataset [23] focuses exclusively on high-throughput experimentation. It includes detailed experimental data comprising 12,708 data points for 59 catalysts across 3 successive operations, encompassing 27 distinct catalytic elements, as shown in Figure 13c. Metrics such as methane conversion rates, C2 selectivity, and operational conditions (temperature, pressure) are systematically documented. Dataset C’s independence from datasets A and B ensures that it provides experimental validation for trends identified in the literature-based datasets, enriching the robustness of the overall analysis.

5.2. Dataset Preprocessing

The datasets collectively provided a comprehensive range of variables, including catalyst composition (cations, anions, and support materials), reaction conditions (temperature, pressure, and methane and oxygen partial pressures), and performance metrics, such as C2 yield (Y(C2), %), CH₄ conversion, and selectivity for ethane and ethylene. Preprocessing involved pivoting the catalyst compositions and support materials so that molar percentages were aligned with the corresponding material identifiers. Each catalyst was represented as a row and each element as a column, with non-contributing elements filled with zero. The aggregation of composition data ensured compatibility and facilitated analysis. Additionally, synthesis methodologies were captured using binary encoding for preparation methods. The target variable, Y(C2), was kept as a continuous percentage so that fine distinctions in yield were preserved. This captures the fine details in yield outcomes, enabling the model to account for small fluctuations that are crucial for determining the ultimate performance of catalysts, and ensures that the predictions reflect the nuanced behavior of the catalytic systems. These preprocessing steps ensured standardized and consistent data, enabling robust analysis and the optimization of predictive modeling frameworks for OCM processes.
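The layout described above (one row per catalyst, one column per element with zero fill, binary columns for the preparation method, and a continuous yield target) can be sketched as follows. The record structure, the function name, and all numeric values are hypothetical; the column names follow the `+Li` and `Preparation_n.a.` style of the feature labels used in the SHAP figures.

```python
def build_feature_rows(records, prep_methods):
    """Pivot catalyst records into rows with one column per element."""
    elements = sorted({el for r in records for el in r["composition"]})
    rows = []
    for r in records:
        # Elements absent from a catalyst are filled with zero.
        row = {f"+{el}": r["composition"].get(el, 0.0) for el in elements}
        # Binary (one-hot) encoding of the preparation method.
        for method in prep_methods:
            row[f"Preparation_{method}"] = 1 if r["preparation"] == method else 0
        row["Y(C2), %"] = r["yield"]  # continuous target, kept as a percentage
        rows.append(row)
    return rows


# Hypothetical records (compositions in mol %, yields in %).
records = [
    {"composition": {"Li": 10.0, "Mg": 90.0},
     "preparation": "impregnation", "yield": 18.5},
    {"composition": {"Na": 5.0, "Mn": 2.0, "W": 3.0, "Si": 90.0},
     "preparation": "n.a.", "yield": 24.0},
]
rows = build_feature_rows(records, prep_methods=["impregnation", "n.a."])
for row in rows:
    print(row)
```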

5.3. Aggregated Catalyst Physicochemical Descriptor (ACPD)

In the study by Mine et al. [2], the authors aimed to identify novel catalysts for the oxidative coupling of methane (OCM) by integrating physicochemical descriptors of elements with existing datasets. They employed an extrapolative machine learning approach, where each catalyst was represented by its constituent elements’ descriptors. Specifically, for a catalyst composed of elements $E_{1}, E_{2}, \dots, E_{n}$ with corresponding weight percentages $ω_{1}, ω_{2}, \dots, ω_{n}$, and each element characterized by descriptors $D_{1}, D_{2}, \dots, D_{m}$, features were created by multiplying each element’s weight percentage by its descriptor value. This resulted in features such as $E1\_D1 = ω_{1} \times D_{1}(E_{1})$, $E1\_D2 = ω_{1} \times D_{2}(E_{1})$, and so forth for each element and descriptor. Consequently, the number of features scaled with both the number of components and the number of descriptors. For instance, with 5 elements and 20 descriptors, this method would generate 100 features, each representing the contribution of an element’s descriptor to the catalyst’s characteristics.
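The per-element scheme described above can be illustrated with a short sketch. The function name and the example catalyst are hypothetical, and only two descriptors (atomic number and standard atomic mass) are used for brevity; the descriptor set of the original study is not reproduced here.

```python
# Two illustrative descriptors per element: atomic number and atomic mass.
DESCRIPTORS = {
    "Li": {"atomic_number": 3, "atomic_mass": 6.94},
    "Mg": {"atomic_number": 12, "atomic_mass": 24.305},
}


def per_element_features(composition, descriptors):
    """Build one feature per (element, descriptor) pair: w_i * D_j(E_i)."""
    features = {}
    for i, (element, weight) in enumerate(composition.items(), start=1):
        for j, value in enumerate(descriptors[element].values(), start=1):
            features[f"E{i}_D{j}"] = weight * value
    return features


# Hypothetical catalyst: 10 wt% Li and 90 wt% Mg.
features = per_element_features({"Li": 10.0, "Mg": 90.0}, DESCRIPTORS)
print(features)
print(len(features))  # number of elements x number of descriptors
```

The feature count is the product of the number of elements and the number of descriptors, so it grows whenever a catalyst with more components is added to the dataset.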

The ACPD simplifies the representation of catalysts for machine learning, particularly in catalytic chemistry. Typically, each element in a catalyst is characterized by several physical and chemical properties, resulting in a vast and complex dataset. This complexity often causes models to overfit: they perform well on known data but struggle with new, unfamiliar examples. The ACPD addresses this by combining those detailed properties into a smaller, more manageable set of values, taking a weighted average of each property based on the proportion of each element present in the catalyst. This describes the catalyst as a whole rather than each element separately. As a result, the data become easier to work with, and the models built on them are usually simpler, more accurate, and easier to interpret. In practice, the ACPD calculates each aggregated value by multiplying each element’s property by its weight percentage, summing the products, and dividing by the total weight percentage. Altogether, the ACPD strikes a balance between detail and efficiency.

While the per-element representation of Mine et al. [2] provides detailed insights into individual element contributions, it is not scalable. The number of features grows with the number of elements involved, leading to a high-dimensional feature space that can complicate the model and potentially cause overfitting. To address this scalability issue, our updated methodology introduced an aggregated approach that represents physicochemical descriptors at the catalyst level. This method reduces dimensionality, enhances scalability, and retains essential information based on the following formulas:

Aggregated Descriptor Calculation: For a catalyst composed of elements $E_{1}, E_{2}, \dots, E_{n}$ with weight percentages $ω_{1}, ω_{2}, \dots, ω_{n}$, and each element characterized by descriptors $D_{1}, D_{2}, \dots, D_{m}$, the aggregated descriptor ${\bar{D}}_{j}$ for descriptor $D_{j}$ is calculated as:

${\bar{D}}_{j} = \frac{\sum_{i = 1}^{n} ω_{i} \times D_{j} (E_{i})}{\sum_{i = 1}^{n} ω_{i}}$

(1)

where:
- $n$ is the number of elements in the catalyst.
- $ω_{i}$ is the weight percentage of element $E_{i}$.
- $D_{j} (E_{i})$ is the value of descriptor $D_{j}$ for element $E_{i}$.

This equation ensures that each aggregated descriptor, ${\bar{D}}_{j}$, represents a weighted average of the contributions of all constituent elements in the catalyst, normalized by their total weight percentages.
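Equation (1) can be checked with a short worked example. The sketch below assumes two illustrative descriptors (atomic number and standard atomic mass) and hypothetical compositions; the function name `acpd` is introduced here for illustration only.

```python
# Two illustrative descriptors per element: atomic number and atomic mass.
DESCRIPTORS = {
    "Li": {"atomic_number": 3, "atomic_mass": 6.94},
    "Na": {"atomic_number": 11, "atomic_mass": 22.990},
    "Mg": {"atomic_number": 12, "atomic_mass": 24.305},
}


def acpd(composition, descriptors):
    """Weighted average of each descriptor over the catalyst's elements."""
    total = sum(composition.values())  # denominator of Equation (1)
    names = descriptors[next(iter(composition))].keys()
    return {
        name: sum(w * descriptors[el][name] for el, w in composition.items()) / total
        for name in names
    }


# Hypothetical catalysts (weight percentages).
two_elements = acpd({"Li": 10.0, "Mg": 90.0}, DESCRIPTORS)
three_elements = acpd({"Li": 5.0, "Na": 5.0, "Mg": 90.0}, DESCRIPTORS)
print(two_elements)
print(three_elements)
```

For the two-element catalyst, the aggregated atomic number is (10 × 3 + 90 × 12) / 100 = 11.1. Both catalysts yield the same number of features, one per descriptor, regardless of how many elements they contain.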

Categorical Descriptors Representation: For categorical descriptors, such as ‘group’ and ‘period’ in the periodic table, one-hot encoding is applied. Each element’s presence in the catalyst is indicated by setting the corresponding group and period features to 1, while non-contributing elements are set to 0. This encoding effectively captures categorical properties without unnecessarily increasing dimensionality.
Final Feature Set: The final feature set for each catalyst includes:
- Aggregated physicochemical descriptors $\bar{D}_{1}, \bar{D}_{2}, \dots, \bar{D}_{m}$.
- One-hot encoded categorical features for ‘group’ and ‘period’.
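The categorical encoding described above can be sketched as follows. The lookup table `ELEMENT_POSITION`, the feature names `group_<g>` and `period_<p>`, and the function `encode_positions` are assumptions made for illustration; the source does not specify how the columns are named or which elements are covered.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup of (group, period) for a few elements.
ELEMENT_POSITION: Dict[str, Tuple[int, int]] = {
    "Li": (1, 2), "Na": (1, 3), "Mg": (2, 3), "Mn": (7, 4), "W": (6, 6),
}

GROUPS = range(1, 19)   # periodic-table groups 1..18
PERIODS = range(1, 8)   # periodic-table periods 1..7


def encode_positions(elements: List[str]) -> Dict[str, int]:
    """Set a group/period feature to 1 if any element of the catalyst has it."""
    present_groups = {ELEMENT_POSITION[el][0] for el in elements}
    present_periods = {ELEMENT_POSITION[el][1] for el in elements}
    features = {f"group_{g}": int(g in present_groups) for g in GROUPS}
    features.update({f"period_{p}": int(p in present_periods) for p in PERIODS})
    return features


# Assumed example: a catalyst containing Mn, Na, and W.
encoded = encode_positions(["Mn", "Na", "W"])
```

The encoded vector always has 18 + 7 = 25 entries, so its length does not depend on how many elements the catalyst contains.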

This approach offers several key innovations and advantages:

Scalability: By reducing the number of features to a fixed set of aggregated descriptors, the method is independent of the number of elements in the catalyst, ensuring scalability for large datasets.
Simplicity: The representation of catalyst compositions is simplified, making it easier to interpret and analyze.
Robustness: Aggregating descriptors ensures that the feature space captures essential information without being overly sensitive to specific element inclusion or exclusion.
Preservation of Nuances: The weighted averaging process retains fine details of individual element contributions, enabling accurate and insightful predictive modeling.

By implementing this aggregated approach, we achieved a scalable and robust representation of catalyst compositions that enhanced the predictive power of machine learning models while maintaining interpretability and reducing the risk of overfitting.

5.4. Different Model Evaluations

To predict the C2 yield (Y (C2), %) of oxidative-coupling-of-methane reactions accurately, we evaluated a range of regression models, each with its own characteristics and assumptions about the underlying data, in order to identify the model or models with the highest predictive accuracy and reliability. The models evaluated are as follows:

Random Forest Regressor

Random Forest is an ensemble learning method that builds multiple decision trees during training and combines their outputs to improve predictive accuracy. It excels in capturing nonlinear relationships and is robust against overfitting, making it suitable for datasets with high-dimensional features and complex interactions.

XGBoost Regressor

XGBoost is an efficient implementation of gradient-boosted decision trees designed for speed and performance. It employs regularization techniques to enhance generalization and is particularly effective in handling imbalanced data, making it a powerful choice for predictive tasks with intricate feature interactions.

LightGBM Regressor

LightGBM is a gradient-boosting framework that uses tree-based learning algorithms. Known for its efficiency and scalability, it handles large datasets with minimal memory consumption and supports advanced features, such as leaf-wise tree growth, for improved accuracy.

Extra Trees Regressor

Extra Trees (Extremely Randomized Trees) is an ensemble method similar to Random Forest that introduces additional randomness by drawing candidate split thresholds at random during tree construction instead of searching for the optimal threshold. This characteristic helps reduce overfitting and increases model diversity, making it a robust alternative for complex datasets.

Gradient Boosting Regressor

Gradient Boosting builds models sequentially, where each new model corrects the errors of its predecessor. It is effective in optimizing prediction accuracy and is well-suited for datasets with nonlinear relationships and interactions among features.

These algorithms were evaluated using a 10-fold cross-validation strategy to ensure robustness and minimize bias. The results of this evaluation are presented in the subsequent sections, highlighting the models’ predictive capabilities and suitability for analyzing OCM catalyst performance.

5.5. Cross-Validation with the Stratified Regression (CVSR) Technique

Building on the foundation laid in the initial experiments, we introduced the Cross-Validation with Stratified Regression (CVSR) technique to refine our best-performing model. This novel approach integrates the concept of stratification with regression tasks, ensuring a balanced representation of the target variable’s distribution across each fold in cross-validation.

CVSR operates on the principle of stratified sampling, where the dataset $D$ is divided into $k$ mutually exclusive folds, with each fold maintaining a similar statistical distribution of the target variable, $Y$. Let $D_{i}$ represent the $i^{th}$ fold, such that $D = \cup_{i = 1}^{k} D_{i}$ and $D_{i} \cap D_{j} = \emptyset$ for $i \neq j$.

For a regression model, M, the CVSR process is formalized as follows:

Stratification: For each fold, D_i, ensure that the distribution of Y within D_i closely mirrors the overall distribution of Y in D.
Training and Validation: For each iteration, i, train M on $D \setminus D_{i}$ (all folds except D_i) and validate on D_i. This produces a prediction set P_i for D_i.
Aggregation: Aggregate the prediction sets {P₁, P₂, …, P_k} to form the comprehensive prediction set P for D, which is then used to assess the model’s performance using metrics such as R², MSE, and RMSE.

This technique ensures each validation fold is a stratified sample of the entire dataset, which is particularly beneficial for datasets with non-uniform distributions of the target variable. By preserving the target distribution in each fold, CVSR enhances the reliability and generalizability of the model’s performance metrics.
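The three CVSR steps can be sketched in plain Python. The source does not state how the target distribution is matched across folds, so this sketch assumes one simple option: sort the samples by target value and deal them to the folds in turn. The function names, the `fit`/`predict` callables, and the toy data are hypothetical.

```python
from typing import Callable, List, Sequence


def stratified_regression_folds(y: Sequence[float], k: int) -> List[List[int]]:
    """Split sample indices into k disjoint folds with similar target distributions.

    Assumption: samples are ranked by target value and assigned round-robin,
    so every fold receives values from across the whole range of y.
    """
    if not 2 <= k <= len(y):
        raise ValueError("k must be between 2 and the number of samples")
    order = sorted(range(len(y)), key=lambda i: y[i])
    folds: List[List[int]] = [[] for _ in range(k)]
    for rank, index in enumerate(order):
        folds[rank % k].append(index)
    return folds


def cvsr_predictions(X: Sequence, y: Sequence[float], k: int,
                     fit: Callable, predict: Callable) -> List[float]:
    """Train on D minus D_i, predict on D_i, and collect predictions for all of D."""
    predictions = [0.0] * len(y)
    for held_out in stratified_regression_folds(y, k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = predict(model, X[i])
    return predictions


# Toy, right-skewed target and a trivial "predict the training mean" model.
y_toy = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 12.0, 25.0]
X_toy = [[value] for value in y_toy]
folds = stratified_regression_folds(y_toy, k=5)
preds = cvsr_predictions(X_toy, y_toy, 5,
                         fit=lambda X_train, y_train: sum(y_train) / len(y_train),
                         predict=lambda model, x: model)
```

With this toy target, each fold pairs one low value with one high value, whereas an unlucky random split could place both of the largest yields in the same fold. The aggregated `preds` list is what the performance metrics would then be computed on.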

5.6. Modified Sequential Model-Based Optimization (SMBO) for Hyperparameter Tuning

This study employed a Modified Sequential Model-Based Optimization (SMBO) approach to optimize the hyperparameters of the best-performing models from Experiment 3: the LightGBM Regressor for dataset A and the Extra Trees Regressor for datasets B and C. Unlike traditional SMBO, our method optimized each parameter sequentially while keeping the others fixed at their best-known values. This sequential strategy enabled us to evaluate the contribution of each parameter to model performance, striking a balance between efficiency and the ability to explore distinct parameter configurations. Let $θ_{i}$ denote a parameter to optimize (e.g., max_depth, n_estimators, etc.). Define the function $f (θ_{i} | θ_{- i})$, where $θ_{- i}$ represents all other parameters held at their current optimal values. This function, $f$, computes the model’s R² score for each trial value of $θ_{i}$ while keeping $θ_{- i}$ fixed.

Objective Function: Our objective function is expressed as:

$\arg \max_{θ_{i}} E [f (θ_{i} | θ_{- i})]$

(2)

By iteratively maximizing $E [f (θ_{i} | θ_{- i})]$ for each parameter, $θ_{i}$, we sequentially converge toward a configuration that achieves the best model performance.

Sequential Update Process: Starting with default parameters $θ_{\mathrm{default}}$, let:

$θ^{(0)} = θ_{\mathrm{default}}$

(3)

For each parameter, $θ_{i}$, we perform the following update at iteration k:

$θ^{(k)} = \left( θ_{- i}^{(k - 1)}, \; \arg \max_{θ_{i}} f (θ_{i} | θ_{- i}^{(k - 1)}) \right)$

(4)

where k represents each iteration in which a single parameter is optimized, updating the parameter set toward the best configuration.
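The sequential update can be sketched as a one-parameter-at-a-time search. This is a minimal illustration under stated assumptions: the function `sequential_search`, the candidate grids, and in particular `toy_score` (a made-up stand-in for the cross-validated R² that $f$ would return) are hypothetical and not taken from the study.

```python
from typing import Any, Callable, Dict, List, Tuple


def sequential_search(defaults: Dict[str, Any],
                      grid: Dict[str, List[Any]],
                      score: Callable[[Dict[str, Any]], float]
                      ) -> Tuple[Dict[str, Any], List[Tuple[str, Any, float]]]:
    """Optimize one parameter per iteration, holding the others at their best values."""
    best = dict(defaults)          # theta^(0) = theta_default
    best_score = score(best)
    history: List[Tuple[str, Any, float]] = []
    for name, candidates in grid.items():   # one iteration k per parameter
        for value in candidates:
            trial = {**best, name: value}   # vary theta_i, keep theta_-i fixed
            trial_score = score(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
        history.append((name, best[name], best_score))
    return best, history


def toy_score(params: Dict[str, Any]) -> float:
    """Hypothetical stand-in for a cross-validated R^2 score."""
    return (0.9
            - 0.001 * abs(params["n_estimators"] - 200) / 100
            - 0.01 * abs(params["min_samples_split"] - 5))


best_params, history = sequential_search(
    defaults={"n_estimators": 100, "min_samples_split": 2},
    grid={"min_samples_split": [2, 5, 10], "n_estimators": [50, 100, 200, 300]},
    score=toy_score,
)
```

Each entry of `history` records the parameter tuned at that step, the value kept, and the best score so far, which matches the one-row-per-parameter layout of Tables 5 and 6. The cost grows with the sum of the grid sizes rather than their product, at the price of possibly missing interactions between parameters.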

5.7. Performance Metrics

The performance of each model was evaluated using three key metrics: the coefficient of determination (R²), the Mean Squared Error (MSE), and the Root Mean Squared Error (RMSE). These metrics were chosen to provide a holistic view of the models’ performance:

R² (Coefficient of Determination): Indicates the proportion of variance in the dependent variable that can be predicted from the independent variables. A higher R² signifies a better model performance.
MSE (Mean Squared Error): Measures the average squared difference between estimated values and actual values. A lower MSE indicates a better fit.
RMSE (Root Mean Squared Error): The square root of the MSE, which expresses the error in the same units as the target variable (here, percentage points of C2 yield). A lower RMSE suggests a more accurate model.
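For reference, the three metrics can be computed directly from their definitions. The sketch below uses made-up yield values; the function names and the example data are illustrative assumptions only.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up C2 yields (%) and predictions, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
scores = {"R2": r2(actual, predicted),
          "MSE": mse(actual, predicted),
          "RMSE": rmse(actual, predicted)}
```

The relation between the last two metrics can be checked against the reported results: for the LightGBM Regressor on dataset A in Table 1, the square root of the MSE of 19.45 is approximately 4.41, the listed RMSE.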

The results of this evaluation are discussed in the subsequent sections, highlighting the models that demonstrated superior performance across the relevant metrics. This evaluation process guides the selection of the most suitable model(s) for predicting the C2 yield in OCM reactions, thus contributing to the advancement of catalysis research.

5.8. Catalyst Anatomy via SHAP Explainer

To understand, and not just predict, the dynamics that drive the C2 yield in oxidative-coupling-of-methane (OCM) reactions, our research examined the ‘anatomy’ of the catalyst using the SHapley Additive exPlanations (SHAP) Explainer. This approach facilitated a deeper exploration into how each feature contributed to the model’s predictions, thereby unveiling the complex interplay between catalyst components and their impact on the target variable. SHAP values, grounded in game theory, offer a robust framework for interpreting the predictions of machine learning models. By attributing the model’s output to its input features, SHAP values provide a detailed explanation of the contribution of each feature to individual predictions, offering a granular view of the model’s behavior. This methodology is particularly valuable in catalysis research, where understanding the specific role of each catalyst component can lead to significant insights and breakthroughs.
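The game-theoretic idea behind SHAP values can be illustrated with a brute-force Shapley computation on a toy model. This is not how the SHAP library computes values for tree ensembles, and it is not the model used in this study: the function `shapley_values`, the toy predictor, the feature names, and the single all-zero baseline are assumptions chosen to keep the example small.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, Set


def shapley_values(predict: Callable[[Dict[str, float]], float],
                   instance: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values, with absent features replaced by baseline values."""
    names = list(instance)
    n = len(names)

    def value(coalition: Set[str]) -> float:
        # Features in the coalition take the instance value, the rest the baseline.
        x = {f: (instance[f] if f in coalition else baseline[f]) for f in names}
        return predict(x)

    phi: Dict[str, float] = {}
    for feature in names:
        others = [g for g in names if g != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {feature}) - value(members))
        phi[feature] = total
    return phi


def toy_model(x: Dict[str, float]) -> float:
    """Hypothetical predictor: one additive term plus one interaction term."""
    return 2.0 * x["temperature"] + x["ratio"] * x["li"]


sample = {"temperature": 1.0, "ratio": 2.0, "li": 3.0}
reference = {"temperature": 0.0, "ratio": 0.0, "li": 0.0}
phi = shapley_values(toy_model, sample, reference)
```

In this toy case the additive term is credited entirely to `temperature`, the interaction term is shared equally between `ratio` and `li`, and the contributions sum to the difference between the prediction for the sample and the prediction for the baseline. That additivity is what allows a prediction to be decomposed feature by feature starting from a base value.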

5.8.1. Summarizing Feature Effects

The SHAP Explainer was employed to summarize the effects of all features on the target variable. This was achieved by aggregating the SHAP values for each feature across the dataset, thereby providing a global perspective on the importance and impact of each feature. The summary plot generated by SHAP offered a visual representation of feature importance ranked by the magnitude of their impact on the model’s output, alongside the distribution of the effects each feature had on the model’s predictions.

5.8.2. Catalyst Feature Contributions

To delve into the specifics of how individual features contributed to particular predictions, we utilized SHAP’s waterfall plots on different samples. These plots offer a step-by-step decomposition of the prediction, starting from the base value (the model’s average prediction across the dataset) and sequentially adding the effect of each feature. By analyzing waterfall plots for various records, we were able to identify the unique contribution patterns of features in specific instances, providing an insight into the variability of feature contributions across different catalyst compositions and operational conditions.

5.8.3. Relationship Exploration

Focusing on the most crucial feature identified through the global summary, we further explored its relationship with the target variable using SHAP’s dependence plot. This plot illustrates how the SHAP value (impact on model output) of a given feature varies with the feature’s actual value, highlighting potential nonlinear relationships or interactions with other features. By applying the dependence plot to the most influential feature, we gained a nuanced understanding of how this key catalyst component influences the C2 yield, potentially uncovering non-intuitive relationships and interaction effects that contributed to the catalyst’s performance. The application of SHAP Explainer to our model offered profound insights into the ‘anatomy’ of the catalyst, illuminating the specific roles and relative importance of various catalyst components and operational parameters. This analytical depth extends beyond mere prediction, providing actionable intelligence that guides the design and optimization of catalysts for improved performance in OCM reactions. Through this meticulous exploration, we not only enhanced the transparency and interpretability of our predictive model but also contributed to a deeper understanding of catalytic processes, laying the groundwork for future innovations in the field.

6. Conclusions

This research journey into predictive modeling for oxidative-coupling-of-methane (OCM) processes provided valuable insights into the complexities of catalytic systems. By integrating advanced machine learning techniques with robust analytical frameworks, our study not only enhanced predictive accuracy, but also uncovered the intricate dynamics underlying catalytic reactions. Central to our methodology was the development of Aggregated Catalyst Physicochemical Descriptors (ACPDs), which streamlined feature representation, and the use of stratified regression, which improved model generalizability. These innovations addressed key challenges in handling complex and diverse datasets, enabling the creation of highly accurate and scalable predictive models. Through hyperparameter optimization using Modified Sequential Model-Based Optimization (SMBO), we further refined the models, achieving exceptional performances across all datasets. Notably, dataset B, with its diverse elemental composition, highlighted the robustness of our approach, achieving superior R² values compared to prior studies. This success underscores the adaptability of our methodology to varied data contexts, making it a versatile tool for catalysis research. Complementing these advancements, SHAP (SHapley Additive exPlanations) analysis provided a detailed interpretive framework to dissect the contributions of individual features to model predictions. By leveraging visual tools, such as scatter plots, decision plots, and feature importance analyses, we bridged the gap between computational predictions and the fundamental principles of catalysis. This analysis illuminated how catalyst components and operational variables interact to influence performance, offering actionable insights for experimental validation and catalyst design.

In conclusion, this study represents more than an incremental refinement of predictive models. It signifies a step forward in the integration of machine learning with catalytic science, fostering a synergistic relationship between computational tools and experimental chemistry. By enhancing the understanding of catalytic processes and improving predictive accuracy, our research lays a foundation for achieving higher efficiencies and sustainability in chemical manufacturing. This convergence of data-driven approaches with fundamental scientific inquiry sets the stage for future innovations in catalyst development and process optimization, driving the field of catalysis toward a more informed and iterative scientific paradigm.

Author Contributions

Data curation, M.E., H.A., A.S.A. and H.M.A.H.; formal analysis, A.M.M. and H.M.A.H.; investigation, M.E. and H.M.A.H.; supervision, B.A., A.M.M., M.E. and R.M.K.M.; writing—original draft, M.E., B.A., A.S.A., A.M.M. and H.M.A.H.; writing—review and editing, M.E., A.M.M. and R.M.K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Deanship of Graduate Studies and Scientific Research at Jouf University under Grant No. DGSSR-2023-02-02454.

Data Availability Statement

The datasets utilized in this study are publicly accessible and clearly referenced in the manuscript. Specifically, they are available through the following original publications: https://github.com/mts-uw/OCM/tree/master/data (accessed on 15 February 2025); https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cctc.201100186 (accessed on 15 February 2025); https://cads.eng.hokudai.ac.jp/datamanagement/datasources/21010bbe-0a5c-4d12-a5fa-84eea540e4be/ (accessed on 15 February 2025).

Acknowledgments

The authors extend their appreciation to the Deanship of Graduate Studies and Scientific Research at Jouf University for funding this work under Grant No. DGSSR-2023-02-02454.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Nishimura, S.; Ohyama, J.; Kinoshita, T.; Dinh Le, S.; Takahashi, K. Revisiting Machine Learning Predictions for Oxidative Coupling of Methane (OCM) based on Literature Data. ChemCatChem 2020, 12, 5888–5892.
2. Mine, S.; Takao, M.; Yamaguchi, T.; Toyao, T.; Maeno, Z.; Hakim Siddiki, S.M.A.; Takakusagi, S.; Shimizu, K.; Takigawa, I. Analysis of Updated Literature Data up to 2019 on the Oxidative Coupling of Methane Using an Extrapolative Machine-Learning Method to Identify Novel Catalysts. ChemCatChem 2021, 13, 3636–3655.
3. Nishimura, S.; Li, X.; Ohyama, J.; Takahashi, K. Leveraging machine learning engineering to uncover insights into heterogeneous catalyst design for oxidative coupling of methane. Catal. Sci. Technol. 2023, 13, 4646–4655.
4. Ugwu, L.I.; Morgan, Y.; Ibrahim, H. Increasing Ethene Yield via Oxidative Coupling of Methane at Low Temperature: An Application of Machine Learning and DFT in the Design and Innovation of Effective Catalyst Compositions. In Proceedings of the 2023 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), Regina, SK, Canada, 24–27 September 2023.
5. Alturkistani, S.; Wang, H.; Gautam, R.; Sarathy, S.M. Importance of Process Variables and Their Optimization for Oxidative Coupling of Methane (OCM). ACS Omega 2023, 8, 21223–21236.
6. Kiatsaengthong, D.; Jaroenpanon, K.; Somchuea, P.; Chukeaw, T.; Chareonpanich, M.; Faungnawakij, K.; Sohn, H.; Rupprechter, G.; Seubsai, A. Effects of Mg, Ca, Sr, and Ba Dopants on the Performance of La₂O₃ Catalysts for the Oxidative Coupling of Methane. ACS Omega 2022, 7, 1785–1793.
7. Nishimura, S.; Le, S.D.; Miyazato, I.; Fujima, J.; Taniike, T.; Ohyama, J.; Takahashi, K. High-throughput screening and literature data-driven machine learning-assisted investigation of multi-component La₂O₃-based catalysts for the oxidative coupling of methane. Catal. Sci. Technol. 2022, 12, 2766–2774.
8. Ishioka, S.; Miyazato, I.; Takahashi, L.; Nguyen, T.N.; Taniike, T.; Takahashi, K. Unveiling gas-phase oxidative coupling of methane via data analysis. J. Comput. Chem. 2021, 42, 1447–1451.
9. Sugiyama, K.; Nguyen, T.N.; Nakanowatari, S.; Miyazato, I.; Taniike, T.; Takahashi, K. Direct Design of Catalysts in Oxidative Coupling of Methane via High-Throughput Experiment and Deep Learning. ChemCatChem 2021, 13, 952–957.
10. Li, X.; Xie, J.; Rao, H.; Wang, C.; Tang, J. Platinum- and CuO-Decorated TiO₂ Photocatalyst for Oxidative Coupling of Methane to C₂ Hydrocarbons in a Flow Reactor. Angew. Chem. Int. Ed. 2020, 59, 19702–19707.
11. Miyazato, I.; Nishimura, S.; Takahashi, L.; Ohyama, J.; Takahashi, K. Data-Driven Identification of the Reaction Network in Oxidative Coupling of the Methane Reaction via Experimental Data. J. Phys. Chem. Lett. 2020, 11, 787–795.
12. Vandewalle, L.A.; Van de Vijver, R.; Van Geem, K.M.; Marin, G.B. The role of mass and heat transfer in the design of novel reactors for oxidative coupling of methane. Chem. Eng. Sci. 2019, 198, 268–289.
13. Matsumoto, T.; Saito, M.; Ishikawa, S.; Fujii, K.; Yashima, M.; Ueda, W.; Motohashi, T. High Catalytic Activity of Crystalline Lithium Calcium Silicate for Oxidative Coupling of Methane Originated from Crystallographic Joint Effects of Multiple Cations. ChemCatChem 2020, 12, 1968–1972.
14. Nakanowatari, S.; Nguyen, T.N.; Chikuma, H.; Fujiwara, A.; Seenivasan, K.; Thakur, A.; Takahashi, L.; Takahashi, K.; Taniike, T. Extraction of Catalyst Design Heuristics from Random Catalyst Dataset and their Utilization in Catalyst Development for Oxidative Coupling of Methane. ChemCatChem 2021, 13, 3262–3269.
15. Si, J.; Zhao, G.; Sun, W.; Liu, J.; Guan, C.; Yang, Y.; Shi, X.R.; Lu, Y. Oxidative Coupling of Methane: Examining the Inactivity of the MnO-Na₂WO₄/SiO₂ Catalyst at Low Temperature. Angew. Chem. Int. Ed. 2022, 61, e202117201.
16. Goldsmith, B.; Esterhuizen, J.; Bartel, C.; Sutton, C.; Liu, J.X. Machine Learning for Heterogeneous Catalyst Design and Discovery. AIChE J. 2018, 64, 2311–2323.
17. Ugwu, L.; Morgan, Y.; Ibrahim, H. Enhancing Ethene Production through Low-Temperature Oxidative Coupling of Methane: Leveraging DFT and Data Analysis for Crafting Innovative and Efficient Catalyst Compositions. Ind. Eng. Chem. Res. 2023, 62, 19658–19673.
18. Sutthiumporn, K.; Kawi, S. Promotional effect of alkaline earth over Ni–La₂O₃ catalyst for CO₂ reforming of CH₄: Role of surface oxygen species on H₂ production and carbon suppression. Int. J. Hydrogen Energy 2011, 36, 14435–14446.
19. Takahashi, K.; Miyazato, I.; Nishimura, S.; Ohyama, J. Unveiling Hidden Catalysts for the Oxidative Coupling of Methane based on Combining Machine Learning with Literature Data. ChemCatChem 2018, 10, 3223–3228.
20. Nguyen, T.N.; Nhat, T.T.P.; Takimoto, K.; Thakur, A.; Nishimura, S.; Ohyama, J.; Miyazato, I.; Takahashi, L.; Fujima, J.; Takahashi, K.; et al. High-Throughput Experimentation and Catalyst Informatics for Oxidative Coupling of Methane. ACS Catal. 2020, 10, 921–932.
21. Nguyen, T.N.; Nakanowatari, S.; Nhat Tran, T.P.; Thakur, A.; Takahashi, L.; Takahashi, K.; Taniike, T. Learning Catalyst Design Based on Bias-Free Data Set for Oxidative Coupling of Methane. ACS Catal. 2021, 11, 1797–1809.
22. Pirro, L.; Mendes, P.S.F.; Vandegehuchte, B.D.; Marin, G.B.; Thybaut, J.W. Catalyst screening for the oxidative coupling of methane: From isothermal to adiabatic operation via microkinetic simulations. React. Chem. Eng. 2020, 5, 584–596.
23. Schmack, R.; Friedrich, A.; Kondratenko, E.V.; Polte, J.; Werwatz, A.; Kraehnert, R. A meta-analysis of catalytic literature data reveals property-performance correlations for the OCM reaction. Nat. Commun. 2019, 10, 441.
24. Pirro, L.; Mendes, P.S.F.; Paret, S.; Vandegehuchte, B.D.; Marin, G.B.; Thybaut, J.W. Descriptor–property relationships in heterogeneous catalysis: Exploiting synergies between statistics and fundamental kinetic modelling. Catal. Sci. Technol. 2019, 9, 3109–3125.
25. Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 4768–4777.
26. Zavyalova, U.; Holena, M.; Schlögl, R.; Baerns, M. Statistical Analysis of Past Catalytic Data on Oxidative Methane Coupling for New Insights into the Composition of High-Performance Catalysts. ChemCatChem 2011, 3, 1935–1947.

Figure 1. Best model prediction against actual C2 yields.

Figure 2. SHAP bar plot.

Figure 3. SHAP beeswarm plot.

Figure 4. Decision plot for high-yield group (>25%).

Figure 5. Decision plot for mid-yield group (20–25%).

Figure 6. Decision plot for lower-yield group (15–20%).

Figure 7. Scatter plot analysis of the methane-to-oxygen partial pressure.

Figure 8. Scatter plot analysis of lithium, +Li.

Figure 9. Temperature K scatter plot.

Figure 10. Scatter plot analysis of sodium, +Na.

Figure 11. Scatter plot analysis of lanthanum, +La.

Figure 12. Proposed framework of the Aggregated Catalyst Physicochemical Descriptor (ACPD).

Figure 13. Most-contributed elements in the three datasets. (a) Contributed elements of dataset A. (b) Contributed elements of dataset B. (c) Contributed elements of dataset C.

Table 1. Results of baseline model evaluation.

Model	Dataset	R²	MSE	RMSE
LightGBM Regressor	A	0.5594	19.45	4.41
Random Forest Regressor	A	0.5575	19.54	4.42
XGBoost Regressor	A	0.5501	19.86	4.46
Extra Trees Regressor	A	0.5360	20.49	4.53
Gradient Boosting Regressor	A	0.4625	23.73	4.87
Extra Trees Regressor	B	0.7064	13.85	3.72
Random Forest Regressor	B	0.7034	13.99	3.74
XGBoost Regressor	B	0.6731	15.41	3.93
LightGBM Regressor	B	0.6682	15.65	3.96
Gradient Boosting Regressor	B	0.4974	23.70	4.87
Extra Trees Regressor	C	0.9129	1.09	1.05
Random Forest Regressor	C	0.8984	1.27	1.13
XGBoost Regressor	C	0.8538	1.83	1.35
LightGBM Regressor	C	0.8401	2.01	1.42
Gradient Boosting Regressor	C	0.6236	4.72	2.17

Table 2. Stratification with CVSR results.

Model	Dataset	R²	MSE	RMSE
LightGBM Regressor	A	0.5639	19.25	4.39
Extra Trees Regressor	B	0.7220	13.11	3.62
Extra Trees Regressor	C	0.9133	1.09	1.04

Table 3. Feature aggregation with ACPD results.

Model	Dataset	R²	MSE	RMSE
LightGBM Regressor	A	0.6047	17.45	4.18
Extra Trees Regressor	B	0.7578	11.42	3.38
Extra Trees Regressor	C	0.9196	1.01	1.00

Table 4. Optimization with SMBO results.

Model	Dataset	R²	MSE	RMSE
Extra Trees Regressor	B	0.7588	11.38	3.37
Extra Trees Regressor	C	0.9204	1.00	1.00
LightGBM Regressor	A	0.6168	16.92	4.11

Table 5. Dataset A: overall results.

Seq	Parameter	Explored Values	Optimal Value	Best Score (R²)
1	num_leaves	31, 15, 50, 100	50	0.6055
2	learning_rate	0.1, 0.01, 0.05, 0.2	0.1	0.6055
3	subsample	1.0, 0.8, 0.6	1.0	0.6055
4	colsample_bytree	1.0, 0.8, 0.6	0.6	0.6134
5	n_estimators	100, 200, 300, 500, 700, 1000	200	0.6168

Table 6. Datasets B and C: overall results.

Seq	Parameter	Explored Values	Optimal Value (B)	Best Score (R²) (B)	Optimal Value (C)	Best Score (R²) (C)
1	max_depth	None, 10, 20, 30	None	0.7578	30	0.9203
2	max_features	1, sqrt, log2	1	0.7578	1	0.9203
3	min_samples_split	2, 5, 10	5	0.7581	2	0.9203
4	min_samples_leaf	1, 2, 4	1	0.7581	1	0.9203
5	n_estimators	50, 100, 200, 300, 400, 500	500	0.7588	200	0.9204

Table 7. Hyperparameters explored and optimal values identified by SMBO.

Dataset	Model	Hyperparameter	Explored Values	Optimal Value
A	LightGBM Regressor	num_leaves	[15, 31, 50, 100]	50
		learning_rate	[0.01, 0.05, 0.1, 0.2]	0.1
		subsample	[0.6, 0.8, 1.0]	1.0
		colsample_bytree	[0.6, 0.8, 1.0]	0.6
		n_estimators	[100, 200, 300, 500, 700, 1000]	200
B	Extra Trees Regressor	max_depth	[None, 10, 20, 30]	None
		max_features	[1, sqrt, log2]	1
		min_samples_split	[2, 5, 10]	5
		min_samples_leaf	[1, 2, 4]	1
		n_estimators	[50, 100, 200, 300, 400, 500]	500
C	Extra Trees Regressor	max_depth	[None, 10, 20, 30]	30
		max_features	[1, sqrt, log2]	1
		min_samples_split	[2, 5, 10]	2
		min_samples_leaf	[1, 2, 4]	1
		n_estimators	[50, 100, 200, 300, 400, 500]	200

Table 8. Comparative analysis table for datasets A, B, and C.

Dataset	Ref	Key Methodology	Best R²	Advancements in Our Study
A	[26]	Exploitative ML	61.7%	ACPD, stratification, improved scalability
B	[2]	Exploitative ML	75.9%	ACPD, SMBO, enhanced generalizability
C	[20]	High-throughput ML	92.0%	Stratified regression, flexible and adaptable approach
Meta-analysis OCM dataset	[7]	High-throughput ML	89.3%	Multi-component catalyst optimization, extended dataset
Updated OCM Literature Dataset	[14]	Deep Learning and ML	80.5%	Heuristic-driven catalyst design, model fine-tuning
Bias-Free OCM Catalyst Dataset	[17]	DFT and Data-Driven ML	88.1%	Advanced feature selection, improved reaction modeling
Machine Learning Engineering OCM Dataset	[21]	Bias-Free ML Approach	91.8%	Dataset expansion, robust validation

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
