1. Introduction
The exploration of innovative methods to convert methane—the most fundamental alkane and a key component of natural gas—into higher-value chemicals remains a critical area in contemporary chemical research. Among these methods, the oxidative coupling of methane (OCM) emerges as a promising strategy, offering a direct route to ethylene, a vital precursor in the petrochemical industry. However, the OCM process is not without its challenges, such as limited selectivity, modest conversion rates, and the requirement for elevated operational temperatures. These hurdles underscore the pressing need for the development of superior catalysts and optimized reaction conditions to improve the efficiency and commercial viability of OCM processes.
In this study, we aim to address these complexities by leveraging machine learning (ML) to gain deeper insights into catalytic systems and to optimize C2 yields. The following critical research questions guide our investigation:
Model Selection and Stratified Validation: How does the selection of an optimal machine learning model, combined with Cross-Validation using Stratified Regression, enhance our ability to predict C2 yields? Additionally, what insights can this provide into the target variable’s behavior across diverse catalyst compositions and conditions?
Feature Aggregation and Hyperparameter Optimization: How does the Aggregation of Physicochemical Descriptors (ACPDs) streamline feature representation, and how does hyperparameter tuning with a framework such as Modified Sequential Model-Based Optimization (SMBO) improve the predictive model’s performance? Furthermore, which features and hyperparameters are identified as the most influential, and how do they affect the model’s interpretability and accuracy?
SHAP Analysis for Interpretability: How does SHAP (SHapley Additive exPlanations) analysis provide comprehensive insights into the contributions of individual features to the model’s predictions? What can this analysis reveal about the complex interactions and nonlinear relationships among features, and how can these findings inform catalyst design and operational optimization?
These research questions form the foundation of a systematic approach that spans from model selection and validation to feature engineering and in-depth interpretative analysis. By integrating state-of-the-art machine learning techniques with rigorous evaluation and interpretability frameworks, this study aims not only to enhance the predictive accuracy of C2 yield models, but also to provide actionable insights into the rational design of catalysts and the optimization of OCM processes. In seeking answers to these questions, this research builds upon the transformative advancements in machine learning (ML) applications for catalysis research, which have unlocked novel insights into data patterns often inaccessible through traditional experimentation. Inspired by the predictive power demonstrated in [1,2], this study aims to elucidate the intricate interplays within catalyst systems to identify groundbreaking catalysts that can revolutionize the OCM process. The insights drawn from these works underscore the ability of ML to address the inherent complexities of catalytic systems, paving the way for enhanced performance and efficiency.
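The stratified validation raised in the first research question can be illustrated with a minimal sketch. It assumes that stratification for a continuous target means sorting the samples by C2 yield, cutting them into quantile bins, and spreading each bin evenly over the folds. The function name, the bin count, and the yield values are hypothetical and are not taken from this study.

```python
import random


def stratified_regression_folds(y, n_folds=5, n_bins=4, seed=0):
    """Assign a fold id to every sample so each fold spans the target range."""
    if n_folds < 2 or n_bins < 1:
        raise ValueError("need n_folds >= 2 and n_bins >= 1")
    rng = random.Random(seed)
    # Sample indices sorted by target value (assumed here to be C2 yield).
    order = sorted(range(len(y)), key=lambda i: y[i])
    # Ceiling division gives contiguous, roughly equal-sized quantile bins.
    bin_size = max(1, -(-len(y) // n_bins))
    folds = [0] * len(y)
    next_fold = 0
    for start in range(0, len(order), bin_size):
        stratum = order[start:start + bin_size]
        rng.shuffle(stratum)  # random assignment inside a stratum
        for index in stratum:
            # Round-robin across strata keeps the fold sizes balanced.
            folds[index] = next_fold % n_folds
            next_fold += 1
    return folds


# Hypothetical C2 yields (%), for illustration only.
yields = [2.1, 5.4, 8.0, 11.3, 14.9, 16.2, 18.7, 21.5, 24.0, 26.8, 28.1, 30.5]
folds = stratified_regression_folds(yields, n_folds=3, n_bins=4)
for k in range(3):
    print(k, sorted(yields[i] for i in range(len(yields)) if folds[i] == k))
```

With 12 samples, 4 bins, and 3 folds, each fold receives one sample from every quarter of the yield range, so no validation fold is dominated by low-yield or high-yield catalysts.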
Additionally, the methodologies in this research are informed by contemporary studies that highlight ML’s role in heterogeneous catalyst design, such as the approaches detailed in [3], and the integration of ML with Density Functional Theory (DFT) calculations to predict catalytic performance, as explored in [4]. These examples demonstrate the capability of ML to unravel complex interactions and provide actionable insights that are integral to advancing the OCM process. Furthermore, kinetic modeling and high-throughput experimental techniques have significantly contributed to the understanding of OCM, as evidenced by recent research efforts. For example, the optimization of process variables and their effects on OCM efficiency, as discussed in [5], offers a foundational perspective for refining operational conditions. Similarly, the work presented in [6], which explores the impact of Mg, Ca, Sr, and Ba dopants on La2O3 catalysts, provides valuable insights into how compositional changes influence catalytic performance. These advancements underscore the critical role of empirical evaluations in complementing data-driven predictions to refine the discovery of catalysts and optimize processes.
The oxidative coupling of methane (OCM) is fundamentally complex due to the delicate balance required to selectively activate methane molecules for the formation of valuable ethylene (C2) while minimizing over-oxidation to undesired byproducts, such as carbon dioxide (CO2) and carbon monoxide (CO). Catalyst effectiveness heavily relies on the intricate interplay between catalyst composition, surface properties, and reaction conditions. The elevated temperatures typically required for OCM accelerate both desired and undesired pathways, highlighting the importance of precise control over reaction parameters. Moreover, achieving high selectivity toward ethylene remains a challenge due to competing side reactions, catalyst deactivation, and thermal instability. Thus, advanced catalytic systems informed by detailed insights into catalyst structure–performance relationships and reaction dynamics are crucial for the practical and commercially viable operation of OCM.
By synthesizing these diverse research strands, this study aims to develop a comprehensive framework that bridges machine learning predictions, kinetic modeling, and experimental data. This interdisciplinary approach not only enhances the predictive power of ML models, but also provides actionable insights to inform the design and empirical validation of novel catalysts. In doing so, it aspires to transform the OCM process from a promising theoretical concept into a practical, commercially viable technology.
2. Related Work
This paper enhances the oxidative-coupling-of-methane (OCM) process using machine learning (ML). The goal is to increase the production of ethylene, a valuable chemical. This research focuses on developing a predictive model to optimize catalyst composition and reaction conditions by analyzing various factors influencing C2 yields. By combining machine learning with kinetic modeling and experimental data, this study aims to enhance the overall efficiency and commercial viability of the OCM process. The authors of [1] evaluated machine learning (ML) predictions for OCM under experimental conditions, discovering previously unreported catalyst combinations and assessing the challenges in achieving higher C2 yields through a literature-data-driven ML approach. This study emphasized the need for improved ML prediction accuracy through better classification and the trend analysis of the experimental data. An updated dataset of 4759 experimental data points was constructed and analyzed using several ML methods. This study highlighted the exploration of Mn/Na2WO4/SiO2 catalyst systems and identified promising candidates for catalytic systems through machine learning (ML) models [2]. As presented in [3], the authors assessed ML-assisted catalyst investigations for OCM using published datasets and systematic high-throughput screening to uncover insights into heterogeneous catalyst design. Furthermore, the authors of [4] utilized ML models and DFT calculations to predict catalytic performance and propose novel bimetallic combinations for effective methane conversion at low temperatures.
As presented in [5], the authors employed high-throughput screening (HTS) and statistical design of experiments (DoE) to investigate the impact of operating parameters on OCM, aiming to optimize the process and enhance the yield and selectivity of C2 products. The authors of [6] investigated lanthanum oxide catalysts enhanced with alkaline–earth metal oxides for OCM, identifying La-Sr and La-Ba as promising catalysts. In addition, the authors of [7] applied multi-component La2O3-based catalysts for OCM designed using HTS and literature datasets with ML approaches. This study aimed at improving catalyst design through data-driven methods. As presented in [8], catalyst informatics are combined with high-throughput experimental data to understand the OCM reaction. Machine learning was employed to bridge the gaps between the experimental data points, facilitating a detailed understanding of the reaction. The authors of [9] combined deep learning with high-throughput experimental methods to identify and evaluate active catalysts for OCM, discovering highly active, previously unreported catalysts that mark a milestone in the application of AI in catalysis.
In addition, the authors of [10] reported the use of Pt nanoparticles and CuOx clusters on TiO2 for OCM in a flow reactor at room temperature, achieving a high yield and selectivity of C2 products, while the authors of [11] utilized experimental data alongside data science techniques to map out the OCM reaction network, establishing a connection between experimental conditions and selectivity, thereby enriching the strategic approach to catalyst and process optimization. The impact of mass and heat transfer was addressed in a reactor design for OCM, emphasizing the importance of understanding these effects for the development of efficient catalytic processes as proposed in [12]. The authors of [13] also reported the high catalytic activity of Li2CaSiO4 for OCM, comparing its performance with traditional catalysts and suggesting a crystallographic design concept for OCM catalysts. Furthermore, the authors of [14] analyzed a diverse catalyst dataset from high-throughput experiments, extracting valuable design heuristics for developing catalysts with balanced activity and selectivity, illustrating the power of data analysis in guiding catalyst development strategies, especially highlighting a mixed support approach between La2O3 and BaO as a fruitful direction for future catalyst development. Delving into the limitations of the MnOx-Na2WO4/SiO2 system for OCM at low temperatures, Si et al. [15] not only identified the challenges but also proposed a variant of the catalyst that is active at lower temperatures. This study sheds light on the intricate relationship between catalyst composition and operational temperature, providing valuable insights into overcoming performance bottlenecks.
The field of oxidative coupling of methane (OCM) has seen significant advancements through the integration of machine learning (ML) techniques for catalyst discovery, optimization, and the understanding of reaction mechanisms. The contributions and findings of recent pivotal studies are presented, providing a comprehensive overview that illustrates the progression and current state of machine learning (ML) applications in the OCM research. As presented in [1], the authors undertook a literature-data-driven machine learning (ML) exploration to predict OCM outcomes under various experimental conditions. Their analysis uncovered novel catalyst combinations and pinpointed obstacles to achieving higher C2 yields. A crucial takeaway was the imperative to enhance ML prediction capabilities through more refined data classification and trend analysis, aiming for a more precise identification of promising catalysts. Furthermore, the authors of [2] constructed an extensive dataset comprising 4759 experimental data points, scrutinized through various ML methodologies. It underscored the potential of Mn/Na2WO4/SiO2 catalyst systems, revealing new catalyst candidates for OCM, thereby demonstrating the power of extrapolative ML methods in catalyst exploration. By employing published datasets and systematic high-throughput screening, Nishimura et al. [16] advanced the understanding of a heterogeneous catalyst design for OCM. This approach facilitated the discovery of insightful correlations and design principles, demonstrating the utility of machine learning in streamlining catalyst evaluation processes. The collaborative effort led by Nishimura et al. [7] focused on the data-driven design of La2O3-based catalysts, utilizing HTS and ML to navigate the complex landscape of catalyst optimization, highlighting the efficacy of integrating computational and experimental strategies. The authors of [8] combined catalyst informatics with high-throughput experimental data to elucidate the gas-phase OCM mechanism. ML filled the gaps in the experimental datasets, providing a detailed reaction landscape and fostering a deeper understanding of OCM dynamics. In addition, the authors of [17] integrated ML models and DFT calculations to predict catalytic performance, proposing novel bimetallic catalysts for efficient methane conversion at lower temperatures. This study highlighted the synergistic potential of computational chemistry and ML in catalysis.
As presented in [10], the authors reported on Pt nanoparticles and CuOx clusters on TiO2 for OCM, achieving high yields and selectivities for C2 products at room temperature in a flow reactor, thereby showcasing the potential of photocatalysis in OCM. In contrast, the authors of [18] explored the effects of Mg, Ca, Sr, and Ba dopants on La2O3 catalysts, identifying La-Sr and La-Ba combinations as particularly promising, offering new avenues for catalyst enhancement. The authors of [19] employed data science and machine learning (ML) to investigate and identify suitable catalysts for the oxidative coupling of methane (OCM). Their review of 1868 OCM catalysts from the literature data seeks to identify essential descriptors that affect the C2 yield of the OCM reaction. The trained machine learning model predicts 56 new catalysts that, according to first-principles calculations, activate CH4, CH3, and O2 while achieving C2 yields exceeding 30%. The research demonstrates that machine learning has the potential to enhance the discovery and design of catalysts for OCM and similar chemical processes. The authors of [1] conducted an evaluation of machine learning (ML) predictions for the oxidative coupling of methane (OCM) by testing 96 catalysts obtained from the literature and conducting verification trials. The research identifies new catalyst pairs, but faces difficulties in achieving C2 yields beyond 30% due to changes in the reactor system. The study demonstrates that data classification techniques in high dimensions enable ML predictions to outperform basic interpolation methods in accuracy.
As presented in [20], a high-throughput screening system was developed for oxidative coupling of methane (OCM), generating a dataset of 12,708 data points for 59 catalysts. It highlights the importance of optimizing both catalyst and reactor design to improve C2 yield and demonstrates that a consistent dataset enables accurate yield predictions using ML. As presented in [21], the authors explored the limitations of a combinatorial catalyst design through the random sampling of 300 quaternary solid catalysts from 36,540 possible candidates to evaluate their performance in the oxidative coupling of methane (OCM) using high-throughput screening. The research demonstrates that synergistic element pairings are crucial to catalyst design, and these pairings can be grouped according to the periodic table’s organization. Decision tree classification serves as a method to enhance the selection of catalysts that produce higher C2 yields. In addition, the authors of [22] evaluated the importance of isothermal data when selecting oxidative-coupling-of-methane (OCM) catalysts that operate in adiabatic reactors. Under oxygen-lean conditions, where CH4/O2 ratios exceed 7, the catalyst performance can be more accurately estimated because lower surface oxygen levels result in higher C2+ selectivity. The research provides experimental recommendations and data analysis findings that will aid future benchmarking efforts in industrial settings. The authors of [23] introduced a meta-analysis procedure to discover catalyst performance–property relationships through data synthesis from the literature and textbooks combined with statistical methods. The approach demonstrates that optimal catalysts for the oxidative coupling of methane require stable carbonate structures in conjunction with stable oxide support materials for successful operation. The authors of [24] introduced an optimization method that combines kinetic simulations with statistical analysis to develop heterogeneous catalysts. The research demonstrates that the electronic properties of surface oxygen species play a crucial role in determining catalyst performance during the oxidative coupling of methane (OCM). Furthermore, the authors of [25] developed a unified method for interpreting the predictions of intricate machine learning models. SHAP assigns each feature an importance value for a specific prediction and, in doing so, identifies a new class of additive feature importance measures. The introduction of SHAP-based techniques improves both computational speed and human-intuitive consistency, enabling a better balance of accuracy and interpretability.
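The additive property of these attributions can be made concrete with a small, exact Shapley computation. The sketch below is not the SHAP library’s algorithm: it enumerates every feature coalition, which is feasible only for a handful of features, and it assumes that an absent feature is represented by a single baseline value. The toy model and its three inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample (exponential cost in len(x))."""
    n = len(x)

    def value(present):
        # Features outside the coalition are set to their baseline value.
        mixed = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight of a coalition with `size` members.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi


def toy_model(v):
    # Hypothetical model: one additive term and one interaction term.
    return 2.0 * v[0] + v[1] * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi, sum(phi), toy_model(x) - toy_model(baseline))
```

For this toy model the three attributions are 2, 3, and 3 (up to floating-point rounding). They sum to the difference between the prediction (8) and the baseline output (0), and the product term is split equally between the two interacting features.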
Studies have consistently shown that OCM performance is sensitive to several critical chemical parameters, including catalyst elemental composition, oxidation states, catalyst-support interactions, and structural stability under reaction conditions. Catalyst promoters, such as alkali metals (Li, Na) and alkaline earth metals (Ba, Sr), have been frequently utilized to enhance ethylene selectivity by modifying the catalyst surface properties and reaction pathways. However, their precise roles vary significantly depending on catalyst compositions, highlighting the complex chemical environment within OCM systems. Additionally, recent experimental and kinetic modeling studies have highlighted the impact of heat and mass transfer limitations on catalyst performance, noting that these factors can significantly influence reaction rates, selectivity, and catalyst lifetime. Therefore, coupling detailed chemical experimentation with computational predictive models is essential to advance catalyst discovery and optimize operational parameters in the OCM research.
These studies collectively underscore the transformative impact of machine learning and computational approaches on the field of OCM, enabling the discovery of novel catalysts, optimization of reaction conditions, and a deeper understanding of the underlying reaction mechanisms. By leveraging big data and high-throughput experimentation, as well as integrating advanced computational models, the research community is progressively unveiling the complexities of OCM, pushing the boundaries of what is achievable in catalytic methane conversion. This body of work not only highlights the importance of interdisciplinary approaches in catalysis research, but also sets a foundation for future explorations aimed at sustainable and efficient methane utilization strategies.
4. Discussion
In this section, we delve into the interpretability of our predictive model using the SHAP (SHapley Additive exPlanations) analysis, focusing on the final model that combines stratification, ACPD feature aggregation, and SMBO. This final model was trained on dataset B, which was chosen for its rich diversity in elemental compositions compared to other datasets, providing a broader scope for exploration and analysis. The application of SHAP analysis provides a powerful lens for understanding the impact of individual features on the model’s predictions, enabling us to uncover the intricate relationships between variables and their contributions to catalytic performance.
4.1. Detailed Catalyst Feature Influence Unraveled by SHAP Analysis
The elucidation of our model’s decision-making process, achieved through SHAP Explainer, revealed the comprehensive anatomy of the catalyst’s influential features. The SHAP analysis dissected the influence hierarchy of the catalyst features within the model, revealing a layered structure of impact and significance. The integration of Figure 2 into our discussion offers a compelling visual narrative that pairs the average impact of features with their distribution and variability across predictions.
Figure 2 concisely conveys the average magnitude of each feature’s impact on the model output. The methane-to-oxygen partial pressure ratio, p(CH4)/p(O2), emerges as the most influential factor, corroborating the critical role of this ratio in the catalytic conversion processes. The pre-eminence of this ratio highlights its dual function: serving as a proxy for reactant availability and an indicator of reaction dynamics.
Adding to the catalytic narrative, the bar plot spotlights the importance of elemental additives, such as lithium (+Li) and sodium (+Na). Lithium’s substantial impact suggests its significant role in altering catalyst properties and, consequently, the reaction pathway. Similarly, sodium’s prominence indicates its potential to modulate the catalyst surface interactions, resonating with the nuanced balance of catalytic elements within the reaction milieu.
Figure 3 complements this by offering a granular depiction of each feature’s contribution across individual predictions. The plot’s color intensity and swarm distribution draw attention to the role of each feature’s value in shifting individual predictions. For instance, higher temperatures consistently drive higher SHAP values, underscoring the temperature’s critical role in facilitating reaction kinetics. Furthermore, the beeswarm plot reveals the nonlinear and complex interactions between features: certain features, such as contact time and preparation method (Preparation_n.a.), exhibit varied impacts across different observations, hinting at underlying complexities in the catalysis process that average effects alone do not capture.
The aggregation of minor feature impacts, depicted as the “Sum of 71 other features” presented in Figure 2, indicates a collective significance that rivals the top features. This aggregation underscores the multifaceted nature of the catalytic process, where numerous subtler factors collectively shape the predictive model’s output. Together, these figures not only illuminate the paramount features driving the predictive model, but also expose the breadth and depth of the feature interactions. This enhanced understanding directs us to refine the catalyst system, focusing on the optimization of key influential parameters, and offers insights into the potential synergy of combined feature effects.
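The bar-plot summary discussed above reduces to a simple calculation: average the absolute SHAP value of each feature over all samples, rank the features, and fold the tail into a single bar. The sketch assumes that the aggregated bar is the sum of the remaining features’ mean absolute values. The SHAP matrix and the feature labels are made up for illustration.

```python
def mean_abs_shap_summary(shap_rows, feature_names, top_k=9):
    """Rank features by mean |SHAP| and fold the remainder into one entry."""
    n_samples = len(shap_rows)
    means = [
        (name, sum(abs(row[j]) for row in shap_rows) / n_samples)
        for j, name in enumerate(feature_names)
    ]
    # Largest average magnitude first, as in a global importance bar plot.
    ranked = sorted(means, key=lambda item: item[1], reverse=True)
    top, rest = ranked[:top_k], ranked[top_k:]
    if rest:
        label = f"Sum of {len(rest)} other features"
        top.append((label, sum(value for _, value in rest)))
    return top


# Hypothetical SHAP values: 2 samples x 4 features.
rows = [[1.0, -0.5, 0.1, 0.0],
        [-3.0, 0.5, -0.1, 0.2]]
names = ["p(CH4)/p(O2)", "+Li", "+Na", "Temperature"]
for label, value in mean_abs_shap_summary(rows, names, top_k=2):
    print(f"{label}: {value:.2f}")
```

Because the tail is summed rather than averaged, many individually small features can add up to a bar comparable to the leading ones, which is the effect noted for the 71 aggregated features.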
This detailed analysis lays a foundation for advancing the interpretability of complex catalytic systems and demonstrates the value of machine learning as a tool for hypothesis generation in the catalytic research. The nuanced understanding gleaned from the SHAP analysis equips us to further the frontier of catalyst design, tailoring materials for specific reactions and operational conditions.
4.2. Stratified Insight Through Decision Plot Analysis
In this subsection, we dive deeper into the nuanced and collective influences of catalyst features across different performance strata by utilizing decision plots generated from SHAP values. Recognizing the complexity of OCM catalytic systems, these decision plots illustrate clearly how multiple catalyst parameters interact differently across high-yield (>25%), mid-yield (20–25%), and low-yield (15–20%) groups. It is essential to emphasize that the features shown in these decision plots, such as the methane-to-oxygen ratio and the presence of Li, Na, Ba, and others, are analyzed in the context of their collective contribution rather than as isolated elements. Different performance strata indeed highlight distinct combinations of influential parameters, revealing how specific catalyst formulations and reaction conditions synergistically define catalytic performance. This stratified analysis acknowledges the complex, context-dependent nature of catalytic effectiveness. It provides practical insights into how varying compositions and operational parameters must be collectively optimized within distinct yield ranges.
Specifically, yields exceeding 25% are often considered excellent and indicative of highly efficient catalyst formulations, reflecting the upper-performance range frequently reported as a goal in the OCM literature [2,7,14]. Yields in the range of 20–25% represent a moderate-to-good performance, typical of many catalysts reported in the experimental studies, and thus reflect a critical target range for optimization efforts. Meanwhile, yields in the range of 15–20% are indicative of a modest catalyst performance, still within the realm of practical experimental relevance, but generally signaling opportunities for further optimization or improvement. These thresholds thus provide a meaningful segmentation of catalytic performance, allowing for a more targeted analysis of feature influences and interactions across distinct performance levels.
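The segmentation described above can be expressed as a small helper. It is a sketch under two assumptions that the text does not state: a sample is assigned to a group only when both its actual and predicted yields fall in the same band (the text says this explicitly only for the high-yield group), and a yield exactly on a boundary belongs to the lower band. The function name is hypothetical.

```python
def yield_group(actual, predicted):
    """Return 'high', 'mid', or 'low' when both yields (%) share a band."""
    bands = [
        ("high", 25.0, float("inf")),  # > 25%
        ("mid", 20.0, 25.0),           # 20-25%
        ("low", 15.0, 20.0),           # 15-20%
    ]
    for label, lower, upper in bands:
        # Assumed convention: lower bound exclusive, upper bound inclusive.
        if lower < actual <= upper and lower < predicted <= upper:
            return label
    return None  # yields disagree on the band or fall below 15%


print(yield_group(27.0, 26.1))  # both above 25%
print(yield_group(26.0, 24.0))  # bands disagree, so no group
```

Samples for which the function returns `None` would simply be left out of the group-specific decision plots under these assumptions.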
1. High-Yield-Group (>25%) Decision Plot: Figure 4 shows the decision plot for samples with both predicted and actual yields exceeding 25%, which highlights how multiple catalyst parameters collectively result in a superior catalytic performance. Notably, the methane-to-oxygen ratio (p(CH4)/p(O2)), sodium (+Na), lithium (+Li), methane partial pressure, and reaction contact time emerge as critical parameters that synergistically influence the model’s output. Rather than acting in isolation, these features function interdependently, indicating their joint roles in establishing conditions that significantly enhance catalytic efficiency. This collective feature interplay underscores the importance of carefully balanced catalyst formulations and reaction parameters tailored specifically to achieve high yields.
2. Mid-Yield-Group (20–25%) Decision Plot: Within the mid-yield group, shown in Figure 5, a complex web of interdependent features is revealed, emphasizing the nuanced interplay necessary for a moderate catalytic performance. The methane-to-oxygen ratio, lithium (+Li), sodium (+Na), methane partial pressure, temperature, and contact time notably combine to guide the model’s predictions. Unlike the straightforward interactions observed in the high-yield group, this range indicates a delicate balance, where slight variations in catalyst preparation and operational conditions markedly influence the outcomes. This highlights the need for the careful optimization of catalyst composition and process conditions to sustain a consistent performance within this intermediate yield range.
3. Lower-Yield-Group (15–20%) Decision Plot: The decision plot for the lower-yield group, presented in Figure 6, indicates a notable shift in the feature influence landscape. The complexity of catalyst performance becomes increasingly apparent. Here, subtle shifts in multiple features (including the methane-to-oxygen ratio, lithium (+Li), sodium (+Na), barium (+Ba), temperature, catalyst preparation methods, and lanthanum (+La)) collectively exert a considerable influence on yields. The decision plot shows that achieving even a moderate performance requires precise coordination among numerous parameters. This underscores the critical nature of fine-tuning catalyst composition and operational conditions, especially at lower yield levels, to enhance catalytic effectiveness.
Across all groups, the trajectory lines in the decision plots weave a rich narrative of how cumulative feature effects lead to the final prediction, with each step representing a feature’s incremental impact. The ‘High-Yield-Group’ plot’s dense and steep trajectory underscores the potent combination of high reactant availability and favorable catalytic conditions. In contrast, the more variable and less steep trajectories observed in the ‘Mid-’ and ‘Lower-Yield Groups’ reflect a more delicate balance of conditions necessary to achieve moderate yields. The vertical dispersion of the lines within the decision plots illustrates the diversity of feature interactions within similar yield ranges. It denotes that while certain features have a consistent directional influence across all samples, their magnitudes can vary, indicating differential sensitivities within the operational range. The decision plot analysis, tailored to yield-specific sample groupings, unveils a stratified model, showcasing how different catalyst features drive the model’s output at varying levels of C2 yield performance.
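The cumulative reading of a decision plot can be reproduced numerically: each trajectory starts at the model’s base value and adds one feature’s SHAP value per step until it reaches the prediction. The sketch orders features by ascending absolute SHAP value for a single sample, which is a simplifying assumption (plotting tools typically order features by importance over all plotted samples). The numbers and feature labels are illustrative only.

```python
def decision_trajectory(base_value, shap_row, feature_names):
    """Cumulative path from the base value to the model output for one sample."""
    # Smallest contributions first, so the largest shifts come last.
    order = sorted(range(len(shap_row)), key=lambda j: abs(shap_row[j]))
    running = base_value
    path = [("base value", running)]
    for j in order:
        running += shap_row[j]  # one step = one feature's incremental impact
        path.append((feature_names[j], running))
    return path


# Hypothetical sample: base value 10% yield and three SHAP contributions.
names = ["p(CH4)/p(O2)", "+Li", "+Na"]
for label, value in decision_trajectory(10.0, [4.0, -1.0, 0.5], names):
    print(f"{label:>14}: {value:.1f}")
```

The last value on the path equals the base value plus the sum of all SHAP values, so a steep trajectory corresponds to a few large contributions and a meandering one to many small contributions of mixed sign.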
4.3. Deciphering Feature Impacts on Catalytic Efficiency Through SHAP Values
In this subsection, we delve into the interaction between key catalyst features and their SHAP values to elucidate their influence on the prediction of C2 yields. By examining scatter plots, which juxtapose feature values against SHAP values, the relationships that define the model’s predictive behavior are explained.
4.3.1. Impact of Methane-to-Oxygen Ratio on Predictive Accuracy
As presented in Figure 7, the scatter plot analysis of the methane-to-oxygen partial pressure ratio, p(CH4)/p(O2), reveals a compelling, inverse relationship with SHAP values, painting a complex picture of its role in catalytic efficiency. As the p(CH4)/p(O2) ratio increases, we initially observe a significant decrease in SHAP values, indicating a robust negative impact on the predicted yields. Notably, this downward trend in SHAP values sharply reverses at a ratio close to 5, suggesting a pivotal threshold for catalytic behavior. This trend reaches a zenith of the SHAP value at approximately 1.3, which could represent an optimal ratio for catalytic efficiency, before declining as the ratio continues to rise. Such a finding is instrumental, as it suggests that there exists a specific operational window wherein the reactant balance is most conducive for yield optimization. These insights could lead to critical operational adjustments in OCM processes, particularly in fine-tuning the gas feed composition. The ability to pinpoint this nonlinear optimal region (~1.3–5) and the clear threshold at a ratio of ~5, where catalytic contributions shift significantly, enables more precise resource utilization, potentially leading to significant enhancements in catalytic performance and yield optimization.
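One way to locate thresholds of this kind from a SHAP scatter plot is to average the SHAP values within bins of the feature and look for points where the binned curve changes direction. The sketch below assumes equal-width bins and a curve smooth enough that direction changes are meaningful. The function names and the example curve are hypothetical, and the figures in this study were not produced with this code.

```python
def binned_mean_shap(feature_values, shap_values, n_bins=10):
    """Mean SHAP value per equal-width bin, as (bin center, mean) pairs."""
    lo, hi = min(feature_values), max(feature_values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for v, s in zip(feature_values, shap_values):
        b = min(int((v - lo) / width), n_bins - 1)  # max value -> last bin
        sums[b] += s
        counts[b] += 1
    return [(lo + (b + 0.5) * width, sums[b] / counts[b])
            for b in range(n_bins) if counts[b]]


def turning_points(curve):
    """Interior points where the binned curve switches direction."""
    points = []
    for (_, y0), (x1, y1), (_, y2) in zip(curve, curve[1:], curve[2:]):
        if (y1 - y0) * (y2 - y1) < 0:
            points.append((x1, "peak" if y1 > y0 else "trough"))
    return points


# Hypothetical binned curve: falls, reverses, peaks, then declines.
curve = [(1, 0.5), (3, -0.2), (5, -0.8), (7, 0.4), (9, 0.1)]
print(turning_points(curve))
```

On real data the binned means are noisy, so the number of bins (or an added smoothing step) strongly affects which reversals are detected, and any threshold read off in this way should be treated as approximate.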
4.3.2. Lithium’s Role in Catalyst Performance
As presented in Figure 8, the scatter plot for lithium’s presence in the catalyst matrix indicates a mostly positive correlation with SHAP values, suggesting an increase in the predicted C2 yield up to a lithium concentration of about 17. Beyond this point, further increases in lithium concentration appear to diminish its positive impact, as evidenced by a decrease in SHAP values. The optimal influence of lithium on catalytic performance is thus reached at a lithium concentration of around 17, after which its efficacy in enhancing the yield predictions declines. This analysis points to the existence of an optimal lithium concentration for maximizing the catalyst’s efficiency. Identifying this nonlinear behavior, with an optimal threshold at ~17% lithium content beyond which the performance gains diminish, provides precise guidance for catalyst formulation and enhances yield optimization strategies.
4.3.3. Temperature as a Determinant of Catalytic Activity
Figure 9 presents a scatter plot for temperature, which indicates that as the temperature rises, there is a corresponding increase in SHAP values, suggesting a favorable impact on the model’s yield predictions up to approximately 990 K. Past this temperature, the SHAP values demonstrate increased variability, which may reflect a complex interplay of thermal effects on the catalytic process. The highest SHAP value is observed around 1050 K, pinpointing a specific temperature at which the catalyst’s predicted performance is maximized. This insight into the temperature dependence of catalytic efficiency highlights the criticality of maintaining an optimal temperature range for achieving the best possible yield in catalytic reactions. Explicitly recognizing this nonlinear relationship with a distinct operational threshold at ~990 K, beyond which catalytic predictions strongly improve, underscores the critical importance of precise thermal control for optimizing catalytic efficiency.
4.3.4. Sodium’s Contribution to Yield Predictions
As shown in Figure 10, the +Na scatter plot reveals that increases in sodium content are initially associated with negative impacts on the model’s yield predictions. This pattern of declining SHAP values suggests that higher concentrations of sodium may impede catalytic efficiency. A noticeable plateau in the SHAP values occurs beyond a certain sodium level, indicating a saturation point where additional sodium no longer significantly affects yields. The highest SHAP value for sodium is reached at a concentration of around 5%, the point where sodium’s contribution to catalytic activity is maximized. These data illustrate the nuanced role of sodium in catalysis: it may be beneficial at lower concentrations but deleterious when overrepresented, so sodium levels must be balanced within the catalyst formulation. Identifying this nonlinear trend and the optimum near 5% sodium, beyond which additional sodium yields diminishing returns, supports targeted adjustments in catalyst composition.
4.3.5. Influence of Lanthanum on Model Predictions
The scatter plot for lanthanum’s impact on catalytic performance, shown in Figure 11, indicates that moderate levels of lanthanum correspond to positive SHAP values, suggesting an enhancement in yield predictions up to a certain point. As the lanthanum concentration increases, its beneficial impact wanes: lanthanum can act as a catalyst activator, but its effectiveness plateaus and may become detrimental to the yield at higher concentrations. The highest SHAP value is observed at a lanthanum concentration of 10%, which marks the likely optimum for its catalytic contribution. Beyond this level, additional lanthanum does not substantially enhance the performance, a useful constraint for catalyst optimization.
Addressing Research Question 3, this discussion consolidates how SHAP value analysis elucidates the impact of catalyst components and operational parameters on C2 yield predictions within oxidative-coupling-of-methane (OCM) reactions. The analysis decodes the contributions and interactions of crucial features, enhancing model interpretability and guiding catalytic optimization.
Our consolidated SHAP value analysis examines the roles of specific features, such as the methane-to-oxygen ratio (p(CH4)/p(O2)), lithium (+Li), and reaction temperature, among others, in influencing the model’s predictive outcomes. This analysis uncovered several critical insights:
The methane-to-oxygen ratio emerges as a pivotal factor, where an optimal range is essential for maximizing catalytic efficiency. An increase in this ratio boosts the model’s yield predictions up to a certain threshold, beyond which further increases reduce the predicted yield, marking a delicate balance in operational parameters.
Lithium content within the catalyst matrix demonstrates a positive correlation with SHAP values up to a specific concentration, suggesting an optimal lithium level that facilitates catalytic activity without leading to diminishing returns.
The influence of reaction temperature on SHAP values reveals an optimal operational temperature that significantly enhances yield predictions, highlighting temperature’s critical role in catalysis.
Through the lens of SHAP scatter plots and decision plots, a nuanced understanding of nonlinear relationships and interaction effects among features is achieved. For example:
Sodium (Na) and lanthanum (La) content exhibit complex, nonlinear relationships with yield predictions, with their positive impacts plateauing at certain concentrations. This indicates a nuanced influence, where too much or too little can affect the yield outcomes.
The decision plots provide a stratified view of how different feature levels impact yield predictions across various ranges, offering a granular perspective that aids in identifying precise operational and compositional optimizations for enhanced catalytic performance.
These insights from the SHAP analysis significantly contribute to our understanding of the multifaceted nature of catalyst behavior. They reveal not only the key drivers of yield, but also the intricate interactions between various catalyst components and operational conditions. This enhanced model transparency, facilitated by SHAP analysis, opens new avenues for data-driven catalyst design and operational optimizations, crucial for the progression of OCM technologies. In summary, the SHAP analyses offer a profound validation of our model’s predictive capacity and furnish a detailed map of feature influences. This not only bridges the gap between theoretical predictions and practical catalysis applications, but also sets a benchmark for future catalyst development and reaction condition optimization efforts.
4.4. Practical Catalyst Design Recommendations and Experimental Optimization
Based on the SHAP analyses presented above, the following recommendations for catalyst optimization and experimental guidance can be made:
Methane-to-Oxygen Partial Pressure Ratio (p(CH4)/p(O2)): Optimal catalyst performance is achieved within a methane-to-oxygen ratio window of approximately 1.3–5. Operationally, keeping this ratio below the threshold of about 7 is critical, as higher ratios significantly reduce catalytic efficiency. Experimental optimization should therefore prioritize this window.
Lithium Concentration (+Li): The optimal lithium concentration is approximately 17%, beyond which catalytic benefits diminish substantially. Catalyst designs should target this concentration range, with precise compositional adjustments to maximize efficiency and resource use.
Reaction Temperature: Optimal catalytic conditions are found at reaction temperatures of 990–1050 K. Experimental strategies should focus on maintaining this thermal window, as deviations reduce the predicted performance.
Sodium Concentration (+Na): The optimal sodium concentration is around 5%, with diminishing returns beyond this threshold. Catalyst design should therefore tune the sodium content carefully around this value.
Lanthanum Concentration (+La): The optimal lanthanum content is around 10%, beyond which additional lanthanum contributes minimal or negative returns. Catalyst formulations should target this lanthanum content.
These optimal conditions and thresholds, derived directly from the SHAP analysis, offer practical guidance. Future experimental studies should use them to design catalysts and optimize operational conditions, thereby maximizing catalytic performance and resource efficiency.
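The thresholds listed above are read from SHAP scatter plots. As a rough illustration of how such a peak could be located numerically, the sketch below bins a feature and reports the bin with the highest mean SHAP value. The function name `peak_of_binned_shap`, the equal-width binning, and the synthetic lithium-like data are assumptions made for this example; they are not taken from the study.

```python
import numpy as np

def peak_of_binned_shap(feature_values, shap_values, n_bins=10):
    """Return (bin_center, mean_shap) for the bin with the highest mean SHAP value."""
    feature_values = np.asarray(feature_values, dtype=float)
    shap_values = np.asarray(shap_values, dtype=float)
    edges = np.linspace(feature_values.min(), feature_values.max(), n_bins + 1)
    # Using only the interior edges yields bin ids in the range 0..n_bins-1.
    bin_ids = np.digitize(feature_values, edges[1:-1])
    centers = 0.5 * (edges[:-1] + edges[1:])
    means = np.full(n_bins, -np.inf)  # empty bins can never win the argmax
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            means[b] = shap_values[mask].mean()
    best = int(np.argmax(means))
    return centers[best], means[best]

# Synthetic stand-in for a SHAP scatter plot with a peak near 17 (illustrative only).
rng = np.random.default_rng(0)
li = rng.uniform(0.0, 30.0, 2000)
shap_li = -0.01 * (li - 17.0) ** 2 + rng.normal(0.0, 0.05, li.size)

center, mean_shap = peak_of_binned_shap(li, shap_li, n_bins=15)
print(f"highest mean SHAP in the bin centered near {center:.1f}")
```

Binning smooths the scatter before taking the maximum, which makes the estimate less sensitive to single outlying points than picking the sample with the largest SHAP value.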
Overall, the SHAP-based feature interpretations reinforce the complexity inherent in OCM catalysis, clearly illustrating how catalytic efficiency emerges from the synergistic interactions among multiple chemical elements and operational conditions. These interactions extend beyond simple additive effects, often involving nonlinear and context-dependent relationships that necessitate meticulous experimental validation. The insights obtained here provide critical guidance for future studies aimed at experimentally verifying and refining these predictive insights, ultimately enhancing the rational design of advanced catalyst materials optimized for industrial OCM processes.
This research represents significant progress in predictive modeling of catalyst performance in the oxidative-coupling-of-methane (OCM) reaction. Two main constraints should be noted. First, we did not validate the approach on competing OCM reactions or on other types of catalytic reactions and systems. The ACPD shows promise for scalability, but it will likely require adjustment for processes with different types of data or chemical reactions. Second, interpretability carries a computational cost. SHAP analysis provides excellent transparency into machine learning predictions, but it demands substantial computing power. With extensive datasets and numerous features, SHAP becomes resource-intensive and slow, which poses a substantial obstacle to applying these models on an industrial scale.
Data quality remains the main limitation on the model’s accuracy, although careful optimization and well-planned feature selection improve its performance. The datasets drawn from published studies contain ample information but may introduce inconsistencies, because experimental methods differ between papers and publication bias may be present. To reach industrial application, the framework needs to be extended to more reaction types, validated against additional experimental data, and paired with fast, scalable methods for model interpretation.
5. Materials and Methods
The pursuit of optimizing catalytic processes, particularly in enhancing C2 yields through the oxidative coupling of methane (OCM), requires an integrated and scalable approach that combines advanced machine learning techniques, innovative feature engineering, and robust interpretability frameworks. Our methodology was structured to address this challenge holistically, encompassing data preprocessing, creative representation of catalyst compositions, rigorous model evaluation, hyperparameter tuning, and interpretability analysis.
As illustrated in Figure 12, the foundation of our approach is the development of the Aggregated Catalyst Physicochemical Descriptor (ACPD). This scalable and robust feature engineering technique captures the nuanced contributions of catalyst components. This innovation ensures that the representation of catalyst compositions remains comprehensive yet computationally efficient, enabling effective downstream modeling.
Building on this foundation, we evaluated a diverse array of regression models to identify the most predictive and reliable approaches. The evaluation was coupled with Modified Sequential Model-Based Optimization (SMBO), an advanced hyperparameter tuning strategy, to ensure optimal model performance. To provide a reliable and unbiased assessment, we employed the Cross-Validation with Stratified Regression (CVSR) technique, which accounts for variations in the dataset while maintaining representative splits. Model interpretability is a cornerstone of our methodology. Using frameworks such as SHAP, we delved into the Catalyst Anatomy, uncovering the intricate interplay between catalyst components and operational conditions. This interpretability analysis not only validated model predictions, but also provided deep insights into the underlying mechanisms influencing C2 yields. This comprehensive methodology bridges the gap between predictive accuracy and mechanistic understanding, paving the way for the rational design and optimization of catalysts in OCM processes. By integrating data-driven techniques with interpretability frameworks, we aimed to elucidate the complex dynamics of catalytic systems and advance the science of catalyst innovation.
5.1. Dataset
Our study builds on a large reference database [26] compiled from more than 1870 papers published over three decades of OCM research. This database contains a wide variety of catalyst compositions, comprising 68 different catalytic components (61 cations and 7 anions) that serve as active site promoters and support materials and provide a range of catalytic capabilities. To minimize data variation caused by the variable stoichiometry of elements in the OCM reaction, oxygen was excluded from these elements. The dataset was further structured by Schmack et al. [23], who subdivided the elements based on their functions in the catalyst compositions. This provided a systematic approach to analyzing the effect of the elemental composition of catalysts on the yield and selectivity of the OCM process, establishing a 68-dimensional feature space for predictive purposes. The primary focus of this dataset is the C2 yield (YC2), a crucial measure of the catalyst’s efficiency in OCM reactions. YC2 is recorded alongside the catalyst composition, the reaction conditions (temperature, CH4 and O2 partial pressures, total pressure, and contact time), and additional performance metrics (O2 conversion, CH4 conversion, and COx, ethane, and ethylene selectivity), which allows a rich analysis of the relationship between catalytic properties and the resulting C2 yield. This study combines this literature-derived dataset with the additional datasets described below, which collectively enhance the scope and robustness of the analysis.
Dataset A [26] is the largest and forms the foundational basis of our study. It comprises over 1870 papers spanning three decades, documenting 68 different catalytic components, as shown in Figure 13a (61 cations and 7 anions), utilized as active site promoters and supports. The dataset focuses on critical reaction parameters, including methane and oxygen partial pressures, reaction temperature, and total pressure, together with selectivity metrics and the C2 yield. While comprehensive, this dataset is literature-dependent, emphasizing the interplay between catalyst composition and performance.
Dataset B [2] expands upon the literature-based findings in dataset A. It incorporates 4759 experimental data points and emphasizes 74 catalytic elements, as shown in Figure 13b, using machine learning methods to extrapolate trends and identify potential novel catalysts. This dataset is partially dependent on dataset A, as it draws upon overlapping literature sources but also integrates additional experimental insights up to 2019. Its unique value lies in its ability to leverage machine learning to predict trends in catalytic performance.
Dataset C [23] focuses exclusively on high-throughput experimentation. It includes detailed experimental data comprising 12,708 data points for 59 catalysts across 3 successive operations, encompassing 27 distinct catalytic elements, as shown in Figure 13c. Metrics such as methane conversion rates, C2 selectivity, and operational conditions (temperature, pressure) are systematically documented. Dataset C’s independence from datasets A and B ensures that it provides experimental validation for trends identified in the literature-based datasets, enriching the robustness of the overall analysis.
5.2. Dataset Preprocessing
The datasets collectively provided a comprehensive range of variables, including catalyst composition (cations, anions, and support materials), reaction conditions (temperature, total pressure, and methane and oxygen partial pressures), and performance metrics, such as C2 yield (Y(C2), %), CH4 conversion, and selectivity for ethane and ethylene. Preprocessing involved pivoting the catalyst compositions and support materials so that molar percentages were aligned with their corresponding material identifiers. Each catalyst was represented as a row and each element as a column, with non-contributing elements filled with zero. Aggregating the composition data in this way ensured compatibility across datasets and facilitated analysis. Synthesis methodologies were captured using binary encoding of the preparation methods. The target variable, Y(C2), %, was kept continuous rather than discretized. This preserves fine distinctions in yield outcomes, enabling the model to account for small fluctuations that are crucial for determining the ultimate performance of catalysts, and it ensures that the predictions reflect the nuanced behavior of the catalytic systems. These preprocessing steps produced standardized and consistent data, enabling robust analysis and the optimization of predictive modeling frameworks for OCM processes.
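The pivoting and binary encoding described above can be sketched with pandas as follows. This is a minimal illustration on made-up records: the column names (`catalyst_id`, `element`, `mol_percent`, `preparation`, `y_c2`) and all values are hypothetical and do not come from the datasets used in the study.

```python
import pandas as pd

# Hypothetical long-format records: one row per (catalyst, element) pair.
records = pd.DataFrame({
    "catalyst_id": ["c1", "c1", "c2", "c2", "c3"],
    "element": ["Li", "Mg", "Na", "W", "La"],
    "mol_percent": [10.0, 90.0, 5.0, 95.0, 100.0],
})

# Hypothetical per-catalyst metadata: preparation method and continuous target.
meta = pd.DataFrame({
    "catalyst_id": ["c1", "c2", "c3"],
    "preparation": ["impregnation", "sol-gel", "impregnation"],
    "y_c2": [12.4, 18.1, 9.7],
}).set_index("catalyst_id")

# One row per catalyst, one column per element, zero for absent elements.
composition = records.pivot_table(
    index="catalyst_id", columns="element", values="mol_percent",
    aggfunc="sum", fill_value=0.0,
)

# Binary (one-hot) encoding of the preparation method.
prep = pd.get_dummies(meta["preparation"], prefix="prep").astype(int)

features = composition.join(prep)
target = meta.loc[features.index, "y_c2"]  # kept continuous, not binned
print(features)
```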
5.3. Aggregated Catalyst Physicochemical Descriptor (ACPD)
In the study by Mine et al. [2], the authors aimed to identify novel catalysts for the oxidative coupling of methane (OCM) by integrating physicochemical descriptors of elements with existing datasets. They employed an extrapolative machine learning approach, where each catalyst was represented by its constituent elements’ descriptors. Specifically, for a catalyst composed of elements $E_1, E_2, \dots, E_n$ with corresponding weight percentages $\omega_1, \omega_2, \dots, \omega_n$, and each element $E_i$ characterized by descriptors $d_{i,1}, d_{i,2}, \dots, d_{i,m}$, features were created by multiplying each element’s weight percentage by its descriptor value. This resulted in features such as $\omega_1 d_{1,1}$, $\omega_1 d_{1,2}$, and so forth for each element and descriptor. Consequently, the number of features scaled with both the number of components and the number of descriptors. For instance, with 5 elements and 20 descriptors, this method would generate 100 features, each representing the contribution of an element’s descriptor to the catalyst’s characteristics.
While the per-element method of Mine et al. provides detailed insights into individual element contributions, it is not scalable. The number of features changes with the number of involved elements, leading to a high-dimensional feature space that can complicate the model and cause overfitting, where a model performs well on known data but struggles with new, unfamiliar examples.
The ACPD addresses this issue by simplifying the representation of catalysts. Instead of keeping every property of every element as a separate feature, it combines them into a smaller, fixed set of values by taking a weighted average of each property, with the weights given by the proportion of each element present in the catalyst. In practice, each aggregated value is obtained by multiplying each element’s property by its weight percentage, summing over the elements, and dividing by the total weight percentage. This describes the catalyst as a whole rather than each component separately, so the data become easier to work with, and the models built on them are usually simpler, more accurate, and easier to interpret. This method reduces dimensionality, enhances scalability, and retains essential information based on the following formulas:
Aggregated Descriptor Calculation: For a catalyst composed of elements $E_1, E_2, \dots, E_n$ with weight percentages $\omega_1, \omega_2, \dots, \omega_n$, and each element $E_i$ characterized by descriptors $d_{i,1}, d_{i,2}, \dots, d_{i,m}$, the aggregated descriptor $D_j$ for descriptor $j$ is calculated as:
$$D_j = \frac{\sum_{i=1}^{n} \omega_i \, d_{i,j}}{\sum_{i=1}^{n} \omega_i}$$
where:
- $n$ is the number of elements in the catalyst.
- $\omega_i$ is the weight percentage of element $E_i$.
- $d_{i,j}$ is the value of descriptor $j$ for element $E_i$.
This equation ensures that each descriptor, $D_j$, represents a weighted average of the contributions of all constituent elements in the catalyst, normalized by their total weight percentages.
Categorical Descriptors Representation: For categorical descriptors, such as ‘group’ and ‘period’ in the periodic table, one-hot encoding is applied. Each element’s presence in the catalyst is indicated by setting the corresponding group and period features to 1, while non-contributing elements are set to 0. This encoding effectively captures categorical properties without unnecessarily increasing dimensionality.
Final Feature Set: The final feature set for each catalyst includes:
- Aggregated physicochemical descriptors, one value per descriptor.
- One-hot encoded categorical features for ‘group’ and ‘period’.
This approach offers several key innovations and advantages:
Scalability: By reducing the number of features to a fixed set of aggregated descriptors, the method is independent of the number of elements in the catalyst, ensuring scalability for large datasets.
Simplicity: The representation of catalyst compositions is simplified, making it easier to interpret and analyze.
Robustness: Aggregating descriptors ensures that the feature space captures essential information without being overly sensitive to specific element inclusion or exclusion.
Preservation of Nuances: The weighted averaging process retains fine details of individual element contributions, enabling accurate and insightful predictive modeling.
By implementing this aggregated approach, we achieved a scalable and robust representation of catalyst compositions that enhanced the predictive power of machine learning models while maintaining interpretability and reducing the risk of overfitting.
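The aggregation described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the element table, the two numeric descriptors, the radius values, and the function name `acpd` are hypothetical placeholders chosen for the example, not the descriptor set used in the study.

```python
import numpy as np
import pandas as pd

# Hypothetical element table with two numeric descriptors and two categorical ones.
# The radius values are illustrative placeholders.
element_table = pd.DataFrame(
    {
        "electronegativity": [0.98, 0.93, 1.10, 1.31],
        "atomic_radius_pm": [152.0, 186.0, 187.0, 160.0],
        "group": [1, 1, 3, 2],
        "period": [2, 3, 6, 3],
    },
    index=["Li", "Na", "La", "Mg"],
)
NUMERIC = ["electronegativity", "atomic_radius_pm"]
GROUPS = sorted(int(g) for g in element_table["group"].unique())
PERIODS = sorted(int(p) for p in element_table["period"].unique())

def acpd(composition):
    """Map {element: weight percentage} to a fixed-length feature dict."""
    elements = [e for e, w in composition.items() if w > 0]
    weights = np.array([composition[e] for e in elements], dtype=float)
    props = element_table.loc[elements, NUMERIC].to_numpy()
    # Weighted average of each descriptor, normalized by the total weight.
    aggregated = weights @ props / weights.sum()
    features = {f"agg_{name}": float(v) for name, v in zip(NUMERIC, aggregated)}
    # One-hot flags: 1 if any element of that group or period is present.
    present_groups = {int(g) for g in element_table.loc[elements, "group"]}
    present_periods = {int(p) for p in element_table.loc[elements, "period"]}
    for g in GROUPS:
        features[f"group_{g}"] = int(g in present_groups)
    for p in PERIODS:
        features[f"period_{p}"] = int(p in present_periods)
    return features

two_component = acpd({"Li": 10.0, "Mg": 90.0})
three_component = acpd({"Li": 5.0, "Na": 5.0, "La": 90.0})
print(two_component)
print(len(two_component) == len(three_component))  # feature count is fixed
```

The feature vector has the same length for the two-component and three-component catalysts, which is the scalability property claimed for the aggregated representation.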
5.4. Different Model Evaluations
In pursuit of the overarching goal to accurately predict the C2 yield (Y (C2), %) from the oxidative coupling of methane reactions, our study embarked on a comprehensive evaluation of a diverse array of regression models. This evaluation was conducted with the intention of identifying the model or models that exhibited the highest predictive accuracy and reliability. The regression models assessed included a wide range of algorithms, each with unique characteristics and assumptions about the underlying data. The models evaluated are as follows:
Random Forest is an ensemble learning method that builds multiple decision trees during training and combines their outputs to improve predictive accuracy. It excels in capturing nonlinear relationships and is robust against overfitting, making it suitable for datasets with high-dimensional features and complex interactions.
XGBoost is an efficient implementation of gradient-boosted decision trees designed for speed and performance. It employs regularization techniques to enhance generalization and is particularly effective in handling imbalanced data, making it a powerful choice for predictive tasks with intricate feature interactions.
LightGBM is a gradient-boosting framework that uses tree-based learning algorithms. Known for its efficiency and scalability, it handles large datasets with minimal memory consumption and supports advanced features, such as leaf-wise tree growth, for improved accuracy.
Extra Trees is another ensemble method similar to Random Forest but introduces randomness by selecting splits randomly during tree construction. This characteristic helps reduce overfitting and increase model diversity, making it a robust alternative for complex datasets.
Gradient Boosting builds models sequentially, where each new model corrects the errors of its predecessor. It is effective in optimizing prediction accuracy and is well-suited for datasets with nonlinear relationships and interactions among features.
These algorithms were evaluated using a 10-fold cross-validation strategy to ensure robustness and minimize bias. The results of this evaluation are presented in the subsequent sections, highlighting the models’ predictive capabilities and suitability for analyzing OCM catalyst performance.
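A comparison of this kind can be sketched with scikit-learn as shown below. The example is only a minimal illustration on synthetic data: it covers the three ensemble models that ship with scikit-learn, whereas XGBoost and LightGBM come from separate packages and could be added to the `models` dictionary in the same way if installed. The hyperparameter values and the synthetic dataset are assumptions for this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the catalyst feature matrix and the C2 yield.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)  # 10-fold cross-validation
results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=("r2", "neg_mean_squared_error")
    )
    mse = -scores["test_neg_mean_squared_error"].mean()
    # RMSE here is the square root of the fold-averaged MSE.
    results[name] = {"R2": scores["test_r2"].mean(), "MSE": mse, "RMSE": np.sqrt(mse)}

for name, metrics in results.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```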
5.5. Cross-Validation with the Stratified Regression (CVSR) Technique
Building on the foundation laid in the initial experiments, we introduced the Cross-Validation with Stratified Regression (CVSR) technique to refine our best-performing model. This novel approach integrates the concept of stratification with regression tasks, ensuring a balanced representation of the target variable’s distribution across each fold in cross-validation.
CVSR operates on the principle of stratified sampling, where the dataset $D$ is divided into $k$ mutually exclusive folds, with each fold maintaining a similar statistical distribution of the target variable, $Y$. Let $D_i$ represent the $i$-th fold, such that $D = \bigcup_{i=1}^{k} D_i$ and $D_i \cap D_j = \emptyset$ for $i \neq j$.
For a regression model, M, the CVSR process is formalized as follows:
Stratification: For each fold, Di, ensure that the distribution of Y within Di closely mirrors the overall distribution of Y in D.
Training and Validation: For each iteration, i, train M on D \ Di (all folds except Di) and validate on Di. This produces a prediction set Pi for Di.
Aggregation: Aggregate the prediction sets {P1, P2, …, Pk} to form the comprehensive prediction set P for D, which is then used to assess the model’s performance using metrics such as R2, MSE, and RMSE.
This technique ensures each validation fold is a stratified sample of the entire dataset, which is particularly beneficial for datasets with non-uniform distributions of the target variable. By preserving the target distribution in each fold, CVSR enhances the reliability and generalizability of the model’s performance metrics.
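The three steps above can be sketched as follows. The text does not specify how the continuous target is stratified, so this example assumes quantile binning of the target followed by scikit-learn’s `StratifiedKFold` on the bin labels; the synthetic data, the number of bins, and the choice of model are likewise assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic data with a skewed (non-uniform) target distribution.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = np.exp(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Stratification (assumed scheme): label each sample with its target decile.
y_bins = pd.qcut(y, q=10, labels=False)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(X, y_bins))  # each fold gets a similar mix of deciles

# Training and validation: fit on D \ D_i, predict on D_i, for every fold.
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
pred = cross_val_predict(model, X, y, cv=folds)

# Aggregation: score the pooled out-of-fold predictions.
mse = mean_squared_error(y, pred)
print(f"R2={r2_score(y, pred):.3f}  MSE={mse:.3f}  RMSE={np.sqrt(mse):.3f}")
```

Binning is used only to build the folds. The model is still trained and scored on the continuous target.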
5.6. Modified Sequential Model-Based Optimization (SMBO) for Hyperparameter Tuning
This study employed a Modified Sequential Model-Based Optimization (SMBO) approach to optimize the hyperparameters of the best-performing model from experiment 3, which is an Extra Trees regressor for datasets X and Y, and LightGBM for dataset Z. Unlike traditional SMBO, our method optimized each parameter sequentially while keeping the others fixed at their best-known values. This sequential strategy enabled us to evaluate the contribution of each parameter to model performance, striking a balance between efficiency and the ability to explore distinct parameter configurations. Let $\theta_j$ denote a parameter to optimize (e.g., max_depth, n_estimators, etc.). Define the function $f(\theta_j; \theta_{-j})$, where $\theta_{-j}$ represents all other parameters held at their current optimal values. This function, $f$, computes the model’s R2 score for each trial based on $\theta_j$, while keeping $\theta_{-j}$ fixed.
By iteratively maximizing $f$ for each parameter, $\theta_j$, we sequentially converge toward a configuration that achieves the best model performance.
For each parameter, $\theta_j$, we perform the following update at iteration $k$:
$$\theta_j^{(k)} = \arg\max_{\theta_j} f\left(\theta_j; \theta_{-j}^{(k-1)}\right)$$
where $k$ represents each iteration in which a single parameter is optimized, updating the parameter set toward the best configuration.
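The parameter-by-parameter update can be illustrated with a short sketch. This example assumes a small discrete candidate list per parameter and scores each trial by the mean cross-validated R2; the search space, starting values, synthetic data, and the single pass over the parameters are all assumptions for illustration and do not reproduce the study’s actual trial budget or tooling.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

# Hypothetical candidate values for each hyperparameter.
search_space = {
    "n_estimators": [20, 50],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 3],
}
best_params = {"n_estimators": 20, "max_depth": 4, "min_samples_leaf": 1}

def score(params):
    """Objective f: mean cross-validated R2 for one parameter configuration."""
    model = ExtraTreesRegressor(random_state=0, **params)
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

initial_score = score(best_params)
best_score = initial_score
history = []

# One parameter per iteration; all others stay at their best-known values.
for name, candidates in search_space.items():
    for value in candidates:
        trial = {**best_params, name: value}
        trial_score = score(trial)
        if trial_score > best_score:
            best_params, best_score = trial, trial_score
    history.append((name, best_params[name], best_score))

print(best_params, round(best_score, 3))
```

Because a candidate is accepted only when it improves the score, the recorded best R2 never decreases from one iteration to the next. The trade-off is that interactions between parameters are explored less thoroughly than in a joint search.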
5.7. Performance Metrics
The performance of each model was evaluated using three key metrics: the coefficient of determination (R2), the Mean Squared Error (MSE), and the Root Mean Squared Error (RMSE). These metrics were chosen to provide a holistic view of the models’ performance:
R2 (Coefficient of Determination): Indicates the proportion of variance in the dependent variable that can be predicted from the independent variables. A higher R2 signifies a better model performance.
MSE (Mean Squared Error): Measures the average squared difference between estimated values and actual values. A lower MSE indicates a better fit.
RMSE (Root Mean Squared Error): The square root of MSE, which expresses the error in the same units as the target variable. A lower RMSE suggests a more accurate model.
The results of this comprehensive evaluation are discussed in the subsequent sections, highlighting the models that demonstrated superior performance across the relevant metrics. This rigorous evaluation process is instrumental in guiding the selection of the most suitable model(s) for predicting the C2 yield in OCM reactions, thus contributing to the advancement of catalysis research.
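For reference, the three metrics can be computed directly from their standard definitions, as in the short sketch below. The function name `regression_metrics` and the four example values are hypothetical and serve only as a worked check.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R2, MSE, and RMSE from their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))
    ss_res = float(np.sum(residuals ** 2))                  # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total variation
    return {"R2": 1.0 - ss_res / ss_tot, "MSE": mse, "RMSE": rmse}

# Worked example: residuals are -1, 0, 1, 0, so MSE = 2/4 = 0.5.
# The total sum of squares around the mean (13) is 20, so R2 = 1 - 2/20 = 0.9.
metrics = regression_metrics([10.0, 12.0, 14.0, 16.0], [11.0, 12.0, 13.0, 16.0])
print(metrics)
```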
5.8. Catalyst Anatomy via SHAP Explainer
In the pursuit of not just predicting, but also understanding, the intricate dynamics that drive the C2 yield in oxidative-coupling-of-methane (OCM) reactions, our research examined the ‘anatomy’ of the catalyst using the SHapley Additive exPlanations (SHAP) Explainer. This approach facilitated a deeper exploration of how each feature contributed to the model’s predictions, thereby unveiling the complex interplay between catalyst components and their impact on the target variable. SHAP values, grounded in game theory, offer a robust framework for interpreting the predictions of machine learning models. By attributing the model’s output to its input features, SHAP values provide a detailed explanation of the contribution of each feature to individual predictions, offering a granular view of the model’s behavior. This methodology is particularly valuable in catalysis research, where understanding the specific role of each catalyst component can lead to significant insights and breakthroughs.
5.8.1. Summarizing Feature Effects
The SHAP Explainer was employed to summarize the effects of all features on the target variable. This was achieved by aggregating the SHAP values for each feature across the dataset, thereby providing a global perspective on the importance and impact of each feature. The summary plot generated by SHAP offered a visual representation of feature importance ranked by the magnitude of their impact on the model’s output, alongside the distribution of the effects each feature had on the model’s predictions.
5.8.2. Catalyst Feature Contributions
To delve into the specifics of how individual features contributed to particular predictions, we utilized SHAP’s waterfall plots on different samples. These plots offer a step-by-step decomposition of the prediction, starting from the base value (the model’s average prediction across the dataset) and sequentially adding the effect of each feature. By analyzing waterfall plots for various records, we were able to identify the unique contribution patterns of features in specific instances, providing an insight into the variability of feature contributions across different catalyst compositions and operational conditions.
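The additive decomposition that a waterfall plot visualizes can be demonstrated on a toy model. The sketch below computes exact Shapley values by brute-force enumeration of feature coalitions, with absent features averaged over a background sample. This is not the algorithm used by the SHAP library for tree models; the toy `model` function, the background data, and the sample `x` are assumptions chosen so that the example stays small and self-contained.

```python
from itertools import combinations
from math import factorial

import numpy as np

def model(X):
    # Toy stand-in for a fitted regressor: nonlinear, with one interaction term.
    return 2.0 * X[:, 0] + X[:, 1] ** 2 + X[:, 0] * X[:, 2]

def coalition_value(x, subset, background):
    """Mean prediction with the features in `subset` fixed to x, others from background."""
    data = background.copy()
    idx = list(subset)
    if idx:
        data[:, idx] = x[idx]
    return model(data).mean()

def shapley_values(x, background):
    """Exact Shapley values by enumerating every coalition (feasible for few features)."""
    n = x.size
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = (coalition_value(x, subset + (i,), background)
                        - coalition_value(x, subset, background))
                phi[i] += weight * gain
    return phi

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))
x = np.array([1.0, -2.0, 0.5])

base_value = model(background).mean()   # starting point of the waterfall
phi = shapley_values(x, background)     # one contribution per feature
prediction = model(x[None, :])[0]

print("base value:", round(float(base_value), 3))
print("contributions:", np.round(phi, 3))
print("base + sum of contributions:", round(float(base_value + phi.sum()), 3))
print("model prediction:", round(float(prediction), 3))
```

The last two printed numbers agree, which is the additivity property: the base value plus the per-feature contributions reconstructs the prediction for that sample.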
5.8.3. Relationship Exploration
Focusing on the most crucial feature identified through the global summary, we further explored its relationship with the target variable using SHAP’s dependence plot. This plot illustrates how the SHAP value (impact on model output) of a given feature varies with the feature’s actual value, highlighting potential nonlinear relationships or interactions with other features. By applying the dependence plot to the most influential feature, we gained a nuanced understanding of how this key catalyst component influences the C2 yield, potentially uncovering non-intuitive relationships and interaction effects that contributed to the catalyst’s performance. The application of SHAP Explainer to our model offered profound insights into the ‘anatomy’ of the catalyst, illuminating the specific roles and relative importance of various catalyst components and operational parameters. This analytical depth extends beyond mere prediction, providing actionable intelligence that guides the design and optimization of catalysts for improved performance in OCM reactions. Through this meticulous exploration, we not only enhanced the transparency and interpretability of our predictive model but also contributed to a deeper understanding of catalytic processes, laying the groundwork for future innovations in the field.
6. Conclusions
This study of predictive modeling for the oxidative coupling of methane (OCM) provided insight into the complexity of catalytic systems. By combining machine learning techniques with interpretable analysis, we improved predictive accuracy and clarified which factors drive catalytic performance. Central to our methodology were the Aggregated Catalyst Physicochemical Descriptors (ACPDs), which streamlined feature representation, and stratified regression, which improved model generalizability. Together, these addressed key challenges in handling complex and diverse datasets and enabled accurate and scalable predictive models.
Hyperparameter optimization using Modified Sequential Model-Based Optimization (SMBO) further refined the models and yielded strong performance across all datasets. Notably, dataset B, with its diverse elemental composition, demonstrated the robustness of our approach by achieving higher R² values than those reported in prior studies. This result indicates that the methodology adapts to varied data contexts, making it a versatile tool for catalysis research.
Complementing these advances, SHAP (SHapley Additive exPlanations) analysis provided an interpretive framework for dissecting the contributions of individual features to model predictions. Using visual tools such as scatter plots, decision plots, and feature importance analyses, we connected the computational predictions to the fundamental principles of catalysis. This analysis showed how catalyst components and operational variables interact to influence performance, offering actionable insights for experimental validation and catalyst design.
In conclusion, this study is more than an incremental refinement of predictive models. It advances the integration of machine learning with catalytic science and strengthens the link between computational tools and experimental chemistry. By improving both the understanding of catalytic processes and predictive accuracy, our research lays a foundation for higher efficiency and sustainability in chemical manufacturing. Combining data-driven approaches with fundamental scientific inquiry in this way supports future work on catalyst development and process optimization, and moves catalysis research toward a more informed and iterative practice.