A Machine Learning Model for Predicting the Propagation Rate Coefficient in Free-Radical Polymerization

Wang, Yiming; Fang, Yue; Zhou, Haifan; Gao, Hanyu

doi:10.3390/molecules29194694



Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China


Author to whom correspondence should be addressed.

Molecules 2024, 29(19), 4694; https://doi.org/10.3390/molecules29194694

Submission received: 6 September 2024 / Revised: 27 September 2024 / Accepted: 30 September 2024 / Published: 3 October 2024

(This article belongs to the Special Issue Deep Learning in Molecular Science and Technology)


Abstract

The propagation rate coefficient (k_p) is one of the most crucial kinetic parameters in free-radical polymerization (FRP) as it directly governs the rate of polymerization and the resulting molecular weight distribution. The k_p in FRP can typically be obtained through experimental measurements or quantum chemical calculations, both of which can be time consuming and resource intensive. Herein, we developed a machine learning model based solely on the structural features of monomers involved in FRP, utilizing molecular embedding and a Lasso regression algorithm to predict k_p more efficiently and accurately. The result shows that the model achieves a mean absolute percentage error (MAPE) of only 5.49% in the predictions for four new monomers, which indicates that the model exhibits strong generalization capabilities and provides reliable and robust predictions. In addition, this model can accurately predict the influence of the ester side chain length of (meth)acrylates on k_p, aligning well with established scientific knowledge. This approach offers a straightforward and practical model for other researchers to rapidly obtain accurate k_p values by employing monomer structural information. The model is sufficiently general to apply to a wide range of (meth)acrylate and butadiene FRP monomers, thereby supporting kinetic modeling of polymerization reactions.

Keywords:

propagation rate coefficient; SMILES; lasso regression; Molecular Transformer embeddings


1. Introduction

Kinetic rate coefficients in free-radical polymerization (FRP) are crucial in polymerization modeling and optimization, which can lead to the design and synthesis of novel materials [1]. These parameters are an integral part of a microkinetic model of polymerization (including chemical reaction mechanisms at the molecular level and differential equations describing concentration changes over time), which can predict important reaction performance metrics including monomer conversion, molecular weight distribution, and dispersity. Among these key kinetic parameters, the propagation rate coefficient (k_p) holds particular significance since k_p governs the overall rate of the polymerization reaction and reflects the inherent reactivity of the monomer so that it can be used to model the polymerization behavior of different monomers [2].

The k_p values are typically obtained through a variety of approaches, including direct and indirect experimental measurements as well as quantum chemical calculations [3,4,5]. Direct experimental measurements of k_p are typically conducted via the pulsed laser polymerization–size exclusion chromatography (PLP-SEC) technique, based on the rapid and periodic formation of primary radicals induced by high-frequency laser pulses [6,7]. Kinetic modeling and regression analysis serve as indirect experimental methods that estimate k_p from measured concentration–time data [8]. In addition to these experimental approaches, quantum chemical calculations employing high-level ab initio molecular orbital theory have been utilized to provide mechanistic insights and estimate k_p values [3].

However, these methods have their own limitations. For the direct experimental method, PLP-SEC is the standard method recommended by the IUPAC Working Party for the experimental determination of kinetic parameters such as k_p, the activation energy (E_A), and the pre-exponential factor (A) in FRP [9]. This technique employs laser pulses to periodically induce radical formation in monomer solutions, resulting in characteristic peaks in the polymer molecular weight distribution that enable the determination of k_p. By measuring k_p at multiple temperatures using this approach, the Arrhenius equation for the propagation of that monomer can be derived. Guided by this principle, Buback et al. have determined the Arrhenius equation for the k_p of styrene, while Beuermann et al. have carried out similar work for methyl methacrylate, ethyl methacrylate, and dodecyl methacrylate [7,10,11]. However, these measurements can be quite complex and time consuming, and reliable k_p values require IUPAC harmonization of data submitted from different laboratories.

Kinetic modeling and regression analysis are indirect experimental approaches to computationally determine the k_p in FRP, including deterministic methods and stochastic approaches. For instance, Zhou et al. utilized the method of moments to model the apparent k_p for the polymerization of butyl methacrylate and methyl methacrylate at 25 °C [5], and the results were found to be consistent with PLP-SEC data. Marien et al. developed an isothermal kinetic Monte Carlo model that accounts for all relevant elementary reactions to accurately simulate the complete PLP-SEC traces and extract the k_p for n-butyl acrylate [8]. However, this is essentially a data-fitting process that still relies on experimental measurements of concentration changes over time and therefore lacks predictive power.

Apart from these experimental methods, quantum chemical calculations coupled with transition state theory and high-level ab initio molecular orbital theory have been employed to calculate intrinsic rate coefficients of FRP. For example, Heuts et al. utilized quantum chemical calculations to determine the geometries, energies, and vibrational frequencies of the reactants and the transition state, and then applied transition state theory to obtain the Arrhenius parameters [3]. The calculated results for ethylene propagation were in good agreement with PLP-SEC data, although this approach is limited to propagation reactions that are not significantly influenced by the presence of solvents. Huang et al. calculated the k_p values for acrylonitrile and methacrylonitrile, and investigated the hindering effect of the methyl substituent on the rotational degrees of freedom in the transition state through quantum chemical calculations [12]. However, the accuracy of k_p values obtained from quantum chemical calculations depends on the precise description of the energetic and entropic barriers, and as the number of atoms in the monomer molecules increases, the computational cost and time required increase dramatically.

Recently, researchers have turned to the use of machine learning methods to predict k_p values. Reydt et al. achieved good fitting results using machine learning for (meth)acrylate-type FRP monomers [13]. They utilized GAMESS and ChemSpider to obtain various physicochemical properties of the monomers, including dipole moment, boiling point, melting point, surface tension, refractive index, and polarizability, and used these as features in a ridge regression model to predict k_p. While the model showed reasonable predictive performance for (meth)acrylate monomers, its ability to predict kinetic parameters for other FRP monomers, such as butadiene-type and styrene-type, was limited. Furthermore, the complexity of their features, which required computational software (GAMESS, version: 2018, R1) to obtain, also restricted the generalization capability of their model in predicting k_p for new monomers. Recently, Shi et al. developed a quantitative structure–property relationship model based on density functional theory and machine learning regression analysis to predict Arrhenius parameters and k_p, achieving high accuracy [14]. However, this machine learning approach still requires complex quantum chemical calculations, which are time consuming. Therefore, it is desirable to develop a generalized, accurate, and computationally efficient machine learning model for the prediction of k_p.

Herein, we propose a machine learning model based on the molecular structure of monomers to predict k_p values and Arrhenius parameters. Firstly, the PubChem database was utilized to obtain the Simplified Molecular-Input Line-Entry System (SMILES) representations of various monomers [15]. Subsequently, the SMILES was converted into both Molecular ACCess System (MACCS) fingerprints and Molecular Transformer embeddings [16,17], and these two types of fingerprints were then combined as the input features. The features were used to predict k_p using a Lasso regression model with a regularization term [18], which enabled automated feature selection of more influential variables while also preventing overfitting. To validate the generalizability of our model, monomers from external datasets were tested using this model. We demonstrated that by simply using molecular fingerprints derived from 2D molecular graphs (or the equivalent SMILES representation), the propagation rate coefficients can be predicted for (meth)acrylate and butadiene monomers with high accuracy, which can strongly contribute to the simulation of polymerization reactions and design of polymerization systems.

2. Results and Discussion

The fitting performance of the four regression models was first evaluated on the training set of 41 monomers under the same standards as shown in Table S1. Furthermore, the predictive capability of the trained models was compared on monomers outside the training set to select the models with the strongest generalization ability. Finally, reasonable predictions were also made for the Arrhenius pre-exponential factor A and activation energy E_A to predict the k_p values of new monomers at different temperatures.

2.1. Comparison of Four Regression Models on the Training Dataset

All models were compared using ln(k_p) values at 25 °C. The natural logarithm of the rate coefficient, rather than the raw k_p value, was used as the regression target so that back-transformed predictions, exp(ln k_p), are always positive and the models cannot return unphysical negative rate coefficients. The logarithmic scale also compresses the wide range of k_p values across monomers, which makes the regression more robust and reliable.

As shown in Figure 1, the fitting results of the four regression models all exhibited high quality, with coefficient of determination (R², Equation (S1)) values exceeding 0.9978 and root-mean-square errors (RMSE, Equation (S2)) less than 0.1000. This demonstrates the excellent fitting performance of these regression models on the training dataset and also indicates that the feature transformation process effectively preserved the majority of the structural information.

Figure 2 presents the fitting performance using the features selected by Reydt et al. [13], including molecular weight, polarity, and boiling point. Figure 2a shows the results for the same 41 monomers, while Figure 2b displays their fitting on a training set comprising (meth)acrylates, styrene, and acrylonitrile, after excluding monomers that could not be adequately described by their feature set. In the latter case, their model achieved an R² of 0.9855 and an RMSE of 0.2269, still inferior to any of the four regression models presented here.

To provide a more intuitive comparison of the prediction errors across the four models, the absolute percentage error (APE, Equation (S3)) between the predicted and actual k_p values was also examined.
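The three metrics can be sketched in a few lines of Python. This is a minimal illustration that assumes Equations (S1)–(S3) follow the standard textbook definitions (they are not reproduced in this text), with R² and RMSE evaluated on ln(k_p) and APE on the back-transformed k_p. All numerical values are hypothetical.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def ape(kp_true, kp_pred):
    """Absolute percentage error of each prediction, in percent."""
    return [abs(p - t) / t * 100.0 for t, p in zip(kp_true, kp_pred)]

# Hypothetical ln(k_p) values for four monomers (not taken from Table S1).
ln_kp_true = [6.0, 7.0, 9.5, 10.0]
ln_kp_pred = [6.1, 6.9, 9.6, 9.9]

# APE is evaluated on k_p itself, so the logarithm is undone first.
kp_true = [math.exp(v) for v in ln_kp_true]
kp_pred = [math.exp(v) for v in ln_kp_pred]

r2_value = r2_score(ln_kp_true, ln_kp_pred)
rmse_value = rmse(ln_kp_true, ln_kp_pred)
ape_values = ape(kp_true, kp_pred)
mean_ape = sum(ape_values) / len(ape_values)
print(r2_value, rmse_value, ape_values, mean_ape)
```

Note that an error of 0.1 in ln(k_p) corresponds to an APE of roughly 10% in k_p, which is why the two scales are reported separately.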

As depicted in Figure 3, consistent with the results evaluated using R² or RMSE, the multivariate linear regression model exhibited the most favorable performance, with over 70% of the monomers having APEs below 3%, further corroborating the efficacy of the selected features. In contrast, the ridge regression model showed the poorest performance, with nearly 50% of the monomers having APEs greater than 6%, indicating that the addition of the regularization penalty in ridge regression was not well suited for this training set. The Bayesian ridge regression, which combined the characteristics of ridge and Lasso regressions, yielded slightly inferior fitting results on the training set compared to Lasso regression, suggesting that the regularization term in Lasso regression was better able to capture the inherent relationship between the features and ln(k_p). However, given the extremely limited data points, the superior performance on the training set may be attributed to overfitting, which could lead to poor predictive capabilities for new monomers. Therefore, the four models were evaluated on a separate test dataset of new monomers.

2.2. Construction of the External Test Dataset

To test the generalizability of the model, we obtained additional data points from the literature outside the training set. As shown in Table 1, the kinetic data for dodecyl acrylate, tridecyl acrylate, tert-butyl methacrylate, and chloroprene were obtained from four different experimental laboratories to reduce the overall impact of experimental errors associated with PLP-SEC measurements on the evaluation of the model [19,20,21,22]. Notably, the data reported for tridecyl acrylate were actually obtained from a sample of tridecyl N acrylate, which is a combination of isomers of tridecyl acrylate with partially esterified side chains [20]. The literature also describes tridecyl A acrylate, which represents a distinct set of tridecyl acrylate isomers. The k_p value of tridecyl N acrylate is larger than that of dodecyl acrylate, whereas the k_p value of tridecyl A acrylate is smaller than that of dodecyl acrylate. Based on the general trend that longer acrylate ester side chains lead to larger k_p values [4], we inferred that the tridecyl N acrylate monomer had a relatively lower degree of side chain branching, and thus took its k_p value as the reference for tridecyl acrylate (TDA). The plausibility of the model predictions was then interpreted using the scientific principles of FRP.

2.3. Comparison of Four Regression Models on the Test Dataset

As shown in Figure 4, mean absolute percentage error (MAPE, Equation (S4)) was used to evaluate the overall predictive capability of the models on the four new monomers, while the individual APE values for each of the four monomers could detect the predictive capability of the models on different types of monomers.

As depicted in Figure 4c,d, not only did ridge regression and Bayesian ridge regression exhibit poor performance on the training set, but they also showed relatively inferior performance on the four new monomers, with MAPE values of 30.00% and 30.83%, respectively. The rationale behind this is that the regularization method employed in ridge regression is more appropriate for situations characterized by multicollinearity among the predictor variables. Conversely, if the feature set comprises orthogonal predictor variables, ridge regression may not demonstrate optimal performance. This suggests that the sub-structural features of monomers have independent impacts on the final k_p values in FRP.

Additionally, the simplest model, direct multivariate linear regression, also gave acceptable results, with a MAPE of only 11.60% on the test dataset. This indicates that the input features were sufficient to represent the monomer sub-structures that influence the chain propagation rate. However, the multivariate linear regression model failed to reproduce the established trend that longer acrylate ester side chains lead to larger k_p values [4]. For example, at 25 °C, the predicted k_p values for ethyl acrylate and propyl acrylate were 21,713 and 13,007 L mol⁻¹ s⁻¹, respectively, i.e., the prediction decreased rather than increased with chain length. This suggests that the absence of a regularization penalty caused a certain degree of overfitting on the training set.

Lasso regression showed the best test results, with a MAPE of only 5.49% for the four new monomers, and each monomer’s prediction bias was less than 8%. The rationale is that Lasso regression minimizes the sum of the absolute values of the regression coefficients, enabling it to shrink the coefficients of secondary variables to exactly zero while retaining the primary variables that influence the k_p values. Consequently, during the training process, the Lasso regression model can automatically select the key sub-structural features of the monomers that impact the k_p values, such as methyl substitutions on carbon–carbon double bonds, halogen substituents, and ester side chain lengths.

Overall, with the SMILES representations of the monomers as the initial features and proper feature engineering, Lasso regression outperforms any current ab initio calculations and the machine learning model of Reydt et al. [13], which used partial monomer properties as features, in predicting k_p.

2.4. Reflection of Scientific Principles

In the field of FRP, two well-established scientific laws have been validated. The first law states that for linear alkyl (meth)acrylates, the k_p gradually increases as the number of carbon atoms in the ester side chain increases [4,20]. The second law suggests that for (meth)acrylate esters with the same side chain, the k_p of the acrylate ester is far higher, by roughly one to two orders of magnitude, than that of the corresponding methacrylate ester [13]. To check the model against these two laws, the optimal Lasso regression model was employed to predict the k_p values of several new monomers, and the results are presented in Table 2.

The predictions show that as the number of side-chain carbon atoms increases from 12 to 15 (dodecyl, tridecyl, tetradecyl, and pentadecyl acrylate), the k_p values indeed increase gradually from 17,682 to 22,635 L mol⁻¹ s⁻¹. Furthermore, the k_p value of tetradecyl acrylate (22,034 L mol⁻¹ s⁻¹) is about 36 times that of tetradecyl methacrylate (611 L mol⁻¹ s⁻¹), reproducing the much higher reactivity of acrylates described by the second law. The agreement of the predictions with these scientific laws demonstrates the significant potential of our model in accurately and reliably predicting the k_p values of new monomers.

2.5. Predictions of k_p at Multiple Temperatures, A, and E_A

Following the procedure used to obtain Arrhenius parameters from PLP-SEC experiments, k_p values of the monomers were predicted at several temperatures and then fitted linearly to yield A and E_A. Specifically, the k_p values at 15 °C, 25 °C, 35 °C, 45 °C, 55 °C, and 65 °C were used as labels to train six well-fitted models, one per temperature, as shown in Figure S9. The R² values for these models were all above 0.9980, and the RMSEs were all below 0.0800. Subsequently, A and E_A were obtained by linearly fitting ln(k_p) against 1/T at the six temperatures according to the Arrhenius equation, and the results for the training set are presented in Figure 5.
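The linear fit described above can be sketched as follows. The sketch assumes the usual linearized form ln(k_p) = ln(A) − E_A/(RT), so that a straight line through the points (1/T, ln k_p) has slope −E_A/R and intercept ln(A). The function name `fit_arrhenius` and the parameter values used to generate the six "predicted" k_p values are hypothetical and are not taken from the tables.

```python
import math

R = 8.314462618  # universal gas constant, J mol^-1 K^-1

def fit_arrhenius(temps_c, kp_values):
    """Least-squares line through (1/T, ln k_p); returns (A, E_A in J/mol)."""
    x = [1.0 / (t + 273.15) for t in temps_c]
    y = [math.log(k) for k in kp_values]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx                    # equals -E_A / R
    intercept = y_mean - slope * x_mean  # equals ln(A)
    return math.exp(intercept), -slope * R

temps_c = [15, 25, 35, 45, 55, 65]

# Hypothetical stand-in for the six model predictions: k_p generated from
# assumed Arrhenius parameters, so the fit should recover them.
A_assumed, EA_assumed = 2.0e7, 18000.0
kp_pred = [A_assumed * math.exp(-EA_assumed / (R * (t + 273.15))) for t in temps_c]

A_fit, EA_fit = fit_arrhenius(temps_c, kp_pred)
print(A_fit, EA_fit)
```

With real model outputs, the six points will not lie exactly on a line, and the scatter around the fit contributes to the error in A and E_A.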

In contrast to the optimization of a single predictive model, the error associated with forecasting the parameters A and E_A using the ensemble of six models will exhibit a marginal increase. However, this is still far superior to the results reported by Reydt et al. [13]. Our E_A (R² = 0.9932, RMSE = 0.3961) and ln(A) (R² = 0.9714, RMSE = 0.1601) outperform their E_A (R² = 0.9630, RMSE = 0.8230) and ln(A) (R² = 0.6660, RMSE = 0.4270). This further demonstrates the effectiveness of our model in directly using SMILES structures for prediction.

As shown in Table 3, the predicted Arrhenius parameters E_A and A for the four new monomers on the test dataset are also provided. First, the six well-fitted models obtained from the training set were employed to predict the k_p values of the four new monomers at the six temperatures. Subsequently, the E_A and A parameters were derived by linearly fitting the predicted k_p values according to the Arrhenius equation.

The errors between the predicted and experimental values are illustrated in Figure 6. For the E_A predictions, the MAPE for the four new monomers was only 11.37%, and the APE was below 18% for all monomers. This further demonstrates the high accuracy and robustness of the model in predicting k_p values. In contrast, the prediction error for the A values was relatively larger, with a MAPE of 59.48%.

However, it is important to note that the PLP-SEC experiments used to measure k_p values inherently have an experimental uncertainty of 10–15% [23]. While this error is orthogonal to model prediction errors (our model is trained to predict the single-point experimental k_p values without considering these experimental uncertainties), it is reasonable to conclude that an error level of 10–15% in k_p values is practically acceptable. The A values are derived from fits of k_p values across multiple temperatures, and in this fitting process, A values are very sensitive to small changes in E_A values. Considering this, the prediction error for the A parameter is deemed acceptable in the context of FRP modeling.
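The sensitivity of A to E_A can be made explicit from the Arrhenius equation. The short derivation below is an illustration rather than part of the original analysis; it assumes that k_p at a single reference temperature T is held fixed while E_A is perturbed by a small amount δE_A (a symbol introduced here).

```latex
\ln A = \ln k_{p} + \frac{E_{A}}{R T}
\quad \Longrightarrow \quad
\delta (\ln A) = \frac{\delta E_{A}}{R T}
% At T = 298.15 K, R T \approx 2.48 kJ mol^{-1}, so \delta E_{A} = 1 kJ mol^{-1}
% gives \delta (\ln A) \approx 0.40, i.e. A changes by a factor of e^{0.40} \approx 1.5.
```

Under this assumption, an error of only 1 kJ mol⁻¹ in E_A near room temperature already shifts A by about 50%, which is consistent with the much larger MAPE observed for A than for E_A.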

Although our model is capable of providing reliable k_p value predictions for new (meth)acrylate and butadiene FRP monomers, limitations still exist. The accuracy of the k_p predictions may decline when certain sub-structural features of monomers are absent from the training set. Therefore, to enhance the model’s generalization capability, it is essential to expand the dataset to include a broader variety of FRP monomers. Furthermore, due to the inherent uncertainty in the PLP-SEC experimental data, although a small APE is exhibited by the predicted results compared to the PLP-SEC data, the APE in relation to the objectively true k_p values may be larger.

3. Methods

3.1. Construction of the Training Dataset

The dataset of kinetic parameters for the FRP monomers used to train our model is presented in Tables S1 and S2. These data were curated by Reydt et al. [13]. The majority of the data originate from the benchmark datasets recognized by IUPAC, while the remaining data come from individual laboratory experiments that are also considered reliable under IUPAC standards. The k_p values at different temperatures were given by the Arrhenius equation:

k_{p} = A \exp (- \frac{E_{A}}{R T})

(1)

where A is the pre-exponential factor, E_A is the activation energy for propagation reactions, R is the universal gas constant, and T is the absolute temperature.
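As a small illustration of how temperature-dependent labels can be generated from Equation (1), the sketch below evaluates k_p at six temperatures between 15 °C and 65 °C. The values of A and E_A are hypothetical placeholders, not entries from Tables S1 and S2, and the function name `arrhenius_kp` is introduced here for illustration only.

```python
import math

R = 8.314462618  # universal gas constant, J mol^-1 K^-1

def arrhenius_kp(A, E_A, temp_c):
    """Equation (1): k_p = A * exp(-E_A / (R * T)), with T in kelvin."""
    T = temp_c + 273.15
    return A * math.exp(-E_A / (R * T))

# Hypothetical Arrhenius parameters for one monomer.
A_example = 2.0e7      # L mol^-1 s^-1
EA_example = 18000.0   # J mol^-1

temps_c = [15, 25, 35, 45, 55, 65]
kp_by_temp = {t: arrhenius_kp(A_example, EA_example, t) for t in temps_c}

# The regression target is ln(k_p) rather than k_p itself.
ln_kp_by_temp = {t: math.log(k) for t, k in kp_by_temp.items()}
for t in temps_c:
    print(t, kp_by_temp[t], ln_kp_by_temp[t])
```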

3.2. Feature Representation

3.2.1. Scientific Understanding

Reydt et al. achieved satisfactory model fitting results in their development of machine learning models for the prediction of k_p by using intrinsic properties of the monomers as features, such as molecular weight, dipole moment, boiling point, melting point, and dissociation constant [13]. Specifically, molecular weight can indirectly influence k_p by reflecting changes in the length of the ester side chains in (meth)acrylate monomers. Dipole moment captures differences in polarity, which can directly impact both the reactivity in free-radical reactions and the solvation environment during bulk polymerization, thereby affecting the rate of propagation. The other features serve as complementary descriptors of the monomers’ molecular properties.

However, even though they identified the key molecular properties of monomers that influence the propagation rate, it remains challenging to provide a comprehensive representation of the molecular properties affecting k_p. Other potentially influential molecular characteristics, such as the spatial environment of double bonds, may have been overlooked as features in the model due to the difficulty in their quantitative description. Furthermore, some properties affecting k_p that are not well understood could not be incorporated. In addition, when applying the model to predict the k_p of new monomers, the physical properties might be unknown (e.g., melting point), which limits the generalizability of this approach.

Since molecular properties are fundamentally determined by molecular structure, utilizing the complete molecular structure of the monomers as features for the machine learning model could resolve the issue of incomplete statistical representation of molecular properties. Herein, we used the SMILES representation of the monomer molecules as the initial features, which were then transformed into an encoding format suitable for machine learning to construct an accurate and generalizable model.

3.2.2. SMILES and MACCS Fingerprints

The SMILES, based on the principles of molecular graph theory, provides a standardized representation of molecular structures, capturing information about atoms, bonds, aromaticity, stereochemistry, and other molecular features [15]. This makes the SMILES the raw input for our model. MACCS fingerprints and Molecular Transformer embeddings are two widely adopted methods for converting SMILES representations into molecular features.

MACCS fingerprints are binary vectors of 166 bits, where each bit represents the presence (1) or absence (0) of a predefined structural fragment within the molecule [17]. However, while MACCS fingerprints capture the main atomic, bond, and functional group information of a molecule, using this fingerprint alone may not be sufficient to represent the complete structural details of FRP monomers. For instance, dodecyl acrylate and tridecyl acrylate have the same structure, except for a single carbon atom difference in the ester side chain. As shown in Table S4, a model using only MACCS fingerprints as input provided the same predicted results for these two monomers, indicating they possess completely identical sets of the 166 structural fragments and therefore have the same fingerprints.
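The collision between two homologous monomers can be illustrated with a toy presence/absence fingerprint. The five keys below are hypothetical substring checks on SMILES text, chosen only for illustration; they are not the real MACCS keys, and real fingerprints are computed by substructure matching rather than string search.

```python
# Toy presence/absence keys (hypothetical, NOT the actual MACCS definitions).
TOY_KEYS = {
    "vinyl C=C": lambda s: "C=C" in s,
    "ester C(=O)O": lambda s: "C(=O)O" in s,
    "alpha-methyl C=C(C)": lambda s: "C=C(C)" in s,
    "chlorine": lambda s: "Cl" in s,
    "alkyl chain of at least 4 carbons": lambda s: "CCCC" in s,
}

def toy_fingerprint(smiles):
    """Return one bit per key: 1 if the pattern is present, 0 otherwise."""
    return [int(has_key(smiles)) for has_key in TOY_KEYS.values()]

# Linear alkyl acrylates that differ by a single CH2 unit in the side chain.
dodecyl_acrylate = "C=CC(=O)O" + "C" * 12
tridecyl_acrylate = "C=CC(=O)O" + "C" * 13

fp_dodecyl = toy_fingerprint(dodecyl_acrylate)
fp_tridecyl = toy_fingerprint(tridecyl_acrylate)
print(fp_dodecyl, fp_tridecyl)  # identical bit patterns

# A count-based descriptor, by contrast, does separate the two monomers.
print(dodecyl_acrylate.count("C"), tridecyl_acrylate.count("C"))
```

Because every key only records whether a fragment occurs at all, lengthening the chain by one carbon changes no bit, which is the behaviour reported for the MACCS-only model.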

3.2.3. Molecular Transformer Embeddings

SMILES representations of monomers were simultaneously converted into Molecular Transformer embeddings as a complement to MACCS fingerprints, to maximize the retention of molecular structural information [16]. The Molecular Transformer is a deep learning model that can generate embeddings (vector representations) from the SMILES strings of molecules, capturing their structural information. It is based on the Transformer architecture, consisting of an encoder and a decoder, and here we utilize only the encoder to obtain the molecular embeddings.

The conversion process involves tokenization, positional encoding, an embedding layer, a Transformer encoder, layer normalization, and output aggregation. During these processes, the positional encoding and the attention mechanism of the Transformer itself play an important role in preserving the structural information of small molecules [24]. The positional encoding can reflect the relative positional information of the atoms in the SMILES representation, which better captures the information of the cyclic structure and the length of side chains of the molecules. Furthermore, the global perception ability brought by the attention mechanism allows each atom or functional group to focus on any other element in the SMILES sequence, which can better represent the resonance effects, conjugation effects, and the overall structure of the molecules.

However, reliable Molecular Transformer embeddings for molecular structures require extensive training to determine the optimal settings of the Transformer architecture parameters. As shown in Figure S1, Morris et al. collected 8,300,000 molecular SMILES strings and IUPAC names from PubChem to train the Transformer model [16]. They have also made the pre-trained Transformer models publicly available, and we used these to obtain embeddings for the 41 FRP monomers in our dataset. Each resulting embedding is a 2D matrix whose number of rows equals the length of the SMILES string and whose number of columns is 512. Each matrix was then averaged over its rows to produce a 1D vector of length 512, which was used as the input for the regression models.

Ultimately, the MACCS fingerprint vector and the averaged Molecular Transformer embedding vector were concatenated to form the feature input for each of the 41 monomers. As illustrated in Supplementary Figure S2, SMILES strings were transformed into MACCS fingerprints, Molecular Transformer embeddings, and their combined encodings, which were then used to train three separate predictive models. The results (Figures S4–S6) demonstrate that integrating these two encoding approaches best preserves structural information during the SMILES-to-encoding conversion process.
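The pooling and concatenation steps can be sketched with NumPy. Random arrays stand in for the real encoder output and the real MACCS bits, since neither the pre-trained model nor a cheminformatics toolkit is loaded here; only the array shapes (rows equal to the SMILES length, 512 columns, 166 bits) follow the description in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

smiles = "C=CC(=O)OC"  # methyl acrylate
n_rows = len(smiles)   # one embedding row per SMILES character, as described

# Stand-in for the Transformer encoder output (random numbers, assumption).
token_embeddings = rng.normal(size=(n_rows, 512))

# Average over the rows: (n_rows, 512) -> (512,)
pooled_embedding = token_embeddings.mean(axis=0)

# Stand-in for the 166-bit MACCS fingerprint (random bits, assumption).
maccs_bits = rng.integers(0, 2, size=166).astype(float)

# Concatenate both encodings into one feature vector per monomer.
features = np.concatenate([maccs_bits, pooled_embedding])
print(pooled_embedding.shape, features.shape)
```

Under these assumptions, each monomer is described by 166 + 512 = 678 features, far more than the 41 monomers available for training.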

3.3. Algorithms of Regression Models

An appropriate machine learning model needs to be employed to identify the underlying relationship between these sub-structural features and the corresponding k_p values. Our approach to predicting the k_p values of new monomers is based on the statistical correlation between structure and property. We initially explored some complex machine learning algorithms, such as XGBoost (version: 2.0.3) and LightGBM (version: 4.4.0) [25,26], to fit the training data. However, as shown in Figure S6, due to the limited dataset size of only 41 data points, these sophisticated models were unable to effectively learn the inherent relationship between the structures and the k_p values, resulting in poor fitting performance.

Therefore, it is hypothesized that a simpler multivariate linear regression method may perform better on this small dataset, as it may be able to capture the essential features of the structure–property relationship more robustly. We have compared the performance of several regression methods (Figure S7), including multivariate linear regression [13], Lasso regression [18], ridge regression [13], and Bayesian ridge regression [27].

3.3.1. Multivariate Linear Regression

Multivariate linear regression is the most straightforward and direct method, which can be expressed using the following equation:

y = β_{0} + \sum_{i = 1}^{n} β_{i} x_{i}

(2)

where y is the dependent variable, x_i are the n independent variables, β₀ is the intercept, and the β_i are the regression coefficients. Linear regression employs the method of least squares to estimate the values of β, thereby minimizing the sum of squared residuals between the predicted and true values. However, because ordinary least squares applies no penalty to the coefficients, it may overfit when the number of features is large relative to the number of data points, potentially resulting in poor predictive performance on new monomers.
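The overfitting risk of unpenalized least squares can be demonstrated on synthetic data. The sketch below assumes 41 samples and 678 features (the concatenated 166-bit and 512-dimensional encodings) and fills both the features and the targets with random numbers, so there is no real relationship to learn; it is not a reproduction of the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_monomers, n_features = 41, 678

X = rng.normal(size=(n_monomers, n_features))  # random stand-in features
y = rng.normal(size=n_monomers)                # random targets unrelated to X

# Add a column of ones so that the first coefficient plays the role of beta_0.
X1 = np.hstack([np.ones((n_monomers, 1)), X])

# Least-squares solution of Equation (2); with more columns than rows,
# lstsq returns the minimum-norm solution among the exact fits.
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

train_residual = np.max(np.abs(X1 @ beta - y))
print(train_residual)  # numerically zero: the noise is fitted exactly
```

A perfect fit to pure noise shows that a small training error alone says little about predictive ability when features outnumber data points.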

3.3.2. Ridge Regression and Lasso Regression

Ridge regression builds upon the foundation of multivariate linear regression by introducing a regularization penalty term in the loss function to prevent overfitting. The loss function can be expressed as follows:

L = \sum {(y - X β)}^{2} + λ \sum β^{2}

(3)

where X is the feature matrix of independent variables. The first term, Σ(y − Xβ)², represents the sum of squared residuals between the predicted and actual values, which quantifies the model’s fit error. The second term, λΣβ², is the product of the regularization parameter λ and the sum of squared regression coefficients. The parameter λ can be tuned to control the model complexity, thereby balancing the trade-off between variance and bias. Ridge regression is well suited for situations where there is significant multicollinearity among the predictor variables.
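For the loss in Equation (3), the minimizer has the closed form β = (XᵀX + λI)⁻¹Xᵀy. The sketch below implements this formula with NumPy on synthetic data; it omits the intercept for brevity, and the data and the helper name `ridge_fit` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum((y - X b)^2) + lam * sum(b^2); no intercept term."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(41, 5))  # synthetic features
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=41)

# Larger lam shrinks the coefficient vector towards zero.
coef_norms = {lam: float(np.linalg.norm(ridge_fit(X, y, lam)))
              for lam in (0.0, 10.0, 1000.0)}
print(coef_norms)
```

The coefficients shrink smoothly as λ grows but generally do not become exactly zero, which is the main practical difference from the Lasso penalty.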

Lasso regression is similar to ridge regression, but the regularization term in the loss function is changed to the sum of the absolute values of the regression coefficients, which can be expressed as the following:

L = \sum {(y - X β)}^{2} + λ \sum |β|

(4)

This form of regularization can cause some of the regression coefficients to be precisely shrunk to zero, effectively performing feature selection. Therefore, Lasso regression can automatically select more influential features during the training process and discard the irrelevant ones. Lasso regression ultimately yields a sparse model, which facilitates the interpretation of individual feature influences on the target variable. This renders Lasso regression particularly well suited for scenarios involving a large number of features but a relatively small sample size.
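The way the absolute-value penalty produces exact zeros can be seen in a coordinate-descent sketch for the loss in Equation (4). Each coefficient is updated by soft-thresholding with threshold λ/2, which follows from minimizing that loss over one coefficient at a time. This is an illustrative solver on synthetic data with orthonormal feature columns, not the implementation used for the reported results.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0 inside the band."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_fit(X, y, lam, n_sweeps=100):
    """Coordinate descent for sum((y - X b)^2) + lam * sum(|b|)."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)  # x_j^T x_j for every column
    for _ in range(n_sweeps):
        for j in range(n_features):
            # Residual with the contribution of feature j added back in.
            partial_residual = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
# Orthonormal columns make the exact solution easy to verify by hand.
Q, _ = np.linalg.qr(rng.normal(size=(41, 5)))
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
y = Q @ beta_true

beta_hat = lasso_fit(Q, y, lam=1.0)
print(beta_hat)
```

With orthonormal columns, each nonzero coefficient is pulled towards zero by λ/2 = 0.5 (giving approximately 2.5 and −1.5), while the three irrelevant coefficients stay at exactly zero, which is the feature-selection behaviour described above.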

3.3.3. Bayesian Ridge Regression

Bayesian ridge regression is a probabilistic model that extends the concept of ridge regression by incorporating Bayesian principles [27]. It provides a probabilistic approach to estimating regression coefficients, assuming a prior distribution of the coefficients and then deriving their posterior distribution after observing the data. It first assumes that the noise term ε, that is, the deviation of the observed y from the linear prediction of Equation (2), follows a normal distribution with a mean of 0 and a variance of σ², while the regression coefficients β are assumed to have a Gaussian prior distribution. Subsequently, it is assumed that the noise precision α, which is the inverse of σ², follows a Gamma prior distribution. Then, using Bayes’ theorem, the posterior distribution of the parameters is calculated based on the prior distributions and the likelihood function of the observed data:

p (β, α| X, y) \propto p (y| X, β, α) p (β| α) p (α)    (5)

where p (y| X, β, α) is the likelihood function of the data, p (β| α) is the prior distribution of the coefficients, and p (α) is the prior distribution of the noise precision. Therefore, Bayesian ridge regression provides a probabilistic interpretation of model parameters, while also enabling automatic feature selection. This imbues the method with the benefits of ridge regression’s suitability for scenarios involving significant multicollinearity among predictor variables, as well as the feature selection capability of Lasso regression.
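To make the link to ridge regression concrete, the following minimal sketch considers a single feature without an intercept and holds both precisions fixed (in the full method they are inferred from the data). The coefficient-prior precision, written gamma here, is a symbol introduced only for illustration and does not appear in Equation (5). The data are invented. Under these assumptions the posterior of β is Gaussian, and its mean coincides with the ridge estimate for λ = gamma/α.

```python
def gaussian_posterior_1d(xs, ys, alpha, gamma):
    """Posterior mean and variance of a single coefficient beta.

    Assumes y_i = beta * x_i + noise with noise ~ N(0, 1/alpha),
    and a zero-mean Gaussian prior beta ~ N(0, 1/gamma).
    """
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    precision = alpha * sxx + gamma  # posterior precision of beta
    return alpha * sxy / precision, 1.0 / precision


def ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for one feature, no intercept."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)


# Toy data and fixed precisions (illustrative assumptions).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
alpha, gamma = 4.0, 8.0

post_mean, post_var = gaussian_posterior_1d(xs, ys, alpha, gamma)
ridge_slope = ridge_1d(xs, ys, lam=gamma / alpha)
print(post_mean, ridge_slope, post_var)
```

The posterior variance is the additional information that the Bayesian treatment provides over a point estimate from ridge regression.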

3.4. Validation Methods

Van de Reydt et al. have validated the effectiveness of leave-one-out cross-validation (LOOCV, Figure S8) in the analysis of regression model errors [13]. Therefore, the performance of the models on the training sets was also evaluated using this method. Specifically, multivariate linear regression and Bayesian ridge regression only used a single outer LOOCV, while ridge and Lasso regressions employed both outer and inner LOOCV. The outer LOOCV involved n fittings, each time using n − 1 monomers as the training data and the remaining one as the test data. The inner LOOCV was performed within the outer LOOCV framework, with n(n − 1) fittings to determine the optimal regularization parameter λ. Ultimately, the n model fittings from the outer LOOCV produced n predictions for each monomer in the training set. The average results of the n models in LOOCV were compared to the experimental values to evaluate the overall fitting performance, thus avoiding the influence of random errors from any single model. Similarly, when predicting k_p for new monomers, the average prediction of the n models was compared to the experimental data.
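The nested procedure can be sketched as follows. This is an illustration under stated assumptions, not the code used in this work. A one-feature ridge model with a closed-form solution stands in for the actual regressors, and the data and the candidate λ grid are invented. The counter shows that the inner loop performs n(n − 1) fits for each candidate λ.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for one feature, no intercept (toy stand-in model)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def leave_one_out(n):
    """Yield (train_indices, held_out_index) for each of the n splits."""
    for held_out in range(n):
        yield [i for i in range(n) if i != held_out], held_out


def nested_loocv(xs, ys, lambdas):
    """Outer LOOCV for evaluation, inner LOOCV for choosing lambda."""
    outer_predictions = []
    inner_fits = 0
    for train, test in leave_one_out(len(xs)):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        # Inner LOOCV on the n - 1 training points selects lambda.
        best_lam, best_err = None, float("inf")
        for lam in lambdas:
            err = 0.0
            for inner_train, inner_test in leave_one_out(len(tx)):
                slope = fit_ridge_1d([tx[i] for i in inner_train],
                                     [ty[i] for i in inner_train], lam)
                inner_fits += 1
                err += (ty[inner_test] - slope * tx[inner_test]) ** 2
            if err < best_err:
                best_lam, best_err = lam, err
        # Outer fit: refit on all n - 1 points with the selected lambda,
        # then predict the single held-out point.
        slope = fit_ridge_1d(tx, ty, best_lam)
        outer_predictions.append(slope * xs[test])
    return outer_predictions, inner_fits


# Toy data and lambda grid (illustrative assumptions).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
lambdas = [0.01, 0.1, 1.0]

predictions, inner_fits = nested_loocv(xs, ys, lambdas)
n = len(xs)
print(len(predictions), inner_fits)
```

Because λ is chosen using only the n − 1 training points of each outer split, the held-out monomer never influences the hyperparameter that is used to predict it.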

4. Conclusions

In summary, by using solely the SMILES representations of FRP monomers as input, we have developed a reliable and robust Lasso regression model that provides accurate predictions of k_p values. This model exhibits strong generalization capabilities, eliminating the need for monomer physical properties during k_p value predictions. The features in the model provide an accurate description of monomer structural information, enabling reasonable predictions as long as the sub-structural units (atoms, bonds, functional groups) of the new monomers have been encountered in the training set. Notably, the Lasso regression model achieves a high R² of 0.9985 on the training set, and a MAPE of only 5.49% in the predictions for the four new monomers, significantly outperforming the accuracy of quantum chemical calculations as well as previously reported machine learning models. Furthermore, the influence of the ester side chain length of (meth)acrylates on k_p was accurately predicted by this model, aligning well with established scientific knowledge. This high-accuracy and highly generalizable predictive model allows other researchers to simply input monomer information and rapidly obtain reliable k_p estimates, thereby accelerating their investigations into FRP mechanisms. In the future, it would be worthwhile to extend this model to other chain-growth systems, such as anionic and cationic polymerization, provided that sufficient kinetic data are collected and that solvents and initiators are incorporated into the features.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/molecules29194694/s1. Tables S1–S3: Training dataset. Equations (S1)–(S4): Definitions of statistical parameters. Figure S1: Diagram of Molecular Transformer embeddings. Figure S2: Diagram of SMILES processing. Figures S3–S5 and Table S4: Comparison of MACCS fingerprints, Molecular Transformer embeddings, and their combination. Figure S6: Fitting results of XGBoost and LightGBM. Figure S7: Flowchart of algorithms. Figure S8: Diagram of LOOCV. Figure S9: Fitting results at various temperatures. Tables S5–S10: Detailed results for Figure 1, Figure 4 and Figure 5. Reference [28] is cited in the Supplementary Materials.

Author Contributions

Conceptualization, Y.W. and H.G.; methodology, Y.W. and H.Z.; software, Y.W. and H.Z.; validation, Y.W.; data curation, Y.W.; writing—original draft preparation, Y.W. and Y.F.; writing—review and editing, Y.W., Y.F. and H.G.; visualization, Y.W. and Y.F.; supervision, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

Haifan Zhou is financially supported by the Hong Kong Ph.D. Fellowship Scheme (PF22-.78203). Yue Fang is financially supported by the Hong Kong Research Grants Council Early Career Scheme (26214522) and the HKUST Start-Up Grant.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code in this paper is available at https://github.com/jamesymwang/Kp-predict_MACCS-and-Molecular-Transformer (accessed on 6 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Beuermann, S.; Buback, M. Rate coefficients of free-radical polymerization deduced from pulsed laser experiments. Prog. Polym. Sci. 2002, 27, 191–254.
2. Nikitin, A.N.; Lacík, I.; Hutchinson, R.A. A 3D simulation investigation of the influence of temperature increases on the accuracy of propagation rate coefficients determined by Pulsed-Laser Polymerization. Macromolecules 2016, 49, 9320–9335.
3. Heuts, J.P.; Gilbert, R.G.; Radom, L. A priori prediction of propagation rate coefficients in free-radical polymerizations: Propagation of ethylene. Macromolecules 1995, 28, 8771–8781.
4. Kockler, K.B.; Haehnel, A.P.; Junkers, T.; Barner-Kowollik, C. Determining Free-Radical Propagation Rate Coefficients with High-Frequency Lasers: Current Status and Future Perspectives. Macromol. Rapid Commun. 2016, 37, 123–134.
5. Zhou, Y.-N.; Luo, Z.-H. Copper(0)-mediated reversible-deactivation radical polymerization: Kinetics insight and experimental study. Macromolecules 2014, 47, 6218–6229.
6. Barner-Kowollik, C.; Günzler, F.; Junkers, T. Pushing the limit: Pulsed laser polymerization of n-butyl acrylate at 500 Hz. Macromolecules 2008, 41, 8971–8973.
7. Buback, M.; Gilbert, R.G.; Hutchinson, R.A.; Klumperman, B.; Kuchta, F.D.; Manders, B.G.; O’Driscoll, K.F.; Russell, G.T.; Schweer, J. Critically evaluated rate coefficients for free-radical polymerization, 1. Propagation rate coefficient for styrene. Macromol. Chem. Phys. 1995, 196, 3267–3280.
8. Marien, Y.W.; Van Steenberge, P.H.; Barner-Kowollik, C.; Reyniers, M.-F.; Marin, G.B.; D’hooge, D.R. Kinetic Monte Carlo modeling extracts information on chain initiation and termination from complete PLP-SEC traces. Macromolecules 2017, 50, 1371–1385.
9. Beuermann, S.; Harrisson, S.; Hutchinson, R.A.; Junkers, T.; Russell, G.T. Update and critical reanalysis of IUPAC benchmark propagation rate coefficient data. Polym. Chem. 2022, 13, 1891–1900.
10. Beuermann, S.; Buback, M.; Davis, T.P.; Gilbert, R.G.; Hutchinson, R.A.; Kajiwara, A.; Klumperman, B.; Russell, G.T. Critically evaluated rate coefficients for free-radical polymerization, 3. Propagation rate coefficients for alkyl methacrylates. Macromol. Chem. Phys. 2000, 201, 1355–1364.
11. Beuermann, S.; Buback, M.; Davis, T.P.; Gilbert, R.G.; Hutchinson, R.A.; Olaj, O.F.; Russell, G.T.; Schweer, J.; Van Herk, A.M. Critically evaluated rate coefficients for free-radical polymerization, 2. Propagation rate coefficients for methyl methacrylate. Macromol. Chem. Phys. 1997, 198, 1545–1560.
12. Huang, D.M.; Monteiro, M.J.; Gilbert, R.G. A theoretical study of propagation rate coefficients for methacrylonitrile and acrylonitrile. Macromolecules 1998, 31, 5175–5187.
13. Van de Reydt, E.; Marom, N.; Saunderson, J.; Boley, M.; Junkers, T. A Predictive machine-learning model for propagation rate coefficients in radical polymerization. Polym. Chem. 2023, 14, 1622–1629.
14. Shi, Y.; Yu, M.; Liu, J.; Yan, F.; Luo, Z.-H.; Zhou, Y.-N. Quantitative structure–property relationship model for predicting the propagation rate coefficient in free-radical polymerization. Macromolecules 2022, 55, 9397–9410.
15. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
16. Morris, P.; St. Clair, R.; Hahn, W.E.; Barenholtz, E. Predicting binding from screening assays with transformer network embeddings. J. Chem. Inf. Model. 2020, 60, 4191–4199.
17. Durant, J.L.; Leland, B.A.; Henry, D.R.; Nourse, J.G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280.
18. Ranstam, J.; Cook, J.A. LASSO regression. J. Br. Surg. 2018, 105, 1348.
19. Buback, M.; Kurz, C.H.; Schmaltz, C. Pressure dependence of propagation rate coefficients in free-radical homopolymerizations of methyl acrylate and dodecyl acrylate. Macromol. Chem. Phys. 1998, 199, 1721–1727.
20. Haehnel, A.P.; Schneider-Baumann, M.; Arens, L.; Misske, A.M.; Fleischhaker, F.; Barner-Kowollik, C. Global trends for kp? The influence of ester side chain topography in alkyl (meth)acrylates−completing the data base. Macromolecules 2014, 47, 3483–3496.
21. Hutchinson, R.; Aronson, M.; Richards, J. Analysis of pulsed-laser-generated molecular weight distributions for the determination of propagation rate coefficients. Macromolecules 1993, 26, 6410–6415.
22. Pascal, P.; Winnik, M.A.; Napper, D.H.; Gilbert, R.G. Pulsed laser study of the propagation kinetics of tert-butyl methacrylate. Die Makromol. Chem. Rapid Commun. 1993, 14, 213–215.
23. Hutchinson, R.A.; Beuermann, S. Critically evaluated propagation rate coefficients for radical polymerizations: Acrylates and vinyl acetate in bulk (IUPAC Technical Report). Pure Appl. Chem. 2019, 91, 1883–1888.
24. Luong, K.-D.; Singh, A. Application of Transformers in Cheminformatics. J. Chem. Inf. Model. 2024, 64, 4392–4409.
25. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
26. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
27. Tipping, M.E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 2001, 1, 211–244.
28. Cha, G.-W.; Moon, H.J.; Kim, Y.-M.; Hong, W.-H.; Hwang, J.-H.; Park, W.-J.; Kim, Y.-C. Development of a prediction model for demolition waste generation using a random forest algorithm based on small datasets. Int. J. Environ. Res. Public Health 2020, 17, 6997.

Figure 1. Results of the fitting analyses for predicted ln(k_p^{25 °C}) versus experimental ln(k_p^{25 °C}): (a) multivariate linear regression model; (b) Lasso regression model; (c) ridge regression model; (d) Bayesian ridge regression model [13].

Figure 2. Regression models trained by Van de Reydt et al. (a) All monomers (R² = 0.7221, RMSE = 1.0125); (b) (meth)acrylates, styrene, and acrylonitrile (R² = 0.9855, RMSE = 0.2269) [13].

Figure 3. APE distribution of predicted k_p^{25 °C} and experimental k_p^{25 °C} for (a) multivariate linear regression; (b) Lasso regression; (c) ridge regression; and (d) Bayesian ridge regression.

Figure 4. Predictive results on the test dataset: (a) multivariate linear regression; (b) Lasso regression; (c) ridge regression; (d) Bayesian ridge regression.

Figure 5. Fitting analyses on the training set for (a) E_A; (b) ln(A).

Figure 6. APE of predicted and experimental values on the test dataset for (a) E_A; (b) A.

Table 1. Test dataset of k_p values and Arrhenius parameters for new monomers.

Monomers	Abbr.	A [L mol⁻¹ s⁻¹]	E_A [kJ mol⁻¹]	k_p^{25 °C} [L mol⁻¹ s⁻¹]
Dodecyl acrylate [19]	DA	10,900,000	15.80	18,588
Tridecyl acrylate [20]	TDA	5,710,000	14.08	19,489
Tert-butyl methacrylate [22]	t-BMA	25,100,000	27.70	352
Chloroprene [21]	CP	19,500,000	26.63	421
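The tabulated k_p^{25 °C} values are consistent with the Arrhenius parameters in the same rows. As a minimal check, assuming the standard Arrhenius form k_p = A exp(−E_A/(RT)) with R = 8.314 J mol⁻¹ K⁻¹ and T = 298.15 K (the function and variable names are illustrative), the last column of Table 1 can be recomputed from the A and E_A columns:

```python
import math

R = 8.314    # gas constant in J mol^-1 K^-1 (assumed value)
T = 298.15   # 25 °C expressed in kelvin

# (A [L mol^-1 s^-1], E_A [kJ mol^-1], tabulated k_p at 25 °C) copied from Table 1
table1 = {
    "DA": (10_900_000, 15.80, 18_588),
    "TDA": (5_710_000, 14.08, 19_489),
    "t-BMA": (25_100_000, 27.70, 352),
    "CP": (19_500_000, 26.63, 421),
}


def arrhenius_kp(A, Ea_kj, temperature=T):
    """Arrhenius rate coefficient; E_A is converted from kJ/mol to J/mol."""
    return A * math.exp(-Ea_kj * 1000.0 / (R * temperature))


recomputed = {name: arrhenius_kp(A, Ea) for name, (A, Ea, _) in table1.items()}
for name, kp in recomputed.items():
    print(f"{name}: recomputed {kp:.0f}, Table 1 {table1[name][2]}")
```

The recomputed values agree with the tabulated ones to well within 1%. Any remaining difference comes from rounding of A and E_A.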

Table 2. Predicted k_p values using Lasso regression.

Monomers	k_p ^{25 °C} [L mol⁻¹ s⁻¹]
Dodecyl acrylate	17,682
Tridecyl acrylate	20,333
Tetradecyl acrylate	22,034
Pentadecyl acrylate	22,635
Tetradecyl methacrylate	611
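The reported MAPE for the four new monomers can be reproduced from the tabulated numbers. The sketch below assumes the usual definition of MAPE (mean of |predicted − experimental| / experimental, in percent). The experimental values are taken from Table 1. The Lasso predictions for DA and TDA are taken from Table 2, and those for t-BMA and CP from the 25 °C rows of Table 3. The variable names are illustrative.

```python
# Experimental k_p at 25 °C [L mol^-1 s^-1] from Table 1
experimental = {"DA": 18_588, "TDA": 19_489, "t-BMA": 352, "CP": 421}

# Lasso predictions at 25 °C [L mol^-1 s^-1] from Tables 2 and 3
predicted = {"DA": 17_682, "TDA": 20_333, "t-BMA": 381, "CP": 440}


def mape(true, pred):
    """Mean absolute percentage error over the keys of `true`, in percent."""
    return 100.0 * sum(abs(pred[k] - true[k]) / true[k] for k in true) / len(true)


mape_value = mape(experimental, predicted)
print(f"MAPE = {mape_value:.2f}%")  # MAPE = 5.49%
```

The largest single contribution comes from tert-butyl methacrylate (about 8%), while the other three monomers each deviate by less than 5%.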

Table 3. Predictive results of Arrhenius parameters and k_p at multiple temperatures for four new monomers.

Monomers	Abbr.	T [°C]	Predicted k_p [L mol⁻¹ s⁻¹]	Predicted A [L mol⁻¹ s⁻¹]	Predicted E_A [kJ mol⁻¹]
Dodecyl acrylate	DA	15	14,621	5,887,746	14.38
Dodecyl acrylate	DA	25	17,682
Dodecyl acrylate	DA	35	21,396
Dodecyl acrylate	DA	45	25,672
Dodecyl acrylate	DA	55	30,192
Dodecyl acrylate	DA	65	35,402
Tridecyl acrylate	TDA	15	16,627	7,247,982	14.56
Tridecyl acrylate	TDA	25	20,333
Tridecyl acrylate	TDA	35	24,705
Tridecyl acrylate	TDA	45	29,663
Tridecyl acrylate	TDA	55	34,847
Tridecyl acrylate	TDA	65	40,749
Tert-butyl methacrylate	t-BMA	15	270	3,407,435	22.59
Tert-butyl methacrylate	t-BMA	25	381
Tert-butyl methacrylate	t-BMA	35	505
Tert-butyl methacrylate	t-BMA	45	664
Tert-butyl methacrylate	t-BMA	55	857
Tert-butyl methacrylate	t-BMA	65	1105
Chloroprene	CP	15	317	4,181,175	22.71
Chloroprene	CP	25	440
Chloroprene	CP	35	589
Chloroprene	CP	45	780
Chloroprene	CP	55	1019
Chloroprene	CP	65	1284
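The predicted Arrhenius parameters in Table 3 are numerically consistent with a straight-line fit of ln(k_p) against 1/T over the six predicted temperatures. The sketch below illustrates this for dodecyl acrylate. It assumes an ordinary least-squares fit and R = 8.314 J mol⁻¹ K⁻¹. The function name is hypothetical, and the fitting details are an assumption rather than the documented procedure.

```python
import math

R = 8.314  # gas constant in J mol^-1 K^-1 (assumed value)

# Predicted k_p for dodecyl acrylate from Table 3: (T [°C], k_p [L mol^-1 s^-1])
da_rows = [(15, 14_621), (25, 17_682), (35, 21_396),
           (45, 25_672), (55, 30_192), (65, 35_402)]


def fit_arrhenius(rows):
    """Least-squares line through ln(k_p) versus 1/T.

    Returns (A, E_A) with E_A in kJ/mol, using ln(k_p) = ln(A) - E_A / (R T).
    """
    xs = [1.0 / (t_c + 273.15) for t_c, _ in rows]
    ys = [math.log(kp) for _, kp in rows]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx                     # equals -E_A / R
    intercept = y_mean - slope * x_mean   # equals ln(A)
    return math.exp(intercept), -slope * R / 1000.0


A_fit, Ea_fit = fit_arrhenius(da_rows)
print(f"A = {A_fit:.3e} L mol^-1 s^-1, E_A = {Ea_fit:.2f} kJ mol^-1")
```

The fitted values should land close to the tabulated A of 5,887,746 L mol⁻¹ s⁻¹ and E_A of 14.38 kJ mol⁻¹ for dodecyl acrylate. The same function can be applied to the rows of the other three monomers.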

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
