Article

A Machine Learning Model for Predicting the Propagation Rate Coefficient in Free-Radical Polymerization

Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China
*
Author to whom correspondence should be addressed.
Molecules 2024, 29(19), 4694; https://doi.org/10.3390/molecules29194694 (registering DOI)
Submission received: 6 September 2024 / Revised: 27 September 2024 / Accepted: 30 September 2024 / Published: 3 October 2024
(This article belongs to the Special Issue Deep Learning in Molecular Science and Technology)

Abstract

The propagation rate coefficient (kp) is one of the most crucial kinetic parameters in free-radical polymerization (FRP) as it directly governs the rate of polymerization and the resulting molecular weight distribution. The kp in FRP can typically be obtained through experimental measurements or quantum chemical calculations, both of which can be time consuming and resource intensive. Herein, we developed a machine learning model based solely on the structural features of monomers involved in FRP, utilizing molecular embedding and a Lasso regression algorithm to predict kp more efficiently and accurately. The result shows that the model achieves a mean absolute percentage error (MAPE) of only 5.49% in the predictions for four new monomers, which indicates that the model exhibits strong generalization capabilities and provides reliable and robust predictions. In addition, this model can accurately predict the influence of the ester side chain length of (meth)acrylates on kp, aligning well with established scientific knowledge. This approach offers a straightforward and practical model for other researchers to rapidly obtain accurate kp values by employing monomer structural information. The model is sufficiently general to apply to a wide range of (meth)acrylate and butadiene FRP monomers, thereby supporting kinetic modeling of polymerization reactions.

1. Introduction

Kinetic rate coefficients in free-radical polymerization (FRP) are crucial in polymerization modeling and optimization, which can lead to the design and synthesis of novel materials [1]. These parameters are an integral part of a microkinetic model of polymerization (including chemical reaction mechanisms at the molecular level and differential equations describing concentration changes over time), which can predict important reaction performance metrics including monomer conversion, molecular weight distribution, and dispersity. Among these key kinetic parameters, the propagation rate coefficient (kp) holds particular significance since kp governs the overall rate of the polymerization reaction and reflects the inherent reactivity of the monomer so that it can be used to model the polymerization behavior of different monomers [2].
The kp values are typically obtained through a variety of approaches, including direct and indirect experimental measurements as well as quantum chemical calculations [3,4,5]. Direct experimental measurements of the kp are typically conducted via the pulsed laser polymerization–size exclusion chromatography (PLP-SEC) technique, based on the rapid and periodic formation of primary radicals induced by high-frequency laser pulses [6,7]. Kinetic modeling and regression analysis serve as indirect experimental methods to predict kp from measured temporal change of concentration data [8]. In addition to these experimental approaches, quantum chemical calculations employing high-level ab initio molecular orbital theory have been utilized to provide mechanistic insights and estimate kp values [3].
However, each of these methods has limitations. For the direct experimental method, PLP-SEC is the standard method recommended by the IUPAC Working Party for the experimental determination of kinetic parameters such as the kp, activation energy (EA), and pre-exponential factor (A) in FRP [9]. This technique employs laser pulses to periodically induce radical formation in monomer solutions, resulting in characteristic peaks in the polymer molecular weight distribution that enable the determination of the kp. By measuring kp at multiple temperatures using this approach, the Arrhenius equation for the propagation of that monomer can be derived. Guided by this principle, Buback et al. determined the Arrhenius equation for the kp of styrene, while Beuermann et al. carried out similar work for methyl methacrylate, ethyl methacrylate, and dodecyl methacrylate [7,10,11]. However, these measurements can be quite complex and time consuming, and reliable kp values require IUPAC harmonization of data submitted from different laboratories. Kinetic modeling and regression analysis are indirect approaches that determine the kp in FRP computationally from experimental data, and they include both deterministic and stochastic methods. For instance, Zhou et al. utilized the method of moments to model the apparent kp for the polymerization of butyl methacrylate and methyl methacrylate at 25 °C [5], and the results were found to be consistent with PLP-SEC data. Marien et al. developed an isothermal kinetic Monte Carlo model that accounts for all relevant elementary reactions to accurately simulate the complete PLP-SEC traces and extract the kp for n-butyl acrylate [8]. However, such approaches are essentially data fitting: they still rely on experimental measurements of concentration changes over time and therefore lack predictive power.
Apart from these experimental methods, quantum chemical calculations coupled with transition state theory and high-level ab initio molecular orbital theory have been employed to calculate intrinsic rate coefficients of FRP. For example, Heuts et al. utilized quantum chemical calculations to determine the geometries, energies of reactants, vibrational frequencies, and the transition state, and then applied transition state theory to obtain the Arrhenius parameters [3]. The calculated results for ethylene propagation were in good agreement with PLP-SEC data, although this approach is limited to propagation reactions that are not significantly influenced by the presence of solvents. Huang et al. calculated the kp values for acrylonitrile and methacrylonitrile, and investigated the hindering effect of the methyl substituent on the rotational degrees of freedom in the transition state through quantum chemical calculations [12]. However, the accuracy of kp values obtained from quantum chemical calculations depends on the precise description of the energetic and entropic barriers, and as the number of atoms in the monomer molecules increases, the computational cost and time required would increase dramatically.
Recently, researchers have turned to the use of machine learning methods to predict kp values. Reydt et al. achieved good fitting results using machine learning for (meth)acrylate-type FRP monomers [13]. They utilized GAMESS and ChemSpider to obtain various physicochemical properties of the monomers, including dipole moment, boiling point, melting point, surface tension, refractive index, and polarizability, and used these as features in a ridge regression model to predict kp. While the model showed reasonable predictive performance for (meth)acrylate monomers, its ability to predict kinetic parameters for other FRP monomers, such as butadiene-type and styrene-type, was limited. Furthermore, the complexity of their features, which required computational software (GAMESS, version: 2018, R1) to obtain, also restricted the generalization capability of their model in predicting kp for new monomers. Recently, Shi et al. developed a quantitative structure–property relationship model based on density functional theory and machine learning regression analysis to predict Arrhenius parameters and kp, achieving high accuracy [14]. However, this machine learning approach still requires complex quantum chemical calculations, which are time consuming. Therefore, it is desirable to develop a generalized, accurate, and computationally efficient machine learning model for the prediction of kp.
Herein, we propose a machine learning model based on the molecular structure of monomers to predict kp values and Arrhenius parameters. Firstly, the PubChem database was utilized to obtain the Simplified Molecular-Input Line-Entry System (SMILES) representations of various monomers [15]. Subsequently, the SMILES was converted into both Molecular ACCess System (MACCS) fingerprints and Molecular Transformer embeddings [16,17], and these two types of fingerprints were then combined as the input features. The features were used to predict kp using a Lasso regression model with a regularization term [18], which enabled automated feature selection of more influential variables while also preventing overfitting. To validate the generalizability of our model, monomers from external datasets were tested using this model. We demonstrated that by simply using molecular fingerprints derived from 2D molecular graphs (or the equivalent SMILES representation), the propagation rate coefficients can be predicted for (meth)acrylate and butadiene monomers with high accuracy, which can strongly contribute to the simulation of polymerization reactions and design of polymerization systems.

2. Results and Discussion

The fitting performance of the four regression models was first evaluated on the training set of 41 monomers under the same standards as shown in Table S1. Furthermore, the predictive capability of the trained models was compared on monomers outside the training set to select the models with the strongest generalization ability. Finally, reasonable predictions were also made for the Arrhenius pre-exponential factor A and activation energy EA to predict the kp values of new monomers at different temperatures.

2.1. Comparison of Four Regression Models on the Training Dataset

The ln(kp) values at 25 °C were used as the regression target so that the different models could be compared on a common basis. Fitting the natural logarithm of the rate coefficient, rather than the raw kp, guarantees that the back-transformed predictions are positive and thus prevents the models from predicting unphysical negative kp values. This approach also enhanced the statistical validity of the regression analysis, resulting in more robust and reliable models.
As shown in Figure 1, the fitting results of the four regression models all exhibited high quality, with coefficient of determination (R2, Equation (S1)) values exceeding 0.9978 and root-mean-square errors (RMSE, Equation (S2)) less than 0.1000. This demonstrates the excellent fitting performance of these regression models on the training dataset and also indicates that the feature transformation process effectively preserved the majority of the structural information.
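The two fit metrics quoted above can be computed as in the following minimal sketch. It assumes the standard definitions of R2 and RMSE (Equations (S1) and (S2) are not reproduced in this text), and the ln(kp) values are hypothetical placeholders rather than data from this work.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root-mean-square error between predictions and reference values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Hypothetical ln(kp) values, for illustration only (not data from the paper)
ln_kp_true = [5.8, 6.5, 9.6, 10.0]
ln_kp_pred = [5.9, 6.4, 9.5, 10.1]

r2_value = r_squared(ln_kp_true, ln_kp_pred)
rmse_value = rmse(ln_kp_true, ln_kp_pred)
print(f"R2 = {r2_value:.4f}, RMSE = {rmse_value:.4f}")
```

Because both metrics are evaluated on ln(kp), an RMSE of 0.1 corresponds to a typical multiplicative error of about exp(0.1) ≈ 1.1 in kp itself.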
Figure 2 presents the fitting performance using the features selected by Reydt et al. [13], including molecular weight, polarity, and boiling point. Figure 2a shows the results for the same 41 monomers, while Figure 2b displays their fitting on a training set comprising (meth)acrylates, styrene, and acrylonitrile, after excluding monomers that could not be adequately described by their feature set. In the latter case, their model achieved an R2 of 0.9855 and an RMSE of 0.2269, still inferior to any of the four regression models presented here.
To provide a more intuitive comparison of the prediction errors across the four models, the absolute percentage error (APE, Equation (S3)) between the predicted and actual kp values was also examined.
As depicted in Figure 3, consistent with the results evaluated using R2 or RMSE, the multivariate linear regression model exhibited the most favorable performance, with over 70% of the monomers having APEs below 3%, further corroborating the efficacy of the selected features. In contrast, the ridge regression model showed the poorest performance, with nearly 50% of the monomers having APEs greater than 6%, indicating that the addition of the regularization penalty in ridge regression was not well suited for this training set. The Bayesian ridge regression, which combined the characteristics of ridge and Lasso regressions, yielded slightly inferior fitting results on the training set compared to Lasso regression, suggesting that the regularization term in Lasso regression was better able to capture the inherent relationship between the features and ln(kp). However, given the extremely limited data points, the superior performance on the training set may be attributed to overfitting, which could lead to poor predictive capabilities for new monomers. Therefore, the four models were evaluated on a separate test dataset of new monomers.

2.2. Construction of the External Test Dataset

In order to test the generalizability of the model, we obtained additional data points from the literature outside the training set. As shown in Table 1, the kinetic data for dodecyl acrylate, tridecyl acrylate, tert-butyl methacrylate, and chloroprene were obtained from four different experimental laboratories to reduce the overall impact of experimental errors associated with PLP-SEC measurements on the evaluation of the model [19,20,21,22]. Notably, the data reported for tridecyl acrylate were actually obtained from a sample of tridecyl N acrylate, which is a combination of isomers of tridecyl acrylate with partially esterified side chains [20]. In contrast, the literature reported that tridecyl A acrylate represents a distinct set of tridecyl acrylate isomers. The kp value of tridecyl N acrylate is larger than that of dodecyl acrylate, whereas the kp value of tridecyl A acrylate is smaller than that of dodecyl acrylate. Based on the general trend that longer acrylate ester side chains lead to larger kp values [4], we inferred that the tridecyl N acrylate monomer had a relatively lower degree of side chain branching, and thus used its kp value as the reference for tridecyl acrylate (TDA). The plausibility of the model predictions was then interpreted using the established principles of FRP.

2.3. Comparison of Four Regression Models on the Test Dataset

As shown in Figure 4, mean absolute percentage error (MAPE, Equation (S4)) was used to evaluate the overall predictive capability of the models on the four new monomers, while the individual APE values for each of the four monomers could detect the predictive capability of the models on different types of monomers.
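As a minimal sketch of how these error measures can be evaluated, the snippet below back-transforms predicted ln(kp) values to kp and then computes the APE per monomer and the MAPE over the set. It assumes the usual definitions (Equations (S3) and (S4) are not reproduced in this text), and all numbers are hypothetical.

```python
import math

def ape(k_exp, k_pred):
    """Absolute percentage error (%) for one monomer."""
    return abs(k_pred - k_exp) / k_exp * 100.0

def mape(k_exps, k_preds):
    """Mean of the per-monomer absolute percentage errors (%)."""
    errors = [ape(e, p) for e, p in zip(k_exps, k_preds)]
    return sum(errors) / len(errors)

# Hypothetical experimental kp values and model outputs in ln(kp) space
k_exp = [100.0, 200.0, 400.0, 1000.0]
ln_k_pred = [math.log(105.0), math.log(190.0), math.log(420.0), math.log(1000.0)]

# The models predict ln(kp); errors are evaluated on kp after exponentiation
k_pred = [math.exp(v) for v in ln_k_pred]

apes = [ape(e, p) for e, p in zip(k_exp, k_pred)]
overall_mape = mape(k_exp, k_pred)
print([round(a, 2) for a in apes], round(overall_mape, 2))
```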
As depicted in Figure 4c,d, not only did ridge regression and Bayesian ridge regression exhibit poor performance on the training set, but they also showed relatively inferior performance on the four new monomers, with MAPE values of 30.00% and 30.83%, respectively. The rationale behind this is that the regularization method employed in ridge regression is more appropriate for situations characterized by multicollinearity among the predictor variables. Conversely, if the feature set comprises orthogonal predictor variables, ridge regression may not demonstrate optimal performance. This suggests that the sub-structural features of monomers have independent impacts on the final kp values in FRP.
Additionally, the simplest model, direct multivariate linear regression, also gave acceptable results, with a MAPE of only 11.60% on the test dataset. This indicates that the input features were sufficient to represent the monomer sub-structures that influence the chain propagation rate. However, the multivariate linear regression model failed to reproduce the established trend that longer acrylate ester side chains lead to larger kp values [4]. For example, at 25 °C, the predicted kp values for ethyl acrylate and propyl acrylate were 21,713 and 13,007 L mol−1 s−1, respectively, which is the reverse of the expected order. This suggests that the absence of a regularization penalty may have caused a certain degree of overfitting on the training set.
Lasso regression showed the best test results, with a MAPE of only 5.49% for the four new monomers, and each monomer’s prediction bias was less than 8%. The rationale is that Lasso regression minimizes the sum of the absolute values of the regression coefficients, enabling it to shrink the coefficients of secondary variables to exactly zero while retaining the primary variables that influence the kp values. Consequently, during the training process, the Lasso regression model can automatically select the key sub-structural features of the monomers that impact the kp values, such as methyl substitutions on carbon–carbon double bonds, halogen substituents, and ester side chain lengths.
Overall, with the SMILES representations of the monomers as the initial features and proper feature engineering, Lasso regression outperforms any current ab initio calculations and the machine learning model of Reydt et al. [13], which used partial monomer properties as features, in predicting kp.

2.4. Reflection of Scientific Principles

In the field of FRP, two well-established empirical trends provide a check on the model. The first states that for linear alkyl (meth)acrylates, the kp gradually increases as the number of carbon atoms in the ester side chain increases [4,20]. The second states that for (meth)acrylate esters with the same side chain, the kp of the acrylate ester is about two orders of magnitude higher than that of the corresponding methacrylate ester [13]. To check against these two trends, the optimal Lasso regression model was employed to predict the kp values of several new monomers, and the results are presented in Table 2.
The predictions show that as the number of side-chain carbon atoms increases from 12 to 15 (dodecyl acrylate, tridecyl acrylate, tetradecyl acrylate, and pentadecyl acrylate), the kp values indeed gradually increase from 17,682 to 22,635 L mol−1 s−1. Furthermore, the kp value of tetradecyl acrylate (22,034 L mol−1 s−1) is roughly 36 times that of tetradecyl methacrylate (611 L mol−1 s−1), i.e., more than one order of magnitude higher, in qualitative agreement with the second trend. The consistency of the predictions with these trends demonstrates the significant potential of our model in accurately and reliably predicting the kp values of new monomers.

2.5. Predictions of kp at Multiple Temperatures, A, and EA

Following the procedure used to obtain Arrhenius parameters from PLP-SEC experiments, the kp values of each monomer were predicted at several temperatures and then fitted linearly to yield A and EA. Specifically, the kp values at 15 °C, 25 °C, 35 °C, 45 °C, 55 °C, and 65 °C were used as labels to train six separate models, all of which fitted well, as shown in Figure S9. The R2 values for these models were all above 0.9980, and the RMSEs were all below 0.0800. Subsequently, the A and EA parameters were obtained by a linear fit of the kp values at the six temperatures according to the Arrhenius equation (ln(kp) versus 1/T), and the results for the training set are presented in Figure 5.
Because A and EA are derived from the combined predictions of six models rather than from a single optimized model, their errors are marginally larger than those of the individual kp predictions. However, the results are still far superior to those reported by Reydt et al. [13]: our EA (R2 = 0.9932, RMSE = 0.3961) and ln(A) (R2 = 0.9714, RMSE = 0.1601) outperform their EA (R2 = 0.9630, RMSE = 0.8230) and ln(A) (R2 = 0.6660, RMSE = 0.4270). This further demonstrates the effectiveness of our model in directly using SMILES structures for prediction.
As shown in Table 3, the predicted Arrhenius parameters EA and A for the four new monomers on the test dataset are also provided. First, the six well-fitted models obtained from the training set were employed to predict the kp values of the four new monomers at the six temperatures. Subsequently, the EA and A parameters were derived by linearly fitting the predicted kp values according to the Arrhenius equation.
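The linear Arrhenius fit described here can be sketched as an ordinary least-squares regression of ln(kp) against 1/T. The example below is illustrative only: the function name `fit_arrhenius` and the values of `A_true` and `EA_true` are assumptions made for this sketch, and the kp values are generated synthetically rather than taken from the six trained models.

```python
import math

R = 8.314462618  # universal gas constant, J mol^-1 K^-1

def fit_arrhenius(temps_celsius, kps):
    """Least-squares fit of ln(kp) = ln(A) - EA / (R T).

    Returns (A, EA) with EA in J mol^-1.
    """
    xs = [1.0 / (t + 273.15) for t in temps_celsius]  # 1/T in K^-1
    ys = [math.log(k) for k in kps]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx                      # equals -EA / R
    intercept = y_mean - slope * x_mean    # equals ln(A)
    return math.exp(intercept), -slope * R

# Hypothetical Arrhenius parameters used only to generate synthetic kp values
A_true, EA_true = 1.0e7, 20000.0
temps = [15, 25, 35, 45, 55, 65]  # the six temperatures used in the text, in deg C
kps = [A_true * math.exp(-EA_true / (R * (t + 273.15))) for t in temps]

A_fit, EA_fit = fit_arrhenius(temps, kps)
print(f"A = {A_fit:.3e} L mol^-1 s^-1, EA = {EA_fit / 1000:.2f} kJ mol^-1")
```

With noise-free synthetic input the fit recovers the generating parameters; with model-predicted kp values, the scatter of the six points about the line determines how well A and EA are constrained.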
The errors between the predicted and experimental values are illustrated in Figure 6. For the EA predictions, the MAPE for the four new monomers was only 11.37%, and the APE was below 18% for all monomers. This further demonstrates the high accuracy and robustness of the model in predicting kp values. In contrast, the prediction error for the A values was relatively larger, with a MAPE of 59.48%.
However, it is important to note that the PLP-SEC experiments used to measure kp values inherently have an experimental uncertainty of 10–15% [23]. While this error is orthogonal to model prediction errors (our model is trained to predict the single-point experimental kp values without considering these experimental uncertainties), it is reasonable to conclude that an error level of 10–15% in kp values is practically acceptable. The A values are derived from fits of kp values across multiple temperatures, and in this fitting process, A values are very sensitive to small changes in EA values. Considering this, the prediction error for the A parameter is deemed acceptable in the context of FRP modeling.
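The sensitivity of A to EA can be made explicit with a short worked estimate. This is an illustrative sketch rather than a result of this work: it assumes that the predicted kp near the middle of the fitted temperature range stays fixed while EA shifts by ΔEA, and it takes T = 313 K as an assumed representative temperature.

```latex
\begin{align}
\ln A &= \ln k_p(T) + \frac{E_A}{RT}
\quad\Longrightarrow\quad
\Delta \ln A \approx \frac{\Delta E_A}{RT} \\
RT &\approx 8.314\ \mathrm{J\,mol^{-1}\,K^{-1}} \times 313\ \mathrm{K}
   \approx 2.60\ \mathrm{kJ\,mol^{-1}} \\
\Delta E_A &= 1\ \mathrm{kJ\,mol^{-1}}
\;\Rightarrow\;
\Delta \ln A \approx 0.38,
\qquad
\frac{A'}{A} = e^{0.38} \approx 1.47
\end{align}
```

Under these assumptions, a shift of only 1 kJ mol−1 in EA changes A by roughly 47%, which is consistent with the prediction error for A being much larger than that for EA.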
Although our model is capable of providing reliable kp value predictions for new (meth)acrylate and butadiene FRP monomers, limitations still exist. The accuracy of the kp predictions may decline when certain sub-structural features of monomers are absent from the training set. Therefore, to enhance the model’s generalization capability, it is essential to expand the dataset to include a broader variety of FRP monomers. Furthermore, due to the inherent uncertainty in the PLP-SEC experimental data, although a small APE is exhibited by the predicted results compared to the PLP-SEC data, the APE in relation to the objectively true kp values may be larger.

3. Methods

3.1. Construction of the Training Dataset

The dataset of kinetic parameters for the FRP monomers used to train our model is presented in Tables S1 and S2. These data were curated by Reydt et al. [13]. The majority of the data originate from the benchmark datasets recognized by IUPAC, while the remaining data come from individual laboratory experiments that are also considered reliable under IUPAC standards. The kp values at different temperatures were calculated from the Arrhenius equation:
$$k_p = A \exp\left(-\frac{E_A}{RT}\right) \tag{1}$$
where A is the pre-exponential factor, EA is the activation energy for propagation reactions, R is the universal gas constant, and T is the absolute temperature.
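For reference, taking the natural logarithm of the Arrhenius equation gives the linear form that underlies any straight-line fit of rate data against inverse temperature. This is a standard rearrangement, shown here as a convenience; T must be expressed in kelvin (for example, 25 °C corresponds to 298.15 K).

```latex
\ln k_p = \ln A - \frac{E_A}{R}\cdot\frac{1}{T}
```

A plot of ln(kp) against 1/T therefore has slope −EA/R and intercept ln(A).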

3.2. Feature Representation

3.2.1. Scientific Understanding

Reydt et al. achieved satisfactory model fitting results in their development of machine learning models for the prediction of kp by using intrinsic properties of the monomers as features, such as molecular weight, dipole moment, boiling point, melting point, and dissociation constant [13]. Specifically, molecular weight can indirectly influence kp by reflecting changes in the length of the ester side chains in (meth)acrylate monomers. Dipole moment captures differences in polarity, which can directly impact both the reactivity in free-radical reactions and the solvation environment during bulk polymerization, thereby affecting the rate of propagation. The other features serve as complementary descriptors of the monomers’ molecular properties.
However, even though they identified the key molecular properties of monomers that influence the propagation rate, it remains challenging to provide a comprehensive representation of the molecular properties affecting kp. Other potentially influential molecular characteristics, such as the spatial environment of double bonds, may have been overlooked as features in the model due to the difficulty in their quantitative description. Furthermore, some properties affecting kp that are not well understood could not be incorporated. In addition, when applying the model to predict the kp of new monomers, the physical properties might be unknown (e.g., melting point), which limits the generalizability of this approach.
Since molecular properties are fundamentally determined by molecular structure, utilizing the complete molecular structure of the monomers as features for the machine learning model could resolve the issue of incomplete statistical representation of molecular properties. Herein, we used the SMILES representation of the monomer molecules as the initial features, which were then transformed into an encoding format suitable for machine learning to construct an accurate and generalizable model.

3.2.2. SMILES and MACCS Fingerprints

The SMILES, based on the principles of molecular graph theory, provides a standardized representation of molecular structures, capturing information about atoms, bonds, aromaticity, stereochemistry, and other molecular features [15]. This makes the SMILES the raw input for our model. MACCS fingerprints and Molecular Transformer embeddings are two widely adopted methods for converting SMILES representations into molecular features.
MACCS fingerprints are binary vectors of 166 bits, where each bit represents the presence (1) or absence (0) of a predefined structural fragment within the molecule [17]. However, while MACCS fingerprints capture the main atomic, bond, and functional group information of a molecule, using this fingerprint alone may not be sufficient to represent the complete structural details of FRP monomers. For instance, dodecyl acrylate and tridecyl acrylate have the same structure, except for a single carbon atom difference in the ester side chain. As shown in Table S4, a model using only MACCS fingerprints as input provided the same predicted results for these two monomers, indicating they possess completely identical sets of the 166 structural fragments and therefore have the same fingerprints.
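The loss of chain-length information in a presence/absence fingerprint can be illustrated with a deliberately simplified toy. The fragment list below is hypothetical and is not the MACCS key set, and plain substring matching on SMILES strings is only a stand-in for real substructure matching; the sketch merely shows why two homologues can map to identical bit vectors.

```python
# Hypothetical toy fragments; NOT the actual 166 MACCS key definitions
TOY_FRAGMENTS = ["C=C", "C(=O)O", "Cl", "N", "c1ccccc1", "CCCC"]

def presence_fingerprint(smiles):
    """Return 1/0 per fragment depending on whether it occurs in the string.

    Substring search is a crude stand-in for substructure matching and is
    used here only to keep the example dependency-free.
    """
    return [1 if fragment in smiles else 0 for fragment in TOY_FRAGMENTS]

# SMILES built from an acrylate head group plus a linear alkyl chain
dodecyl_acrylate = "C=CC(=O)O" + "C" * 12
tridecyl_acrylate = "C=CC(=O)O" + "C" * 13

fp_dodecyl = presence_fingerprint(dodecyl_acrylate)
fp_tridecyl = presence_fingerprint(tridecyl_acrylate)

print(fp_dodecyl)                 # same bits for both monomers
print(fp_dodecyl == fp_tridecyl)  # the extra CH2 group is invisible to the bits
```

Any regression model that receives only such bits must return the same prediction for both monomers, which is why a representation that is sensitive to sequence length is needed as a complement.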

3.2.3. Molecular Transformer Embeddings

SMILES representations of monomers were simultaneously converted into Molecular Transformer embeddings as a complement to MACCS fingerprints, to maximize the retention of molecular structural information [16]. The Molecular Transformer is a deep learning model that can generate embeddings (vector representations) from the SMILES strings of molecules, capturing their structural information. It is based on the Transformer architecture, consisting of an encoder and a decoder, and here we utilize only the encoder to obtain the molecular embeddings.
The conversion process involves tokenization, positional encoding, an embedding layer, a Transformer encoder, layer normalization, and output aggregation. During these processes, the positional encoding and the attention mechanism of the Transformer itself play an important role in preserving the structural information of small molecules [24]. The positional encoding can reflect the relative positional information of the atoms in the SMILES representation, which better captures the information of the cyclic structure and the length of side chains of the molecules. Furthermore, the global perception ability brought by the attention mechanism allows each atom or functional group to focus on any other element in the SMILES sequence, which can better represent the resonance effects, conjugation effects, and the overall structure of the molecules.
However, reliable Molecular Transformer embeddings for molecular structures require extensive training to determine the optimal settings of the Transformer architecture parameters. As shown in Figure S1, Morris et al. collected 8,300,000 molecular SMILES strings and IUPAC names from PubChem to train the Transformer model [16]. They have also made the pre-trained Transformer models publicly available, and we used these models to obtain embeddings for the 41 FRP monomers in our dataset. Each resulting embedding is a 2D matrix in which the number of rows equals the length of the SMILES string and the number of columns is 512. Each matrix was then averaged over its rows to produce a single 512-dimensional vector, which was used as the input for regression models.
Ultimately, the MACCS fingerprint vector and the averaged 512-dimensional Molecular Transformer embedding of each of the 41 monomers were concatenated to form the feature input for the regression models. As illustrated in Supplementary Figure S2, SMILES strings were transformed into MACCS fingerprints, Molecular Transformer embeddings, and the combination of the two, and each of these three encodings was used to train a separate predictive model. The results (Figures S4–S6) demonstrate that integrating the two encoding approaches best preserves structural information during the SMILES-to-encoding conversion.
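The pooling and concatenation steps can be sketched as follows. The helper names `mean_pool` and `build_features` are hypothetical, and the toy input uses 3 tokens, 4 embedding dimensions, and 5 fingerprint bits in place of the actual SMILES length, 512 dimensions, and 166 bits.

```python
def mean_pool(token_embeddings):
    """Average a (tokens x dim) matrix over its rows, giving a vector of length dim."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(row[j] for row in token_embeddings) / n_tokens for j in range(dim)]

def build_features(fingerprint_bits, token_embeddings):
    """Concatenate the fingerprint bits with the mean-pooled embedding."""
    return [float(b) for b in fingerprint_bits] + mean_pool(token_embeddings)

# Toy stand-ins for one monomer (values are arbitrary)
toy_embedding = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
toy_bits = [1, 0, 0, 1, 1]

pooled = mean_pool(toy_embedding)
features = build_features(toy_bits, toy_embedding)
print(pooled)
print(len(features))
```

Averaging over rows makes the feature length independent of the SMILES length, so every monomer yields a vector of the same size regardless of how many tokens its SMILES string contains.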

3.3. Algorithms of Regression Models

An appropriate machine learning model needs to be employed to identify the underlying relationship between these sub-structural features and the corresponding kp values. Our approach to predicting the kp values of new monomers is based on the statistical correlation between structure and property. We initially explored some complex machine learning algorithms, such as XGBoost (version: 2.0.3) and LightGBM (version: 4.4.0) [25,26], to fit the training data. However, as shown in Figure S6, due to the limited dataset size of only 41 data points, these sophisticated models were unable to effectively learn the inherent relationship between the structures and the kp values, resulting in poor fitting performance.
Therefore, it is hypothesized that a simpler multivariate linear regression method may perform better on this small dataset, as it may be able to capture the essential features of the structure–property relationship more robustly. We have compared the performance of several regression methods (Figure S7), including multivariate linear regression [13], Lasso regression [18], ridge regression [13], and Bayesian ridge regression [27].

3.3.1. Multivariate Linear Regression

Multivariate linear regression is the most straightforward and direct method, which can be expressed using the following equation:
$$y = \beta_0 + \sum_{i=1}^{n} \beta_i x_i \tag{2}$$
where y is the dependent variable, xi are the n independent variables, β0 is the intercept, and βi are the regression coefficients. Linear regression employs the method of least squares to estimate the values of β, thereby minimizing the sum of squared residuals between the predicted and true values. However, because the model contains no regularization term, it may overfit when the number of features is large relative to the number of samples, potentially resulting in poor predictive performance on new monomers.

3.3.2. Ridge Regression and Lasso Regression

Ridge regression builds upon the foundation of multivariate linear regression by introducing a regularization penalty term in the loss function to prevent overfitting. The loss function can be expressed as follows:
$$L = \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \tag{3}$$
where X is the feature matrix of independent variables, and the first term $\lVert y - X\beta \rVert_2^2$ represents the sum of squared residuals between the predicted and actual values, which quantifies the model’s fit error. The second term $\lambda \lVert \beta \rVert_2^2$ is the product of the regularization parameter λ and the sum of squared regression coefficients. The parameter λ can be tuned to control the model complexity, thereby balancing the trade-off between variance and bias. Ridge regression is well suited for situations where there is significant multicollinearity among the predictor variables.
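For completeness, the ridge loss has a well-known closed-form minimizer, shown below as a standard textbook result rather than a statement about the implementation used in this work. The intercept is omitted for brevity, which assumes centered features and targets.

```latex
\hat{\beta}_{\mathrm{ridge}} = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} y
```

For λ = 0 this reduces to the ordinary least-squares estimate whenever X⊤X is invertible. When the number of features exceeds the number of samples, X⊤X is singular and the least-squares solution is not unique, whereas any λ > 0 makes the matrix invertible and the ridge solution unique.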
Lasso regression is similar to ridge regression, but the regularization term in the loss function is changed to the sum of the absolute values of the regression coefficients, which can be expressed as the following:
$$L = \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \tag{4}$$
This form of regularization can cause some of the regression coefficients to be precisely shrunk to zero, effectively performing feature selection. Therefore, Lasso regression can automatically select more influential features during the training process and discard the irrelevant ones. Lasso regression ultimately yields a sparse model, which facilitates the interpretation of individual feature influences on the target variable. This renders Lasso regression particularly well suited for scenarios involving a large number of features but a relatively small sample size.
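The way the L1 penalty drives coefficients exactly to zero can be seen in a minimal coordinate-descent sketch that minimizes the loss written above (squared residuals plus λ times the sum of absolute coefficients, without an intercept). This is an illustrative implementation under those assumptions, not the solver used in this work, and the data are hypothetical. Library implementations may scale the loss differently; scikit-learn, for instance, divides the squared-error term by twice the number of samples, so its `alpha` is not numerically equal to λ here.

```python
def soft_threshold(rho, threshold):
    """Shrink rho toward zero by threshold; return exactly 0.0 inside the band."""
    if rho > threshold:
        return rho - threshold
    if rho < -threshold:
        return rho + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimize ||y - X b||^2 + lam * sum(|b_j|) by cyclic coordinate descent."""
    n_samples, n_features = len(X), len(X[0])
    beta = [0.0] * n_features
    for _ in range(n_iter):
        for j in range(n_features):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n_samples):
                pred_without_j = sum(
                    X[i][k] * beta[k] for k in range(n_features) if k != j
                )
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            # Setting the subgradient to zero gives a threshold of lam / 2
            beta[j] = soft_threshold(rho, lam / 2.0) / z if z > 0 else 0.0
    return beta

# Hypothetical data: y is driven by feature 0, with a weak trace of feature 1
X = [[1, 1, 1], [1, -1, 1], [-1, 1, 1], [-1, -1, 1]]
y = [2.1, 1.9, -1.9, -2.1]

beta = lasso_coordinate_descent(X, y, lam=2.0)
print(beta)
```

In this toy case the dominant coefficient is retained (shrunk from 2 to 1.75), while the two weak or irrelevant coefficients are set exactly to zero, which is the feature-selection behavior described in the text.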

3.3.3. Bayesian Ridge Regression

Bayesian ridge regression is a probabilistic model that extends the concept of ridge regression by incorporating Bayesian principles [27]. It provides a probabilistic approach to estimating regression coefficients, assuming a prior distribution of the coefficients and then deriving their posterior distribution after observing the data. It first assumes that the noise term β0 in Equation (2) follows a normal distribution with a mean of 0 and a variance of σ², while the regression coefficients β are assumed to have a Gaussian prior distribution. Subsequently, it is assumed that the noise precision α, which is the inverse of σ², follows a Gamma prior distribution. Then, using Bayes’ theorem, the posterior distribution of the parameters is calculated based on the prior distributions and the likelihood function of the observed data:
$$p(\beta, \alpha \mid X, y) \propto p(y \mid X, \beta, \alpha)\, p(\beta \mid \alpha)\, p(\alpha)$$
where $p(y \mid X, \beta, \alpha)$ is the likelihood function of the data, $p(\beta \mid \alpha)$ is the prior distribution of the coefficients, and $p(\alpha)$ is the prior distribution of the noise precision. Therefore, Bayesian ridge regression provides a probabilistic interpretation of model parameters, while also enabling automatic feature selection. This imbues the method with the benefits of ridge regression’s suitability for scenarios involving significant multicollinearity among predictor variables, as well as the feature selection capability of Lasso regression.
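
The connection to ridge regression can be made explicit with a short derivation. As a sketch only, assume that the hyperparameters are held fixed, that the likelihood is Gaussian with noise precision α, and that the coefficient prior is a zero-mean isotropic Gaussian with precision τ. The symbol τ is introduced here for illustration and does not appear in the original text.

```latex
p(y \mid X, \beta, \alpha) = \mathcal{N}\!\left(y \mid X\beta,\ \alpha^{-1} I\right), \qquad
p(\beta \mid \tau) = \mathcal{N}\!\left(\beta \mid 0,\ \tau^{-1} I\right)

p(\beta \mid X, y, \alpha, \tau) = \mathcal{N}(\beta \mid \mu, \Sigma), \qquad
\Sigma = \left(\alpha X^{\top} X + \tau I\right)^{-1}, \qquad
\mu = \alpha \, \Sigma X^{\top} y = \left(X^{\top} X + \tfrac{\tau}{\alpha} I\right)^{-1} X^{\top} y
```

Under these assumptions the posterior mean coincides with the ridge solution for a regularization parameter λ = τ/α, so the Bayesian treatment amounts to inferring the penalty strength from the data rather than tuning it by cross-validation.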

3.4. Validation Methods

Van de Reydt et al. have validated the effectiveness of leave-one-out cross-validation (LOOCV, Figure S8) in the analysis of regression model errors [13]. Therefore, the performance of the models on the training sets was also evaluated using this method. Specifically, multivariate linear regression and Bayesian ridge regression used only a single outer LOOCV, while ridge and Lasso regressions employed both outer and inner LOOCV. The outer LOOCV involved n fittings, each time using n − 1 monomers as the training data and the remaining one as the test data. The inner LOOCV was performed within the outer LOOCV framework, with n(n − 1) fittings to determine the optimal regularization parameter λ. Ultimately, the n model fittings from the outer LOOCV produced n predictions for each monomer in the training set. The average results of the n models in LOOCV were compared to the experimental values to evaluate the overall fitting performance, thus avoiding the influence of random errors from any single model. Similarly, when predicting kp for new monomers, the average prediction of the n models was compared to the experimental data.
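
The nested procedure can be sketched in a few lines of plain Python. This is an illustrative outline rather than the authors' implementation: it assumes a hypothetical one-feature ridge model without intercept, a toy dataset, and an arbitrary grid of candidate λ values, and it counts the inner train/validation splits, each of which is fitted once per candidate λ.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((y - x*b)^2) + lam * b^2 for one feature."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def leave_one_out(n):
    """Yield (train_indices, held_out_index) for each of the n LOOCV splits."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i


def nested_loocv(xs, ys, lambdas):
    outer_preds, outer_models, inner_splits = [], [], 0
    for train, test in leave_one_out(len(xs)):
        tx = [xs[j] for j in train]
        ty = [ys[j] for j in train]
        # Inner LOOCV on the n - 1 training points selects lambda.
        inner_err = {lam: 0.0 for lam in lambdas}
        for in_train, in_val in leave_one_out(len(tx)):
            inner_splits += 1
            ix = [tx[j] for j in in_train]
            iy = [ty[j] for j in in_train]
            for lam in lambdas:
                b = fit_ridge_1d(ix, iy, lam)
                inner_err[lam] += (ty[in_val] - b * tx[in_val]) ** 2
        best_lam = min(lambdas, key=lambda lam: inner_err[lam])
        # Refit on all n - 1 training points, then predict the held-out point.
        b = fit_ridge_1d(tx, ty, best_lam)
        outer_models.append(b)
        outer_preds.append(b * xs[test])
    return outer_preds, outer_models, inner_splits


# Toy data (hypothetical), roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
lambdas = [0.0, 0.1, 1.0]

preds, models, inner_splits = nested_loocv(xs, ys, lambdas)

# A new sample is predicted by averaging the n outer models.
x_new = 6.0
ensemble_pred = sum(b * x_new for b in models) / len(models)
print(len(models), inner_splits, ensemble_pred)
```

With n = 5 samples the sketch produces 5 outer models and 5 × 4 = 20 inner splits, matching the n and n(n − 1) counts given above. Because λ is chosen using only the n − 1 training points of each outer split, the held-out monomer never influences its own prediction.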

4. Conclusions

In summary, by using solely the SMILES representations of FRP monomers as input, we have developed a reliable and robust Lasso regression model that provides accurate predictions of kp values. This model exhibits strong generalization capabilities, eliminating the need for monomer physical properties during kp value predictions. The features in the model provide an accurate description of monomer structural information, enabling reasonable predictions as long as the sub-structural units (atoms, bonds, functional groups) of the new monomers have been encountered in the training set. The Lasso regression model achieves a high R² of 0.9985 on the training set and a MAPE of only 5.49% in the predictions for the four new monomers, significantly outperforming quantum chemical calculations as well as previously reported machine learning models in accuracy. Furthermore, the influence of the ester side chain length of (meth)acrylates on kp was accurately predicted by this model, aligning well with established scientific knowledge. This high-accuracy and highly generalizable predictive model allows other researchers to simply input monomer information and rapidly obtain reliable kp estimates, thereby accelerating their investigations into FRP mechanisms. Future work could extend this model to other chain-growth systems, such as anionic and cationic polymerization, provided that sufficient kinetic data are collected and that solvents and initiators are incorporated into the features.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/molecules29194694/s1. Tables S1–S3: Training dataset. Equations (S1)–(S4): Definitions of statistical parameters. Figure S1: Diagram of Molecular Transformer embeddings. Figure S2: Diagram of SMILES processing. Figures S3–S5 and Table S4: Comparison of MACCS fingerprints, Molecular Transformer embeddings, and their combination. Figure S6: Fitting results of XGBoost and LightGBM. Figure S7: Flowchart of algorithms. Figure S8: Diagram of LOOCV. Figure S9: Fitting results at various temperatures. Tables S5–S10: Detailed results for Figure 1, Figure 4 and Figure 5. Reference [28] is cited in the Supplementary Materials.

Author Contributions

Conceptualization, Y.W. and H.G.; methodology, Y.W. and H.Z.; software, Y.W. and H.Z.; validation, Y.W.; data curation, Y.W.; writing—original draft preparation, Y.W. and Y.F.; writing—review and editing, Y.W., Y.F. and H.G.; visualization, Y.W. and Y.F.; supervision, H.G. All authors have read and agreed to the published version of the manuscript.

Funding

Haifan Zhou is financially supported by the Hong Kong Ph.D. Fellowship Scheme (PF22-.78203). Yue Fang is financially supported by the Hong Kong Research Grants Council Early Career Scheme (26214522) and the HKUST Start-Up Grant.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code in this paper is available at https://github.com/jamesymwang/Kp-predict_MACCS-and-Molecular-Transformer (accessed on 6 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Beuermann, S.; Buback, M. Rate coefficients of free-radical polymerization deduced from pulsed laser experiments. Prog. Polym. Sci. 2002, 27, 191–254.
2. Nikitin, A.N.; Lacík, I.; Hutchinson, R.A. A 3D simulation investigation of the influence of temperature increases on the accuracy of propagation rate coefficients determined by Pulsed-Laser Polymerization. Macromolecules 2016, 49, 9320–9335.
3. Heuts, J.P.; Gilbert, R.G.; Radom, L. A priori prediction of propagation rate coefficients in free-radical polymerizations: Propagation of ethylene. Macromolecules 1995, 28, 8771–8781.
4. Kockler, K.B.; Haehnel, A.P.; Junkers, T.; Barner-Kowollik, C. Determining Free-Radical Propagation Rate Coefficients with High-Frequency Lasers: Current Status and Future Perspectives. Macromol. Rapid Commun. 2016, 37, 123–134.
5. Zhou, Y.-N.; Luo, Z.-H. Copper (0)-mediated reversible-deactivation radical polymerization: Kinetics insight and experimental study. Macromolecules 2014, 47, 6218–6229.
6. Barner-Kowollik, C.; Günzler, F.; Junkers, T. Pushing the limit: Pulsed laser polymerization of n-butyl acrylate at 500 Hz. Macromolecules 2008, 41, 8971–8973.
7. Buback, M.; Gilbert, R.G.; Hutchinson, R.A.; Klumperman, B.; Kuchta, F.D.; Manders, B.G.; O’Driscoll, K.F.; Russell, G.T.; Schweer, J. Critically evaluated rate coefficients for free-radical polymerization, 1. Propagation rate coefficient for styrene. Macromol. Chem. Phys. 1995, 196, 3267–3280.
8. Marien, Y.W.; Van Steenberge, P.H.; Barner-Kowollik, C.; Reyniers, M.-F.; Marin, G.B.; D’hooge, D.R. Kinetic Monte Carlo modeling extracts information on chain initiation and termination from complete PLP-SEC traces. Macromolecules 2017, 50, 1371–1385.
9. Beuermann, S.; Harrisson, S.; Hutchinson, R.A.; Junkers, T.; Russell, G.T. Update and critical reanalysis of IUPAC benchmark propagation rate coefficient data. Polym. Chem. 2022, 13, 1891–1900.
10. Beuermann, S.; Buback, M.; Davis, T.P.; Gilbert, R.G.; Hutchinson, R.A.; Kajiwara, A.; Klumperman, B.; Russell, G.T. Critically evaluated rate coefficients for free-radical polymerization, 3. Propagation rate coefficients for alkyl methacrylates. Macromol. Chem. Phys. 2000, 201, 1355–1364.
11. Beuermann, S.; Buback, M.; Davis, T.P.; Gilbert, R.G.; Hutchinson, R.A.; Olaj, O.F.; Russell, G.T.; Schweer, J.; Van Herk, A.M. Critically evaluated rate coefficients for free-radical polymerization, 2. Propagation rate coefficients for methyl methacrylate. Macromol. Chem. Phys. 1997, 198, 1545–1560.
12. Huang, D.M.; Monteiro, M.J.; Gilbert, R.G. A theoretical study of propagation rate coefficients for methacrylonitrile and acrylonitrile. Macromolecules 1998, 31, 5175–5187.
13. Van de Reydt, E.; Marom, N.; Saunderson, J.; Boley, M.; Junkers, T. A Predictive machine-learning model for propagation rate coefficients in radical polymerization. Polym. Chem. 2023, 14, 1622–1629.
14. Shi, Y.; Yu, M.; Liu, J.; Yan, F.; Luo, Z.-H.; Zhou, Y.-N. Quantitative structure–property relationship model for predicting the propagation rate coefficient in free-radical polymerization. Macromolecules 2022, 55, 9397–9410.
15. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
16. Morris, P.; St. Clair, R.; Hahn, W.E.; Barenholtz, E. Predicting binding from screening assays with transformer network embeddings. J. Chem. Inf. Model. 2020, 60, 4191–4199.
17. Durant, J.L.; Leland, B.A.; Henry, D.R.; Nourse, J.G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273–1280.
18. Ranstam, J.; Cook, J.A. LASSO regression. J. Br. Surg. 2018, 105, 1348.
19. Buback, M.; Kurz, C.H.; Schmaltz, C. Pressure dependence of propagation rate coefficients in free-radical homopolymerizations of methyl acrylate and dodecyl acrylate. Macromol. Chem. Phys. 1998, 199, 1721–1727.
20. Haehnel, A.P.; Schneider-Baumann, M.; Arens, L.; Misske, A.M.; Fleischhaker, F.; Barner-Kowollik, C. Global trends for kp? The influence of ester side chain topography in alkyl (meth) acrylates−completing the data base. Macromolecules 2014, 47, 3483–3496.
21. Hutchinson, R.; Aronson, M.; Richards, J. Analysis of pulsed-laser-generated molecular weight distributions for the determination of propagation rate coefficients. Macromolecules 1993, 26, 6410–6415.
22. Pascal, P.; Winnik, M.A.; Napper, D.H.; Gilbert, R.G. Pulsed laser study of the propagation kinetics of tert-butyl methacrylate. Die Makromol. Chem. Rapid Commun. 1993, 14, 213–215.
23. Hutchinson, R.A.; Beuermann, S. Critically evaluated propagation rate coefficients for radical polymerizations: Acrylates and vinyl acetate in bulk (IUPAC Technical Report). Pure Appl. Chem. 2019, 91, 1883–1888.
24. Luong, K.-D.; Singh, A. Application of Transformers in Cheminformatics. J. Chem. Inf. Model. 2024, 64, 4392–4409.
25. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
26. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
27. Tipping, M.E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 2001, 1, 211–244.
28. Cha, G.-W.; Moon, H.J.; Kim, Y.-M.; Hong, W.-H.; Hwang, J.-H.; Park, W.-J.; Kim, Y.-C. Development of a prediction model for demolition waste generation using a random forest algorithm based on small datasets. Int. J. Environ. Res. Public Health 2020, 17, 6997.
Figure 1. Results of the fitting analyses for predicting ln(kp)25°C versus experimental ln(kp)25°C: (a) multivariate linear regression model; (b) Lasso regression model; (c) ridge regression model; (d) Bayesian ridge regression model [13].
Figure 2. Regression models trained by Van de Reydt et al. (a) All monomers (R² = 0.7221, RMSE = 1.0125); (b) (meth)acrylates, styrene, and acrylonitrile (R² = 0.9855, RMSE = 0.2269) [13].
Figure 3. APE distribution of predicted kp25 °C and experimental kp25 °C for (a) multivariate linear regression; (b) Lasso regression; (c) ridge regression; and (d) Bayesian ridge regression.
Figure 4. Predictive results on the test dataset: (a) multivariate linear regression; (b) Lasso regression; (c) ridge regression; (d) Bayesian ridge regression.
Figure 5. Fitting analyses on the training set for (a) EA; (b) ln(A).
Figure 6. APE of predicted and experimental values on the test dataset for (a) EA; (b) A.
Table 1. Test dataset of kp values and Arrhenius parameters for new monomers.
| Monomers | Abbr. | A [L mol−1 s−1] | EA [kJ mol−1] | kp25 °C [L mol−1 s−1] |
|---|---|---|---|---|
| Dodecyl acrylate [19] | DA | 10,900,000 | 15.80 | 18,588 |
| Tridecyl acrylate [20] | TDA | 5,710,000 | 14.08 | 19,489 |
| Tert-butyl methacrylate [22] | t-BMA | 25,100,000 | 27.70 | 352 |
| Chloroprene [21] | CP | 19,500,000 | 26.63 | 421 |
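
As a quick consistency check on Table 1, the tabulated kp values at 25 °C can be recomputed from A and EA. The sketch below assumes the standard Arrhenius relation kp = A·exp(−EA/(RT)); the function name and dictionary layout are illustrative choices, not part of the authors' code.

```python
import math

R = 8.314462618  # molar gas constant, J mol^-1 K^-1


def arrhenius_kp(a, ea_kj_per_mol, temp_c):
    """Return kp [L mol^-1 s^-1] from A [L mol^-1 s^-1], EA [kJ mol^-1], and T [deg C]."""
    temp_k = temp_c + 273.15
    return a * math.exp(-ea_kj_per_mol * 1000.0 / (R * temp_k))


# Values copied from Table 1: (A, EA, tabulated kp at 25 deg C).
table1 = {
    "DA": (10_900_000, 15.80, 18_588),
    "TDA": (5_710_000, 14.08, 19_489),
    "t-BMA": (25_100_000, 27.70, 352),
    "CP": (19_500_000, 26.63, 421),
}

recomputed = {}
for name, (a, ea, kp_tab) in table1.items():
    recomputed[name] = arrhenius_kp(a, ea, 25.0)
    rel_diff = abs(recomputed[name] - kp_tab) / kp_tab * 100.0
    print(f"{name}: recomputed {recomputed[name]:.0f}, tabulated {kp_tab}, diff {rel_diff:.2f}%")
```

The recomputed values agree with the tabulated ones to well within 1%, which suggests the three columns of Table 1 are mutually consistent under this relation.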
Table 2. Predicted kp values using Lasso regression.
| Monomers | kp 25 °C [L mol−1 s−1] |
|---|---|
| Dodecyl acrylate | 17,682 |
| Tridecyl acrylate | 20,333 |
| Tetradecyl acrylate | 22,034 |
| Pentadecyl acrylate | 22,635 |
| Tetradecyl methacrylate | 611 |
Table 3. Predictive results of Arrhenius parameters and kp at multiple temperatures for four new monomers.
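| Monomers | Abbr. | T [°C] | Predicted kp [L mol−1 s−1] | Predicted A [L mol−1 s−1] | Predicted EA [kJ mol−1] |
|---|---|---|---|---|---|
| Dodecyl acrylate | DA | 15 | 14,621 | 5,887,746 | 14.38 |
| Dodecyl acrylate | DA | 25 | 17,682 | | |
| Dodecyl acrylate | DA | 35 | 21,396 | | |
| Dodecyl acrylate | DA | 45 | 25,672 | | |
| Dodecyl acrylate | DA | 55 | 30,192 | | |
| Dodecyl acrylate | DA | 65 | 35,402 | | |
| Tridecyl acrylate | TDA | 15 | 16,627 | 7,247,982 | 14.56 |
| Tridecyl acrylate | TDA | 25 | 20,333 | | |
| Tridecyl acrylate | TDA | 35 | 24,705 | | |
| Tridecyl acrylate | TDA | 45 | 29,663 | | |
| Tridecyl acrylate | TDA | 55 | 34,847 | | |
| Tridecyl acrylate | TDA | 65 | 40,749 | | |
| Tert-butyl methacrylate | t-BMA | 15 | 270 | 3,407,435 | 22.59 |
| Tert-butyl methacrylate | t-BMA | 25 | 381 | | |
| Tert-butyl methacrylate | t-BMA | 35 | 505 | | |
| Tert-butyl methacrylate | t-BMA | 45 | 664 | | |
| Tert-butyl methacrylate | t-BMA | 55 | 857 | | |
| Tert-butyl methacrylate | t-BMA | 65 | 1105 | | |
| Chloroprene | CP | 15 | 317 | 4,181,175 | 22.71 |
| Chloroprene | CP | 25 | 440 | | |
| Chloroprene | CP | 35 | 589 | | |
| Chloroprene | CP | 45 | 780 | | |
| Chloroprene | CP | 55 | 1019 | | |
| Chloroprene | CP | 65 | 1284 | | |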
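
The reported MAPE for the four new monomers can be reproduced from the tabulated numbers. The sketch below takes the experimental kp values at 25 °C from Table 1 and the predicted values at 25 °C from Table 3, and assumes the usual definition of the absolute percentage error, |predicted − experimental| / experimental × 100; the variable names are illustrative.

```python
# Experimental kp at 25 deg C (Table 1) and predicted kp at 25 deg C (Table 3),
# both in L mol^-1 s^-1.
experimental = {"DA": 18588, "TDA": 19489, "t-BMA": 352, "CP": 421}
predicted = {"DA": 17682, "TDA": 20333, "t-BMA": 381, "CP": 440}


def absolute_percentage_error(pred, true):
    """Absolute percentage error of one prediction, in percent."""
    return abs(pred - true) / true * 100.0


apes = {
    name: absolute_percentage_error(predicted[name], experimental[name])
    for name in experimental
}
mape = sum(apes.values()) / len(apes)

for name, value in apes.items():
    print(f"{name}: APE = {value:.2f}%")
print(f"MAPE = {mape:.2f}%")
```

The mean of the four errors rounds to 5.49%, in agreement with the value quoted in the abstract and conclusions. The largest single contribution comes from tert-butyl methacrylate, at roughly 8%.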
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, Y.; Fang, Y.; Zhou, H.; Gao, H. A Machine Learning Model for Predicting the Propagation Rate Coefficient in Free-Radical Polymerization. Molecules 2024, 29, 4694. https://doi.org/10.3390/molecules29194694

