1. Introduction
Esterification is a fundamental chemical reaction that is used in the synthesis of esters from acids and alcohols [
1,
2]. This reaction is important not only in producing various industrial intermediates but also in the manufacture of fragrances and flavours [
3]. The outcome of an esterification reaction is often dictated by the procedural steps that are followed. Precise procedures, encompassing parameters like the choice of solvents, temperature, and duration, are important to achieve the desired product with a high yield and purity [
4,
5]. Such procedures serve as a plan for chemists, guiding them through the process of organic synthesis, ensuring reproducibility, and minimizing the chances of unwanted side products. However, predicting the optimal procedure for a given set of reactants remains a challenge, often requiring iterative experimentation with various chemical reagents (acid-catalyzed Fischer esterification, Steglich esterification, etc.). Recent advancements in deep learning (DL) have demonstrated its potential for modeling chemical properties and reactions [
6,
7,
8,
9]. In addition, a shift in computational research methodologies towards viewing chemistry as a text-to-text task signifies a new perspective in the domain [
10]. By treating chemical reactions as sequences, analogous to sentences in natural language processing, researchers can apply machine learning models originally designed for language translation to predict chemical outcomes. This perspective enables new predictions, optimizations, and innovations in the chemical domain using linguistic models. Large language models (LLMs) pre-trained on vast datasets inherently acquire some capacity to understand and generate chemical text. For generative text-to-text tasks, such models can be adapted to chemical contexts, bridging the gap between language processing and chemistry [
11]. Building upon this, our research aims to develop an in silico methodology to predict accurate procedures for esterification reactions. We utilize a dataset of esterification reactions to fine-tune LLMs and then test the performance of the models. Using the methodology presented in this paper, chemists may reconsider their synthesis strategies, ultimately optimizing their reaction conditions before initiating the actual reaction. This predictive approach promises tangible benefits, such as increased efficiency and substantial savings in terms of time and resources.
2. Related Work
Predicting the optimal procedures for organic reactions, including esterification, is a complex task due to several factors. Firstly, esterification is an umbrella term for a family of foundational reactions in organic chemistry that form esters through diverse processes [
12]. The Fischer esterification, a classical example, involves the transformation of a carboxylic acid and an alcohol with an acid catalyst [
13]. A more contemporary technique is the Steglich esterification, which engages carbodiimides, such as EDAC (1-Ethyl-3-(3-dimethylaminopropyl)carbodiimide), together with the catalyst DMAP (4-Dimethylaminopyridine), to expedite the coupling between carboxylic acids and alcohols [
14]. Additionally, the Mitsunobu reaction, which deploys DEAD (Diethyl azodicarboxylate) along with triphenylphosphine, offers an alternative for the synthesis of esters from primary and secondary alcohols [
15]. The solvent’s role in esterification reactions cannot be understated, dictating aspects like the equilibrium and reaction speed. Familiar solvents, such as DMF (Dimethylformamide), DCM (Dichloromethane), and THF (Tetrahydrofuran), are usually used in these reactions [
16,
17]. Historically, chemists have relied on empirical knowledge, documented procedures, and iterative experimentation to determine the best conditions for a given synthesis [
18]. However, with the increasing complexity of organic molecules containing various functional groups and the need for efficient and sustainable synthesis, there is a growing demand for computational tools that can predict the procedures and parameters of reactions. Several computational methods have been developed to predict reaction conditions, utilizing databases of known reactions. These methods, while promising, often require extensive computational resources and may not always provide accurate predictions for novel or less-studied reactions [
19]. The evolution of computational methods has seen a pivot toward using machine learning and artificial intelligence to interpret reaction datasets. These techniques excel at discerning patterns within the data, thereby possibly improving the predictions for novel reactions. Contrasting with previous approaches, such as the one employing a transformer-based sequence-to-sequence model which attained a BLEU score of 54.7 through text-based representations [
20], our research undertakes a similar predictive task but diverges in several key aspects. We introduce a novel and specific dataset for the esterification reaction, employ an alternative procedural notation, and utilize a distinct linguistic framework. Furthermore, our methodological innovation is in processing only the SMILES representations of the molecules that are implicated in the reactions, focusing the input specifically on the transformative elements of the reactions.
Deep learning (DL) methodologies have shown remarkable success in various chemistry tasks [
21]. Recent advancements in transformer architectures have significantly impacted molecular generation in drug discovery. Noteworthy models such as MolBERT, ChemGPT, T5Chem, MolFormer, Chemformer, and BARTSmiles, which employ NLP techniques, demonstrate this influence [
22,
23,
24,
25,
26,
27]. For example, Chemformer and T5Chem are pre-trained on extensive SMILES strings from ZINC-15 [
28] and Pubchem, respectively, and are fine-tuned for chemical prediction tasks. Additionally, graph transformers, such as RetroExplainer, have been used in retrosynthesis, automating organic chemistry processes through a transparent, deep learning-guided molecular assembly process [
29]. Further broadening the scope of DL methodologies, Galactica has been trained on diverse scientific data, including chemical information, with a character-based approach to SMILES tokenization [
30]. A novel contribution to this field is nach0, an encoder-decoder LLM pre-trained on the scientific literature and molecule strings, and excelling in various chemical and biological tasks with its ability to generate high-quality molecular and textual outputs [
31]. This model stands out for its performance in both single and cross-domain tasks, surpassing existing models in efficiency and output quality. Despite these advancements, specific research on GPT-3.5 and GPT-4 in chemistry applications like reaction prediction and retrosynthesis remains limited [
32]. While GPT models initially lagged behind existing ML baselines, partly due to challenges in interpreting molecular SMILES strings, recent developments in fine-tuned GPT-3.5-turbo models have shown promise. These models outperform traditional transformers, including the T5 and Llama2-13b-chat [
33] models, particularly in extracting action sequences from experimental procedures [
34]. OpenAI’s models, such as davinci-002 and GPT-3.5-turbo, are competent at answering chemistry questions when given reasonable prompts [
35]. Even earlier models, such as GPT-3, have been shown to perform impressively well for a wide range of questions about chemistry [
36]. LLMs have also been used to power ChemCrow, an innovative method that integrates computational tools with chemistry, showcasing its capability in planning syntheses and solving various chemical reasoning tasks, from simple drug discovery to intricate molecular design [
37]. Similarly, a GPT-4 model was utilized to create a multi-LLM agent that can autonomously design, plan, and execute complex scientific experiments [
38].
In this paper, we test different methods that predict the procedure of a reaction, which consists of a sequence of formally described actions, such as Add, Heat, Extract, Crystallize, and parameters that are associated with the actions, such as the temperature, duration, solvents, and catalysts. We use a dataset of 1200 reactions and test the k-nearest neighbors (kNN) algorithm, fine-tuned OpenAI models (GPT-3.5-turbo, davinci-002, babbage-002), and a fine-tuned FLAN-T5 model. The contributions of this research are as follows: (1) the pioneering use of a fine-tuned GPT-3.5-turbo model to predict chemical synthesis procedures for esterification reactions, which encompass an extensive array of actions (28 distinct actions); (2) a distinctive approach in which ancillary compounds are excluded from the model inputs—only reactants and products are provided, and we deliberately omit any non-reaction-specific agents such as gases, solvents, and catalysts, including EDAC or DMAP. The complexity of the task is increased because the model cannot rely exclusively on the input to forecast all the requisite steps and parameters, including the ancillary compounds. Nonetheless, the output is significantly more useful to the researcher because, particularly with novel compounds lacking extensive synthesis documentation, only the reactants and products are typically predetermined before initiating laboratory experiments. Our study conducts a comprehensive comparison between cutting-edge artificial intelligence models and conventional algorithms, setting a standard for subsequent inquiries in this domain. Consequently, our research presents an innovative perspective on reaction planning, highlighting the synergy between LLMs and chemical processes.
3. Formal Definition of Tasks
In this paper, a generative text-to-text problem is solved. Given a source chemical reaction description, r = (r1, r2, …, rn), in SMILES notation (e.g., reactant.reactant >> product), the task is to generate a target procedure description, p = (p1, p2, …, pm), in a formal, machine-readable format that conveys the specific steps, conditions, and parameters for the described reaction. Let R be the space of all possible source reaction descriptions in SMILES notation and P be the space of all possible target procedure descriptions in the formal, machine-readable format. In our case, R is restricted to esterification reactions—a category of synthesis reactions of esters from alcohols and acids. Let Θ be an ML algorithm that can learn a function, ϕ: R → P, which maps a source reaction description to its corresponding target procedure description.
The goal of Θ is to learn an approximation (denoted as ϕ̂) of the function ϕ from a training dataset, DR ⊂ R, where each source reaction description, r, in DR has a corresponding target procedure description, p, in the formal format. The learned function ϕ̂ is evaluated on a separate testing dataset, DT ⊂ R, which consists of reaction descriptions that have not been seen during the training phase. Finally, the model’s performance is evaluated based on how similar the predictions are to the target procedures using an objective evaluation metric.
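The training objective above can be stated compactly in empirical-risk form; the hypothesis space H and the sequence-level loss L below are our own notation, not symbols from the original definition:

```latex
\hat{\phi} \;=\; \operatorname*{arg\,min}_{\phi' \in \mathcal{H}}\;
\frac{1}{|D_R|} \sum_{(r,\, p)\, \in\, D_R} \mathcal{L}\bigl(\phi'(r),\, p\bigr)
```

where H denotes the set of functions reachable by Θ and L is a sequence-level loss, such as token-wise cross-entropy for the generative models considered here.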
4. The Data
The dataset utilized in our experiments is derived from a comprehensive set of synthesis procedures found in USPTO and EPO patents issued between 1971 and 2022 [
39]. The dataset was constructed through a unique methodology proposed in the cited article. This methodology combines machine learning algorithms and scripts to systematically extract and transform experimental procedures from these patents into structured actions. The pipeline involves two primary tasks: firstly, classifying patent paragraphs to accurately identify chemical procedures, and secondly, converting these procedures into a structured format. This dataset differs from the commonly used USPTO-50k [
40] or USPTO-MIT [
41] because it includes both reactants and products in SMILES format along with the synthesis procedures, which are in a simplified and machine-readable format. The second version of the publicly available dataset has been used, because it has been additionally improved by the removal of irregular action sequences. While the primary dataset encompasses various reaction classes, we employed the open-source software DataWarrior 5.5.0 [
42] to isolate only esterification reactions. Esterification reactions were selected for isolation because the raw data include millions of instances, necessitating a focused subset for detailed analysis. Esterification, despite being a common class of chemical reaction, encompasses a variety of subtypes, each with distinct procedural steps, rendering accurate procedure prediction challenging. This choice permits an in-depth exploration of a well-defined reaction type within a manageable dataset size, enabling the development and refinement of ML algorithms for complex, real-world applications. The refined dataset comprises pairs of input and output instances. The inputs are represented as single lines of text, denoting reactants and products in SMILES (Simplified Molecular Input Line Entry System [
43], used in chemistry to represent chemical structures simply and unambiguously) notation, whereas the outputs describe a series of actions and their respective parameters. The actions are limited and represented in a structured and simplified format: a solitary word signifies the action, succeeded by its specific parameters. The actions are:
Add,
CollectLayer,
Concentrate,
Degas,
DrySolid,
DrySolution,
Extract,
Filter,
FollowOtherProcedure,
MakeSolution,
Microwave,
OtherLanguage,
Partition,
PH,
PhaseSeparation,
InvalidAction,
Purify,
Quench,
Recrystallize,
NoAction,
Reflux,
SetTemperature,
Sonicate,
Stir,
Triturate,
Wait,
Wash, and
Yield. The schema employed for action and parameter naming and formatting was initially introduced by Lowe and subsequently refined by IBM, and it currently stands as the most exhaustive for this task. For our application, reactants and products have been tokenized and are denoted by $R1$, $R2$, …, $RN$ for reactants and by $P1$, $P2$, …, $PN$ for products. Such tokenization is efficient, does not require the reaction compounds to be copied over to the resulting procedure, and avoids potential errors in the notation. The input is case-sensitive due to the nature of SMILES notation, whereby aromatic atoms are denoted in lower-case letters, while aliphatic atoms are denoted in upper-case letters. The output is also case-sensitive because this makes it easier to discern between action names, compound names, abbreviations, and parameters (temperature, duration, etc.). Computer code can be used to extract relevant information from the procedures to conduct analysis and potentially apply them to a variety of robotic synthesizers. Therefore, the syntax is strict, and any minor spelling error causes the whole word to be considered incorrect. A sample from the dataset is available in
Table 1.
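Because the action syntax is strict and machine-readable, information can be extracted from the procedures programmatically, as noted above. The sketch below illustrates this with a hypothetical parser; the semicolon-delimited sample string and the `parse_procedure` helper are illustrative assumptions, not the dataset's actual delimiter conventions.

```python
# Hypothetical parser for the simplified action notation described above.
# Assumption (ours, for illustration): actions are separated by ";" and each
# action is one of the 28 known, case-sensitive action words, followed by
# its space-separated parameters.

KNOWN_ACTIONS = {
    "Add", "CollectLayer", "Concentrate", "Degas", "DrySolid", "DrySolution",
    "Extract", "Filter", "FollowOtherProcedure", "MakeSolution", "Microwave",
    "OtherLanguage", "Partition", "PH", "PhaseSeparation", "InvalidAction",
    "Purify", "Quench", "Recrystallize", "NoAction", "Reflux", "SetTemperature",
    "Sonicate", "Stir", "Triturate", "Wait", "Wash", "Yield",
}

def parse_procedure(text):
    """Split a procedure string into (action, parameters) pairs.

    The syntax is strict and case-sensitive, so an unknown or misspelled
    action word is rejected rather than silently accepted.
    """
    steps = []
    for chunk in text.split(";"):
        tokens = chunk.strip().split()
        if not tokens:
            continue
        action, params = tokens[0], tokens[1:]
        if action not in KNOWN_ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        steps.append((action, params))
    return steps

steps = parse_procedure("Add $R1$; Stir 25C 2h; Yield $P1$")
```

Such a parser is also the first step toward driving a robotic synthesizer from the predicted text, since each step arrives as a structured (action, parameters) pair.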
The dataset comprises a total of 1200 samples. The data have been checked manually and corrected by a knowledgeable chemist where necessary. While the input data remained predominantly unaltered, the output data saw modifications in instances of duplicated identical actions or instances of incongruent solvent nomenclatures, such as the replacement of “iso propanol” with the correct “isopropanol”. The dataset was shuffled and partitioned into test (100 samples), validation (100 samples), and training sets (1000 samples). To evaluate the influence of the training dataset size on the algorithm performance, additional training subsets of 500, 150, and 50 samples were created. Notably, the 150-sample dataset encompasses all instances present in the 50-sample set, and similarly, the 500-sample set contains all from the 150-sample set. The exploration of smaller training subsets within our dataset is based on various considerations. In other reaction classes beyond esterification, the number of reactions available may not be as extensive; therefore, it is meaningful to simulate conditions where data scarcity is a factor. Furthermore, there are logistical constraints such as the high costs associated with chemical experiments, limitations in existing data repositories, and the significant effort required to cleanse data of noise, which may impede the collection of large training sets. This reality underscores the value of testing the algorithm’s performance with reduced sample sizes. Additionally, this approach is utilized to assess the sufficiency of smaller datasets for model training, leveraging the repetitive nature of sentence structures and recurrent action patterns within procedural texts, which could allow for effective learning and generalization from fewer examples.
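The shuffling and nested-subset scheme described above can be reproduced with a few lines of standard-library Python: shuffle once, carve off the test and validation sets, then take prefixes of the training pool so that each smaller subset is contained in the next larger one. The placeholder records and the seed below are our own illustrative choices.

```python
import random

random.seed(0)  # arbitrary seed for reproducibility (our choice)
dataset = [f"sample_{i}" for i in range(1200)]  # placeholder records
random.shuffle(dataset)

# 100 test, 100 validation, 1000 training samples
test, validation, train = dataset[:100], dataset[100:200], dataset[200:]

# Nested training subsets: the 50-sample set is contained in the
# 150-sample set, which is contained in the 500-sample set, and so on.
subsets = {n: train[:n] for n in (50, 150, 500, 1000)}
```

Taking prefixes of a single shuffled pool guarantees the containment property without any extra bookkeeping.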
The input character sequences span a range from 22 to 293 characters in length. The average input lengths across subsets were consistent: validation (85.83 characters), testing (92.38 characters), and training sets of 1000 (87.09 characters), 500 (84.92 characters), 150 (83.17 characters), and 50 samples (82.24 characters). The synthesis procedures (output) ranged from a minimum of 4 actions to a maximum of 29. The average action counts in the outputs across the subsets were also uniform: validation (12.07 actions), testing (11.68 actions), and training sets of 1000 (12.26 actions), 500 (12.34 actions), 150 (11.91 actions), and 50 samples (11.42 actions). Such consistency indicates that the training, validation, and testing subsets are representative, ensuring reliable test scores and confident conclusions regarding the task. The data are presented in
Table 2. The full dataset is publicly available in the project repository online:
https://github.com/Mantas-it/LLM_Esterification (accessed on 9 November 2023).
To establish a baseline for the trained models, random and majority methods were selected. The random method returns a randomly chosen output from the training dataset, while the majority method simply returns the most common output sentence in the training dataset. The baseline BLEU scores are calculated using the entire training dataset, which consists of 1000 examples. The majority and random baselines yield BLEU scores of 24.16 and 36.56, respectively. The test results are later compared to these baselines to determine whether the fine-tuned models perform better. If the results surpass these baselines, it will suggest that the fine-tuned models possess sufficient predictive capability to be considered as potential solutions for our problem. In such a case, the next step is to identify the approach that demonstrates the optimal performance among all of the tested ones.
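A minimal sketch of the two baselines, assuming the training outputs are available as a list of procedure strings (the toy data below are invented for illustration):

```python
import random
from collections import Counter

def majority_baseline(train_outputs):
    """Always predict the most frequent procedure in the training data."""
    most_common, _count = Counter(train_outputs).most_common(1)[0]
    return most_common

def random_baseline(train_outputs, rng):
    """Predict a procedure drawn at random from the training data."""
    return rng.choice(train_outputs)

# Toy training outputs (illustrative only)
train_outputs = ["Add $R1$; Stir; Yield $P1$"] * 3 + ["Reflux; Yield $P1$"]
rng = random.Random(0)

pred_majority = majority_baseline(train_outputs)
pred_random = random_baseline(train_outputs, rng)
```

Both baselines ignore the input reaction entirely, which is precisely why beating them indicates that a model has learned something from the reactants and products.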
5. Applied Machine Learning Approaches
In the field of computational chemistry, translating molecular representations into a machine-readable format is a prerequisite for the application of machine learning techniques. This transformation is crucial because the methodologies we evaluate require numerical input; therefore, the textual inputs must be vectorized. The utility of supervised machine learning methodologies is well-established in the domain of predictive modelling, particularly in scenarios where there is a clear mapping from input to output. In the context of our study, which aims to predict organic chemistry procedures, the supervised learning paradigm serves as an appropriate framework for training models on the curated dataset. Consequently, we have explored several vectorization approaches and tested three machine learning algorithms, each accompanied by a selection of hyper-parameters to optimize their performance for our task.
FLAN-T5 Model. The model FLAN-T5-base (created by Google) [
44] is an evolved variant of the T5 [
45] architecture that can vectorize SMILES strings into contextually rich representations that may capture the chemical semantics embedded within them. T5’s foundational transformer architecture employs the SentencePiece tokenizer [
46], which breaks down words into subwords or tokens. Each of these tokens is then vectorized into context vectors of varying lengths, depending on the model scale used. In our case, context vectors are refined within the token’s vicinity, a process intrinsic to the T5 learning paradigm, where embeddings are adjusted to reflect the token’s contextual relevance. These vectors, once concatenated, maintain the discrete boundaries between tokens while forming a comprehensive sequence representation [
47]. For SMILES notations, the tokenization may differ from natural language processing, potentially involving individual characters as tokens. Regarding the sequence and context vector lengths, FLAN-T5-base uses a maximum input length of 512 tokens with an embedding dimension of 768. The maximum output length was also set to 512. This FLAN-T5 model was further fine-tuned on a domain-specific dataset, focusing on esterification reactions. The fine-tuning process was not conducted from scratch but was based on models that were pre-trained by Google. The FLAN-T5-base model was fine-tuned using Hugging Face’s library [
48]. Predominantly, a text-to-text transformer model is premised on the idea that most NLP tasks can be framed as a text-to-text problem, where both the input and output are sequences of text [
49,
50,
51]. The model leverages an encoder-decoder structure, using a stack of self-attention mechanisms. Fine-tuning is important to adapt the generalized knowledge of pre-trained models to specific, often narrower, domains or applications. This process usually involves additional training epochs on a target dataset, allowing the model to specialize and achieve higher performance metrics in specific tasks. The model used in this study has been fine-tuned with different sizes of datasets, with a learning rate of 0.00005, which has been found to lead to optimal results at around epoch 60 in most cases. A batch size of 4 has been used with four gradient accumulation steps, resulting in an effective batch size of 16.
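The effective batch size of 16 arises from accumulating gradients over four micro-batches of four samples before each optimizer step. The framework-free toy sketch below illustrates why the accumulated average equals a single full-batch step; it is a conceptual illustration, not the actual Hugging Face training loop, and the stand-in `grad` function is our own simplification.

```python
# Toy illustration of gradient accumulation: averaging gradients over
# 4 micro-batches of 4 samples is equivalent to one optimizer step on a
# full batch of 16 (for a loss averaged over samples).

def grad(sample):
    # stand-in "per-sample gradient": here simply the sample value itself
    return float(sample)

samples = list(range(16))
micro_batches = [samples[i:i + 4] for i in range(0, 16, 4)]

# Accumulate micro-batch gradient sums, then average once per optimizer step
accumulated = 0.0
for mb in micro_batches:
    accumulated += sum(grad(s) for s in mb)
step_grad_accum = accumulated / len(samples)

# Reference: one pass over the full batch of 16
step_full_batch = sum(grad(s) for s in samples) / len(samples)
```

The equivalence holds because summation is associative; accumulation simply trades memory (smaller micro-batches) for extra forward/backward passes.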
OpenAI’s GPT models. OpenAI’s GPT models utilize an architecture known as transformers, which are pivotal for their language generation capabilities [
52,
53]. The core idea behind these models is the use of embeddings, similar to FLAN-T5 embeddings, to convert words or tokens into continuous vector spaces. The models utilize attention mechanisms, particularly self-attention, to weigh the importance of different words in a sentence, allowing them to generate coherent and contextually relevant text. The larger the model (GPT-3 and GPT-4 being examples of large models), the more capacity it has to store relationships in its embeddings. OpenAI provides three models for fine-tuning (GPT-3.5-turbo (proprietary information), davinci-002 (12288 embedding dimensions), and babbage-002 (2048 embedding dimensions)). These models are part of the GPT series of models, which are large-scale language models designed to generate text. The models named Davinci and GPT-3.5-turbo are among the largest models in the GPT-3 series by OpenAI, while babbage-002 is a smaller variant of the GPT-3 series. The GPT-3.5-turbo model can include a system message, which after testing in OpenAI’s playground was set to “Write a very concise procedure given the reactants and products, esterification reaction. Use one-word actions and precise temperatures and durations. Skip measurements”. The system message gives very concrete instructions and results in a procedure being produced with simple words and parameters. It is important to note that, during testing, the models’ predictions before fine-tuning were very general and did not match our dataset at all; therefore, an objective evaluation was not performed. This suggests that these models (regardless of the version: GPT-3.5-turbo, davinci-002, babbage-002, etc.) are not able to make a reasonable guess and do not contain enough knowledge to create one. While they are known to have some initial understanding of organic chemistry, our task appeared to be too specific.
The temperature hyper-parameter (which affects the randomness of the model’s generated output) was set to 0.05, as this was found to be the most reasonable for fine-tuned models, resulting in mostly deterministic and sensible predictions, and has been shown to work best with custom prompts [
35]. All models have been trained separately for three and six epochs, as values higher than six resulted in overfitting and degraded performance scores.
TF-IDF+kNN. The Term Frequency-Inverse Document Frequency (TF-IDF) vectorization method is utilized for textual data to represent the significance of terms within a given text or document. The TF-IDF can be adjusted to perform the character-level vectorization of SMILES strings (e.g., in [
54,
55]). For our application, the vectorizer was set up to be case-sensitive to accommodate SMILES notation. In its foundational sense, the TF-IDF calculation reflects the significance of a character in a reaction relative to its frequency across multiple reaction notations in SMILES [
56]. For our method, the vectorizer inherently ranks characters based on their significance in the training dataset, potentially highlighting frequently occurring chemical motifs or functional groups in the reactions. Although TF-IDF offers a robust means of vectorization, it is essential to note that its efficacy depends on the nature of the data and the application. While this popular vectorization method is the simplest among all of our tested approaches, the dataset input contains a limited number of distinct terms within esterification reactions. Consequently, TF-IDF may already be sufficient for solving our problem or may at least serve as an alternative baseline. The vectorization has been performed for the TF-IDF matrix using TfidfVectorizer from the scikit-learn Python module [
57], using our prepared dataset. The kNN algorithm is a relatively simple memory-based approach [
58], with the training phase involving the straightforward storage of training instances. Specifically, the kNN methodology involves identifying the k (in our case, k = 1) instances in the training set that are closest to the tested one. The tested instance obtains the textual output of the closest one. The “closeness” is determined by the distance (Euclidean [
59], Levenshtein [
60]) or similarity (cosine [
61]) metrics. All three of these metrics were experimentally investigated with our dataset, resulting in the best performance with the Levenshtein metric.
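A standard-library sketch of the best-performing 1-NN variant, which retrieves the procedure of the training reaction closest to the query under Levenshtein distance (the toy SMILES pairs are invented for illustration; the actual pipeline also builds a TF-IDF matrix for the Euclidean and cosine variants):

```python
def levenshtein(a, b):
    """Edit distance via row-by-row dynamic programming (stdlib only)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def knn_predict(query, train_pairs):
    """1-NN retrieval: return the procedure of the closest training reaction."""
    _, best_output = min(train_pairs,
                         key=lambda pair: levenshtein(query, pair[0]))
    return best_output

# Toy (input reaction, output procedure) training pairs, illustrative only
train_pairs = [
    ("CCO.CC(=O)O>>CC(=O)OCC", "Add $R1$; Add $R2$; Reflux; Yield $P1$"),
    ("c1ccccc1O.CC(=O)Cl>>CC(=O)Oc1ccccc1", "Add $R1$; Stir; Yield $P1$"),
]
pred = knn_predict("CCO.CC(=O)O>>CC(=O)OC", train_pairs)
```

Because Levenshtein operates directly on the raw SMILES strings, this variant needs no vectorization step at all, which partly explains its robustness on near-duplicate reactions.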
6. Results
The following experiments were performed with all subsets (described in
Section 4), using the vectorization methods and supervised machine learning algorithms (in
Section 5). The models were tested by comparing the true values from the testing dataset to the generated ones. The BLEU (Bilingual Evaluation Understudy) score metric was used. The BLEU score, originally developed for evaluating machine translation quality, has become a valuable metric in today’s diverse NLP tasks for assessing the quality of generated text [
62]. It offers a numerical measure of how closely the generated text matches the reference.
The BLEU score calculates the geometric mean of the modified n-gram (N = 4) precisions (Pn), multiplied by a brevity penalty (BP), which helps to mitigate the issue of generated texts that are shorter than the reference:

BLEU = BP · exp( (1/N) · Σ_{n=1}^{N} log Pn ),

where N is the maximum order of n-grams considered, and the brevity penalty (BP) is defined as:

BP = 1 if c > r, and BP = e^(1 − r/c) if c ≤ r,

where c is the length of the generated text and r is the length of the reference text.
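As a concrete reference, a minimal, unsmoothed implementation of this BLEU computation (clipped n-gram precision up to N = 4 plus the brevity penalty) can be written with the standard library alone; production evaluations would typically use an established package such as NLTK or sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU matching the formula above (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if clipped == 0:
            return 0.0  # unsmoothed BLEU is zero if any precision is zero
        log_precisions.append(math.log(clipped / total))
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * math.exp(sum(log_precisions) / max_n)

score = bleu("Add $R1$ ; Stir ; Yield $P1$", "Add $R1$ ; Stir ; Yield $P1$")
```

Note that this returns a value in [0, 1]; the scores reported in this paper follow the common convention of scaling by 100.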
The best results for each model and dataset size are illustrated in
Figure 1. For each model, different sizes of the training datasets that were used to fine-tune the model (for example—size 50, size 150, etc.) are displayed in different colors. A full table of our results can be found in
Appendix A.
7. Discussion
Upon closer observation, one can note the correlation between the data size and performance in all of the tested approaches. The increased performance observed when moving from 50 to 1000 training data points indicates that all of the tested deep learning models, particularly FLAN-T5, benefit significantly from larger datasets. The only exception is davinci-002 (between size 50 and size 150); however, the difference is minimal. Such an observation suggests that larger datasets are necessary for even better results. The FLAN-T5 model benefits the most from the increased size of the training dataset (a difference in score of 12.48 between size 50 (39.34) and 1000 (51.82)), while GPT-3.5-turbo saw the least improvement (6.45) between size 50 (41.54) and 1000 (47.99). This also suggests that larger models need even more data to benefit from the size of the training dataset. Considering the nature of chemistry, each input is often unique, because different products are being synthesized; therefore, models must deal with a significant variety of text. It has been shown that few-shot learning can be applied to a variety of tasks; however, in contrast to English-language text, where recurrent phrases can assist in semantic extraction, the challenges in chemistry text processing are evident [
63]. Therefore, any discussion of results in the following section will consider only training on the largest dataset.
Compared to the random and majority baseline scores (random—36.56, majority—24.16), all of the tested methods with all sizes of training datasets demonstrate a reasonable performance and predictive power. Looking at
Table 3, the FLAN-T5 model has emerged as the best among all of the tested approaches, achieving the top BLEU score of 51.82, which can be considered high-quality translation. One can attribute this to the sequence-to-sequence nature of FLAN-T5, which makes it well-suited for translating chemical reactions from reactants and products to procedures. The OpenAI models (particularly GPT-3.5-turbo) underperform compared to FLAN-T5 but still deliver promising results. With a BLEU score (47.99) only marginally lower than that of FLAN-T5, GPT-3.5-turbo showcases the versatility of transformer architectures, which, although initially designed for language processing, can learn to understand the intrinsic structure and relationships in molecular notation. Of course, structured, but more natural, language in the output makes the task a little easier, as models consider all of their context (including the generated content), but the input remains complicated. Paradoxically, despite starting from a lower position (i.e., lacking the ability to generate any procedures, even very rudimentary ones, and not having demonstrated proficiency in answering questions of a chemical nature), the FLAN-T5-base model (with 250 M parameters) was able to surpass GPT-3.5-turbo (154 billion parameters), which was capable of generating very basic and general procedures for synthesis reactions. GPT-3.5-turbo possessed extensive language knowledge, which may have overlapped with its understanding of chemical knowledge, potentially leading to ambiguity. In contrast, FLAN-T5 gathered all of the necessary information primarily from the training dataset. This difference may help explain why FLAN-T5 outperformed GPT-3.5-turbo.
When comparing the GPT models with each other, GPT-3.5-turbo and davinci-002 (with 6 billion parameters) achieved very similar results, especially with the largest dataset, and both outperformed babbage-002 (125 million parameters), which is the smallest of the three. There seems to be a clear correlation between a GPT model’s complexity and its ability to adapt effectively to novel problems it has never encountered before.
Interestingly, the TF-IDF+kNN algorithm is not the worst performer; it was able to outperform the generative transformer babbage-002. The highest BLEU score of 46.69 was achieved with the Levenshtein distance metric using the largest training dataset. Its underlying principle of using similar reactions to predict procedures is well aligned with the task at hand. While TF-IDF+kNN is a relatively simple algorithm, its performance, which closely rivals that of some transformer-based methods in this domain, is remarkable.
The evaluated methods demonstrate the potential of both traditional and state-of-the-art models in processing and understanding chemical reactions. The FLAN-T5-base model stands out among the deep learning models with its superior BLEU score, which can be attributed to its sequence-to-sequence architecture. The second-best model, OpenAI’s GPT-3.5-turbo, while not outperforming FLAN-T5-base, also demonstrates strong adaptability. A key takeaway from the results is that fine-tuned OpenAI models can be outperformed by a significantly smaller model with no prior chemistry knowledge, one that can be fine-tuned on a single workstation.
A notable feature of the FLAN-T5-base model is its autonomy: it is an independent, self-contained solution that does not rely on third-party services. Furthermore, the model is available free of charge, which enhances its appeal for researchers and institutions that wish to conduct research without incurring additional costs and reinforces its value as a practical tool where budget constraints are a consideration. Our results demonstrate that, for specific tasks, many different model types and sizes need to be tested to discern the top performers. The positive outcomes also provide a strong impetus for further exploration in this area, including the use of larger datasets that cover a broader range of reaction types and the evaluation of additional models; numerous other large language models exist, and new ones continue to emerge regularly. Specific findings, such as the superior performance of the FLAN-T5 model (despite its initially limited chemistry knowledge) over GPT-3.5-turbo, raise further questions: there may be a need for a model with a clear separation between chemistry-specific input and structured yet natural language generation in the output. This could be an interesting direction for future research.
In conducting our error analysis, the performance of the fine-tuned FLAN-T5-base model was evaluated using prediction examples from the testing dataset. This analysis is crucial, as relying solely on quantitative metrics can be misleading, particularly with smaller datasets, where nuanced discrepancies may not be adequately captured [64]. The examples obtained in this study, detailed in Table 3, fall into three distinct categories: (1) technically correct, (2) partially correct, and (3) incorrect. None of the model’s predictions exactly replicated the original sentences, a result that aligns with expectations given the average count of 12 actions per procedure. Technically correct predictions are those with minor errors that do not compromise procedural integrity. In the highlighted example in Table 3, the action MakeSolution (an action that combines chemical components) lists the compounds in an alternative sequence. Since there is no strict ordering of compound identifiers in such actions, this deviation is inconsequential to the procedural outcome. Similarly, the repeated use of the action Add instead of MakeSolution yielded technically equivalent procedures; such cases accounted for approximately 10% of the test dataset. Most predictions, between 70 and 80%, were categorized as partially correct: they deviated from the true procedures in aspects such as action sequence or parameters. The example in Table 3 shows discrepancies in the temperature and duration of the Stir action, along with differences in how the addition of ethyl acetate, sodium bicarbonate, and water is described. The model’s prediction included assumptions about partitioning and layer collection, as well as an extra washing step with brine, diverging from the original procedure. Such errors, while not invalidating the procedures, vary in their impact: some may hinder the synthesis process, while others could improve clarity or efficiency. Finally, the incorrect category, encompassing 10–20% of the instances and also exemplified in Table 3, comprises predictions that deviate significantly from the true procedures. Even so, this does not necessarily imply failure in the laboratory, as these predictions still follow the general pattern of adding chemicals, processing (waiting, stirring, heating), and subsequent work-up (extraction, filtration, etc.). In conclusion, while these classifications provide insight, the true validity of a prediction can only be ascertained through empirical laboratory testing.
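The “technically correct” equivalences described above can be checked mechanically. The sketch below assumes a deliberately simplified action representation, a list of (action name, compound list) tuples, whereas the real dataset’s 28 action types carry richer parameters (temperatures, durations, etc.). It encodes the two equivalences noted in the error analysis: compound order within MakeSolution is irrelevant, and a run of consecutive Add actions is equivalent to a single MakeSolution.

```python
def canonicalize(actions):
    """Normalize an action sequence so order-insensitive variants compare equal.

    `actions` is a hypothetical, simplified representation:
    a list of (name, compounds) tuples, e.g. ("Add", ["ethanol"]).
    """
    merged = []
    for name, compounds in actions:
        if name == "Add" and merged and merged[-1][0] in ("Add", "MakeSolution"):
            # consecutive additions combine components, like MakeSolution
            merged[-1] = ("MakeSolution", merged[-1][1] + list(compounds))
        else:
            merged.append((name, list(compounds)))
    # compound order within MakeSolution carries no meaning, so sort it
    return [(n, sorted(c)) if n == "MakeSolution" else (n, c) for n, c in merged]

def technically_equal(pred, true) -> bool:
    """True if two procedures differ only in inconsequential ways."""
    return canonicalize(pred) == canonicalize(true)
```

A heuristic like this could automate the first category of the manual triage; the partially correct and incorrect categories still require human (and ultimately laboratory) judgment.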
8. Conclusions
The aim of this paper is to systematically compare machine learning algorithms that interpret and translate SMILES molecular notation into detailed procedural texts for synthesis reactions. A curated dataset of 1200 esterification reactions was prepared specifically for this task and shared publicly. The approach encompasses a set of fine-tuned generative transformer models (based on FLAN-T5, GPT-3.5-turbo, davinci-002, and babbage-002) and a traditional memory-based algorithm (TF-IDF+kNN) for the prediction of esterification procedures. The FLAN-T5 model emerged as the top performer (BLEU of 51.82), closely followed by GPT-3.5-turbo (47.99). The TF-IDF+kNN algorithm, especially with the Levenshtein distance metric, is a good alternative to generative transformer models for this problem. The results illustrate the capability of fine-tuned LLMs to support chemical synthesis procedure planning and optimization. The novelty of this research stems from the diversity of methodologies examined, highlighted by the first reported use of a fine-tuned GPT-3.5-turbo model to predict chemical synthesis procedures for esterification reactions, spanning 28 distinct actions. The findings apply to AI researchers and chemists who could utilize various datasets and LLM fine-tuning techniques to create solutions tailored to particular reaction classes. As the field advances, we anticipate an increase in the availability of services for LLM tuning and creation, making these powerful tools more accessible to scientists without programming expertise. As a natural progression, our future research will focus on: (1) further probing sequence-centric models, such as GPT transformers and other T5 variants, given their demonstrated efficacy, and examining their potential to capture more complex reaction mechanisms and possibly predict multi-step synthesis procedures; (2) augmenting our dataset to encompass a broader range of organic reactions and potentially multiple reaction classes, so as to measure the models’ generalizability across various types of chemical reactions; and (3) validating our models in real-world laboratory settings to discern their practical utility and reliability. Through these empirical trials, we hope to refine the models based on real-world feedback.