Article

Generative LLMs in Organic Chemistry: Transforming Esterification Reactions into Natural Language Procedures

by Mantas Vaškevičius 1,2,*, Jurgita Kapočiūtė-Dzikienė 1 and Liudas Šlepikas 2
1 Department of Applied Informatics, Vytautas Magnus University, LT-44404 Kaunas, Lithuania
2 JSC Synhet, Biržų Str. 6, LT-44139 Kaunas, Lithuania
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(24), 13140; https://doi.org/10.3390/app132413140
Submission received: 10 November 2023 / Revised: 29 November 2023 / Accepted: 6 December 2023 / Published: 11 December 2023
(This article belongs to the Special Issue Applications of Deep Learning and Artificial Intelligence Methods)

Abstract

This paper presents a novel approach to predicting esterification procedures in organic chemistry by employing generative large language models (LLMs) to interpret and translate SMILES molecular notation into detailed procedural texts of synthesis reactions. The esterification reaction is important in producing various industrial intermediates, fragrances, and flavors. Recognizing the challenges of accurate prediction in complex chemical landscapes, we have compiled and made publicly available a curated dataset of esterification reactions to enhance research collaboration. We systematically compare machine learning algorithms, ranging from the conventional k-nearest neighbors (kNN) algorithm to advanced sequence-to-sequence transformer models, including FLAN-T5 and ChatGPT-based variants. Our analysis highlights the FLAN-T5 model as the standout performer, with a BLEU score of 51.82, suggesting that the model has significant potential for reaction planning. Our findings contribute to the growing field of AI in chemistry, offering a promising direction for enhancing the efficiency of reaction planning and chemical synthesis.

1. Introduction

Esterification is a fundamental chemical reaction that is used in the synthesis of esters from acids and alcohols [1,2]. This reaction is important not only in producing various industrial intermediates but also in the manufacture of fragrances and flavors [3]. The outcome of an esterification reaction is often dictated by the procedural steps that are followed. Precise procedures, encompassing parameters like the choice of solvents, temperature, and duration, are important to achieve the desired product with a high yield and purity [4,5]. Such procedures serve as a plan for chemists, guiding them through the process of organic synthesis, ensuring reproducibility, and minimizing the chances of unwanted side products. However, predicting the optimal procedure for a given set of reactants remains a challenge, often requiring iterative experimentation with various chemical reagents (acid-catalyzed Fischer esterification, Steglich esterification, etc.). Recent advancements in deep learning (DL) have demonstrated its potential in modeling chemical properties and reactions [6,7,8,9]. In addition, a shift in computational research methodologies towards viewing chemistry as a text-to-text task signifies a new perspective in the domain [10]. By treating chemical reactions as sequences, like sentences in natural language processing, researchers can utilize machine learning models, originally designed for language translation, to predict chemical outcomes. This novel perspective allows for advanced predictions, optimizations, and innovations in the chemical domain using linguistic models for scientific advancement. Leveraging large language models (LLMs) that have been pre-trained on vast datasets inherently grants the capacity to understand and generate chemical text to a certain degree. For generative text-to-text tasks, such models can be adapted to chemical contexts, bridging the gap between language processing and chemistry [11]. Building upon this, our research aims to develop an in silico methodology to predict accurate procedures for esterification reactions. We utilize a dataset of esterification reactions to fine-tune LLMs and then test the performance of the models. Using the methodology presented in this paper, chemists may reconsider their synthesis strategies, ultimately optimizing their reaction conditions before initiating the actual reaction. This predictive approach promises tangible benefits, such as increased efficiency and substantial savings in terms of time and resources.

2. Related Work

Predicting the optimal procedures for organic reactions, including esterification, is a complex task due to several factors. Firstly, esterification is an umbrella term for a foundational reaction in organic chemistry, which is responsible for forming esters from diverse processes [12]. The Fischer esterification, a classical example, involves the reaction of a carboxylic acid with an alcohol in the presence of an acid catalyst [13]. A more contemporary technique is the Steglich esterification, which engages carbodiimides, such as EDAC (1-Ethyl-3-(3-dimethylaminopropyl)carbodiimide), together with DMAP (4-Dimethylaminopyridine) as a catalyst, to expedite the interaction between carboxylic acids and alcohols [14]. Additionally, the Mitsunobu reaction, which deploys DEAD (Diethyl azodicarboxylate) along with triphenylphosphine, offers an alternative for the synthesis of esters from primary and secondary alcohols [15]. The solvent's role in esterification reactions cannot be overstated, dictating aspects like the equilibrium and reaction speed. Common solvents, such as DMF (Dimethylformamide), DCM (Dichloromethane), and THF (Tetrahydrofuran), are usually used in these reactions [16,17]. Historically, chemists have relied on empirical knowledge, documented procedures, and iterative experimentation to determine the best conditions for a given synthesis [18]. However, with the increasing complexity of organic molecules containing various functional groups and the need for efficient and sustainable synthesis, there is a growing demand for computational tools that can predict the procedures and parameters of reactions. Several computational methods have been developed to predict reaction conditions, utilizing databases of known reactions. These methods, while promising, often require extensive computational resources and may not always provide accurate predictions for novel or less-studied reactions [19]. The evolution of computational methods has seen a pivot toward using machine learning and artificial intelligence to interpret reaction datasets. These techniques excel at discerning patterns within the data, thereby possibly improving the predictions for novel reactions. Contrasting with previous approaches, such as the one employing a transformer-based sequence-to-sequence model which attained a BLEU score of 54.7 through text-based representations [20], our research undertakes a similar predictive task but diverges in several key aspects. We introduce a novel and specific dataset for the esterification reaction, employ an alternative procedural notation, and utilize a distinct linguistic framework. Furthermore, our methodological innovation is in processing only the SMILES representations of the molecules that are implicated in the reactions, focusing the input specifically on the transformative elements of the reactions.
Deep learning (DL) methodologies have shown remarkable success in various chemistry tasks [21]. Recent advancements in transformer architectures have significantly impacted molecular generation in drug discovery. Noteworthy models such as MolBERT, ChemGPT, T5Chem, MolFormer, Chemformer, and BARTSmiles, which employ NLP techniques, demonstrate this influence [22,23,24,25,26,27]. For example, Chemformer and T5Chem are pre-trained on extensive SMILES strings from ZINC-15 [28] and Pubchem, respectively, and are fine-tuned for chemical prediction tasks. Additionally, graph transformers, such as RetroExplainer, have been used in retrosynthesis, automating organic chemistry processes through a transparent, deep learning-guided molecular assembly process [29]. Further broadening the scope of DL methodologies, Galactica has been trained on diverse scientific data, including chemical information, with a character-based approach to SMILES tokenization [30]. A novel contribution to this field is nach0, an encoder-decoder LLM pre-trained on the scientific literature and molecule strings, and excelling in various chemical and biological tasks with its ability to generate high-quality molecular and textual outputs [31]. This model stands out for its performance in both single and cross-domain tasks, surpassing existing models in efficiency and output quality. Despite these advancements, specific research on GPT-3.5 and GPT-4 in chemistry applications like reaction prediction and retrosynthesis remains limited [32]. While GPT models initially lagged behind existing ML baselines, partly due to challenges in interpreting molecular SMILES strings, recent developments in fine-tuned GPT-3.5-turbo models have shown promise. These models outperform traditional transformers, including the T5 and Llama2-13b-chat [33] models, particularly in extracting action sequences from experimental procedures [34]. OpenAI’s models, such as davinci-002 and GPT-3.5-turbo, are competent at chemistry questions with reasonable prompts [35]. Even earlier models, such as GPT-3, have been shown to perform impressively well for a wide range of questions about chemistry [36]. LLMs have also been used to power ChemCrow, an innovative method that integrates computational tools with chemistry, showcasing its capability in planning syntheses and solving various chemical reasoning tasks, from simple drug discovery to intricate molecular design [37]. Similarly, a GPT-4 model was utilized to create a multi-LLM agent that can autonomously design, plan, and execute complex scientific experiments [38].
In this paper, we test different methods that predict the procedure of a reaction, which consists of a sequence of formally described actions, such as Add, Heat, Extract, and Crystallize, and parameters that are associated with the actions, such as the temperature, duration, solvents, and catalysts. We use a dataset of 1200 reactions and test the k-nearest neighbors (kNN) algorithm, fine-tuned OpenAI models (GPT-3.5-turbo, davinci-002, babbage-002), and a fine-tuned FLAN-T5 model. The contributions of this research are as follows: (1) the pioneering use of the fine-tuned GPT-3.5-turbo model to predict chemical synthesis procedures for the esterification reaction, which encompasses an extensive array of actions (28 distinct actions); (2) a distinctive aspect of our approach is the exclusion of ancillary compounds from the model inputs: only reactants and products are provided, and we deliberately omit any non-reaction-specific agents such as gases, solvents, and catalysts, including EDAC or DMAP. The complexity of the task is increased because the model cannot depend exclusively on the input for forecasting all the requisite steps and parameters, encompassing the ancillary compounds. Nonetheless, the output is significantly more useful to the researcher because, particularly with novel compounds lacking extensive synthesis documentation, only the reactants and products are typically predetermined before initiating laboratory experiments. Our study conducts a comprehensive comparison between cutting-edge artificial intelligence models and conventional algorithms, setting a standard for subsequent inquiries in this domain. Consequently, our research presents an innovative perspective on reaction planning, highlighting the synergy between LLMs and chemical processes.

3. Formal Definition of Tasks

In the paper, a generative text-to-text problem is solved. Given a source chemical reaction description, r = (r1, r2, …, rn), in SMILES notation (e.g., reactant.reactant >> product), the task is to generate a target procedure description, p = (p1, p2, …, pm), in a formal, machine-readable format that conveys the specific steps, conditions, and parameters for the described reaction. Let R be the space of all possible source reaction descriptions in SMILES notation and P be the space of all possible target procedure descriptions in the formal, machine-readable format. In our case, P is restricted to esterification reactions, a category of synthesis reactions of esters from alcohols and acids. Let Θ be an ML algorithm that can learn a function, ϕ: R → P, which maps a source reaction description to its corresponding target procedure description.
The goal of Θ is to learn an approximation (denoted as ϕ̂) of the function ϕ from a training dataset D_R ⊂ R, where each source reaction description, r, in D_R has a corresponding target procedure description, p, in the formal format. The learned function ϕ̂ is evaluated on a separate testing dataset, D_T ⊂ R, which consists of reaction descriptions that have not been seen during the training phase. Finally, the model's performance is evaluated based on how similar the predictions are to the target procedures using an objective evaluation metric.

4. The Data

The dataset utilized in our experiments is derived from a comprehensive set of synthesis procedures found in USPTO and EPO patents issued between 1971 and 2022 [39]. The dataset was constructed through a unique methodology proposed in that article, which combines machine learning algorithms and scripts to systematically extract and transform experimental procedures from these patents into structured actions. The pipeline involves two primary tasks: firstly, classifying patent paragraphs to accurately identify chemical procedures, and secondly, converting these procedures into a structured format. The dataset differs from the commonly used USPTO-50k [40] or USPTO-MIT [41] because it includes both reactants and products in SMILES format along with the synthesis procedures, which are in a simplified and machine-readable format. The second version of the publicly available dataset has been used because it has been additionally improved by the removal of irregular action sequences. While the primary dataset encompasses various reaction classes, we employed the open-source software DataWarrior 5.5.0 [42] to isolate only esterification reactions. Esterification reactions were isolated from the broader dataset because of the scope of the raw data, which includes millions of instances, necessitating a focused subset for detailed analysis. Esterification, despite being a common class of chemical reaction, encompasses a variety of subtypes, each with distinct procedural steps, rendering the task of accurate procedure prediction challenging. This choice permits an in-depth exploration of a well-defined reaction type within a manageable dataset size, enabling the development and refinement of the ML algorithms for complex, real-world applications. The refined dataset comprises pairs of input and output instances. The inputs are represented as single lines of text, denoting reactants and products in SMILES (Simplified Molecular Input Line Entry System [43], used in chemistry to represent chemical structures simply and unambiguously) notation, whereas the outputs describe a series of actions and their respective parameters. The actions are limited and represented in a structured and simplified format: a solitary word signifies the action, succeeded by its specific parameters. The actions are: Add, CollectLayer, Concentrate, Degas, DrySolid, DrySolution, Extract, Filter, FollowOtherProcedure, MakeSolution, Microwave, OtherLanguage, Partition, PH, PhaseSeparation, InvalidAction, Purify, Quench, Recrystallize, NoAction, Reflux, SetTemperature, Sonicate, Stir, Triturate, Wait, Wash, and Yield. The schema employed for action, parameter naming, and formatting was initially introduced by Lowe and subsequently refined by IBM, and it currently stands as the most exhaustive for this task. For our application, reactants and products have been tokenized and are denoted by $R1$, $R2$, …, $RN$ for reactants and by $P1$, $P2$, …, $PN$ for products. Such tokenization is efficient, does not require the reaction compounds to be copied over to the resulting procedure, and avoids potential errors in the notation. The input is case-sensitive due to the nature of SMILES notation, whereby aromatic atoms are denoted in lower-case letters, while aliphatic atoms are denoted in upper-case letters. The output is also case-sensitive because it makes it easier to discern between action names, compound names, abbreviations, and parameters (temperature, duration, etc.).
Computer code can be used to extract relevant information from the procedures to conduct analysis and potentially apply them to a variety of robotic synthesizers. Therefore, the syntax is strict and any minor spelling errors in the words cause the whole word to be considered incorrect. A sample from the dataset is available in Table 1.
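To make the strict syntax concrete, the following is a minimal sketch (not the authors' code) of how the semicolon-delimited procedures and the reactant.reactant>>product inputs shown in Table 1 could be parsed into structured records; the helper function names are illustrative.

```python
# Minimal parsing sketch (not the authors' code); the function names are illustrative.
def parse_reaction(smiles_reaction: str):
    """Split a 'reactant.reactant>>product' SMILES string into reactants and products."""
    reactant_part, product_part = smiles_reaction.split(">>")
    return reactant_part.split("."), product_part.split(".")


def parse_procedure(procedure: str):
    """Split a procedure into (ACTION, parameter string) pairs.
    Steps are separated by ';' and the first token of each step is the action name."""
    steps = []
    for step in procedure.rstrip(".").split(";"):
        step = step.strip()
        if not step:
            continue
        action, _, params = step.partition(" ")
        steps.append((action, params))
    return steps


if __name__ == "__main__":
    # Example strings taken from Table 1.
    reaction = "CC(=O)O.OC1(COC2CCCCO2)CC1>>CC(=O)OC1(COC2CCCCO2)CC1"
    procedure = ("MAKESOLUTION with $R2$ and $R1$ and DCM; ADD SLN; ADD DMAP at 25 C; "
                 "STIR for 30 min; ADD DCC at 0 C; QUENCH with water; WAIT for 4 h; "
                 "FILTER keep filtrate; CONCENTRATE; PURIFY; YIELD $P1$.")
    reactants, products = parse_reaction(reaction)
    print(reactants, products)
    print(parse_procedure(procedure)[:3])  # [('MAKESOLUTION', 'with $R2$ and $R1$ and DCM'), ('ADD', 'SLN'), ('ADD', 'DMAP at 25 C')]
```

Because every action keyword must match the 28-item vocabulary exactly, a parser of this kind also makes it straightforward to flag misspelled actions.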
The dataset comprises a total of 1200 samples. The data have been checked manually and corrected by a knowledgeable chemist where necessary. While the input data remained predominantly unaltered, the output data saw modifications in instances of duplicated identical actions or instances of incongruent solvent nomenclatures, such as the replacement of “iso propanol” with the correct “isopropanol”. The dataset was shuffled and partitioned into test (100 samples), validation (100 samples), and training sets (1000 samples). To evaluate the influence of the training dataset size on the algorithm performance, additional training subsets of 500, 150, and 50 samples were created. Notably, the 150-sample dataset encompasses all instances present in the 50-sample set, and similarly, the 500-sample set contains all from the 150-sample set. The exploration of smaller training subsets within our dataset is based on various considerations. In other reaction classes beyond esterification, the number of reactions available may not be as extensive; therefore, it is meaningful to simulate conditions where data scarcity is a factor. Furthermore, there are logistical constraints such as the high costs associated with chemical experiments, limitations in existing data repositories, and the significant effort required to cleanse data of noise, which may impede the collection of large training sets. This reality underscores the value of testing the algorithm’s performance with reduced sample sizes. Additionally, this approach is utilized to assess the sufficiency of smaller datasets for model training, leveraging the repetitive nature of sentence structures and recurrent action patterns within procedural texts, which could allow for effective learning and generalization from fewer examples.
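The following is a minimal sketch, under assumptions about the file layout (the file name and record fields are placeholders), of how the shuffling, the 100/100/1000 split, and the nested 50/150/500/1000 training subsets described above could be produced.

```python
# Assumed data-preparation workflow (file name and field names are placeholders).
import json
import random


def make_splits(pairs, seed=42):
    """Shuffle the dataset and return the test set, validation set, and nested training subsets."""
    random.Random(seed).shuffle(pairs)
    test, valid, train = pairs[:100], pairs[100:200], pairs[200:1200]
    # Nested by construction: the 50-sample subset is contained in the 150-sample one, and so on.
    subsets = {n: train[:n] for n in (50, 150, 500, 1000)}
    return test, valid, subsets


if __name__ == "__main__":
    with open("esterification_dataset.json") as fh:  # hypothetical file of 1200 input/output pairs
        pairs = json.load(fh)
    test, valid, subsets = make_splits(pairs)
    print(len(test), len(valid), {n: len(s) for n, s in subsets.items()})
```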
The input character sequences span a range from 22 to 293 characters in length. The average input lengths across subsets were consistent: validation (85.83 characters), testing (92.38 characters), and training sets of 1000 (87.09 characters), 500 (84.92 characters), 150 (83.17 characters), and 50 samples (82.24 characters). The synthesis procedures (output) ranged from a minimum of 4 actions to a maximum of 29. The average action counts in the outputs across the subsets were also uniform: validation (12.07 actions), testing (11.68 actions), and training sets of 1000 (12.26 actions), 500 (12.34 actions), 150 (11.91 actions), and 50 samples (11.42 actions). Such consistency indicates that the training, validation, and testing subsets are representative, ensuring reliable test scores and confident conclusions regarding the task. The data are presented in Table 2. The full dataset is publicly available in the project repository online: https://github.com/Mantas-it/LLM_Esterification (accessed on 9 November 2023).
To establish a baseline for the trained models, random and majority methods were selected. The random method outputs a randomly selected procedure from the training dataset, while the majority method always outputs the most common procedure in the training dataset. Both baselines draw their outputs from the entire training dataset of 1000 examples. The majority and random baselines yield BLEU scores of 24.16 and 36.56, respectively. The test results are later compared to these baselines to determine whether the fine-tuned models perform better. If the results surpass these baselines, it will suggest that the fine-tuned models possess sufficient predictive capability to be considered as potential solutions for our problem. In such a case, the next step is to identify the one that demonstrates the optimal performance among all of the tested approaches.
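A minimal sketch of the two baselines is shown below; it is not the authors' evaluation script, and the resulting prediction lists would then be scored with a corpus-level BLEU implementation such as the one sketched in Section 6.

```python
# Baseline sketch (not the authors' evaluation script).
import random
from collections import Counter


def random_baseline(train_outputs, n_test, seed=0):
    """Return n_test procedures drawn at random from the training outputs."""
    rng = random.Random(seed)
    return [rng.choice(train_outputs) for _ in range(n_test)]


def majority_baseline(train_outputs, n_test):
    """Return the single most frequent training procedure, repeated n_test times."""
    most_common = Counter(train_outputs).most_common(1)[0][0]
    return [most_common] * n_test

# Usage (train_outputs and test_outputs are lists of procedure strings):
# preds_rand = random_baseline(train_outputs, len(test_outputs))
# preds_maj = majority_baseline(train_outputs, len(test_outputs))
# Both lists are then scored against test_outputs with corpus-level BLEU (see Section 6).
```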

5. Applied Machine Learning Approaches

In the field of computational chemistry, the critical task of translating molecular representations into a machine-readable format is a prerequisite for the application of machine learning techniques. This transformation is crucial, especially given that the methodologies we evaluate necessitate a numerical input. Therefore, such inputs require vectorization. The utility of supervised machine learning methodologies is well-established in the domain of predictive modelling, particularly in scenarios where there is a clear mapping from input to output. In the context of our study, which aims to predict organic chemistry procedures, the supervised learning paradigm serves as an appropriate framework for training models on the curated dataset. Consequently, we have explored several vectorization approaches and tested three machine learning algorithms, each accompanied by a selection of hyper-parameters to optimize their performance for our task.
FLAN-T5 Model. The FLAN-T5-base model (created by Google) [44] is an evolved variant of the T5 [45] architecture that can vectorize SMILES strings into contextually rich representations that may capture the chemical semantics embedded within them. T5's foundational transformer architecture employs the SentencePiece tokenizer [46], which breaks down words into subwords or tokens. Each of these tokens is then vectorized into context vectors of varying lengths, depending on the model scale used. In our case, context vectors are refined within the token's vicinity, a process intrinsic to the T5 learning paradigm, where embeddings are adjusted to reflect the token's contextual relevance. These vectors, once concatenated, maintain the discrete boundaries between tokens while forming a comprehensive sequence representation [47]. For SMILES notations, the tokenization may differ from natural language processing, potentially involving individual characters as tokens. Regarding the sequence and context vector lengths, FLAN-T5-base uses a maximum input length of 512 tokens with an embedding dimension of 768. The maximum output length was also set to 512. This FLAN-T5 model was further fine-tuned on a domain-specific dataset, focusing on esterification reactions. The fine-tuning process was not conducted from scratch but was based on models that were pre-trained by Google. The FLAN-T5-base model was fine-tuned using Hugging Face's library [48]. Predominantly, a text-to-text transformer model is premised on the idea that most NLP tasks can be framed as a text-to-text problem, where both the input and output are sequences of text [49,50,51]. The model leverages an encoder-decoder structure, using a stack of self-attention mechanisms. Fine-tuning is important to adapt the generalized knowledge of pre-trained models to specific, often narrower, domains or applications. This process usually involves additional training epochs on a target dataset, allowing the model to specialize and achieve higher performance metrics in specific tasks. The model used in this study has been fine-tuned with different sizes of datasets, with a learning rate of 0.00005, which has been found to lead to optimal results at around epoch 60 in most cases. A batch size of 4 has been used with four gradient accumulation steps, resulting in an effective batch size of 16.
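The following condensed sketch shows how such a fine-tuning run could be set up with the Hugging Face transformers and datasets libraries using the hyper-parameters reported above (learning rate of 0.00005, batch size of 4 with four gradient-accumulation steps, 512-token limits); the file names, column names, and output directory are placeholders rather than the authors' actual configuration.

```python
# Fine-tuning sketch with Hugging Face transformers/datasets; file and column names are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Assumed CSV layout: a "reaction" column (SMILES input) and a "procedure" column (target text).
data = load_dataset("csv", data_files={"train": "train_1000.csv", "validation": "valid.csv"})


def preprocess(batch):
    enc = tokenizer(batch["reaction"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["procedure"], max_length=512, truncation=True)
    enc["labels"] = labels["input_ids"]
    return enc


tokenized = data.map(preprocess, batched=True, remove_columns=data["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-esterification",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    num_train_epochs=60,             # the paper reports optimal results at around epoch 60
    predict_with_generate=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```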
OpenAI's GPT models. OpenAI's GPT models utilize an architecture known as transformers, which is pivotal for their language generation capabilities [52,53]. The core idea behind these models is the use of embeddings, similar to FLAN-T5 embeddings, to convert words or tokens into continuous vector spaces. The models utilize attention mechanisms, particularly self-attention, to weigh the importance of different words in a sentence, allowing them to generate coherent and contextually relevant text. The larger the model (GPT-3 and GPT-4 being examples of large models), the more capacity it has to store relationships in its embeddings. OpenAI provides three models for fine-tuning: GPT-3.5-turbo (embedding dimensions not disclosed), davinci-002 (12,288 embedding dimensions), and babbage-002 (2048 embedding dimensions). These models are part of the GPT series, which are large-scale language models designed to generate text. The davinci-002 and GPT-3.5-turbo models are among the largest in OpenAI's GPT-3 series, while babbage-002 is a smaller variant. The GPT-3.5-turbo model can include a system message, which after testing in OpenAI's playground was set to "Write a very concise procedure given the reactants and products, esterification reaction. Use one-word actions and precise temperatures and durations. Skip measurements". The system message gives very concrete instructions and results in procedures composed of simple words and parameters. It is important to note that, during testing, the models' predictions before fine-tuning were very general and did not match our dataset at all; therefore, we did not even perform an objective evaluation. This suggests that these models (regardless of the version: GPT-3.5-turbo, davinci-002, babbage-002, etc.) are not able to make a reasonable guess without fine-tuning and do not contain enough knowledge to create one. While they are known to have some initial understanding of organic chemistry, our task appeared to be too specific. The temperature hyper-parameter (affecting the randomness of the models' generated output) was set to 0.05, as this was found to be the most reasonable for fine-tuned models, resulting in mostly deterministic and reasonable predictions, and has been shown to work best with custom prompts [35]. All models were trained separately for three and six epochs, as values higher than six resulted in overfitting and degraded performance scores.
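The sketch below illustrates how a fine-tuned GPT-3.5-turbo model could be queried with the system message and temperature described above, assuming OpenAI's 1.x Python client; the fine-tuned model identifier is a placeholder, and the fine-tuning job itself (uploading the training file and creating the job) is omitted.

```python
# Inference sketch for a fine-tuned GPT-3.5-turbo model (OpenAI Python client 1.x assumed).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_MESSAGE = ("Write a very concise procedure given the reactants and products, "
                  "esterification reaction. Use one-word actions and precise temperatures "
                  "and durations. Skip measurements")


def predict_procedure(reaction_smiles, model_id="ft:gpt-3.5-turbo:example-org::xxxx"):
    """Query the fine-tuned model (model_id is a placeholder for the real fine-tune id)."""
    response = client.chat.completions.create(
        model=model_id,
        temperature=0.05,  # mostly deterministic output
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": reaction_smiles},
        ],
    )
    return response.choices[0].message.content

# Example:
# print(predict_procedure("CC(=O)O.OC1(COC2CCCCO2)CC1>>CC(=O)OC1(COC2CCCCO2)CC1"))
```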
TF-IDF+kNN. The Term Frequency-Inverse Document Frequency (TF-IDF) vectorization method is utilized for textual data to represent the significance of terms within a given text or document. TF-IDF can be adjusted to perform the character-level vectorization of SMILES strings (e.g., in [54,55]). For our application, the vectorizer was set up to be case-sensitive to accommodate SMILES notation. In its foundational sense, the TF-IDF calculation reflects the significance of a character in a reaction relative to its frequency across multiple reaction notations in SMILES [56]. For our method, the vectorizer inherently ranks characters based on their significance in the training dataset, potentially highlighting frequently occurring chemical motifs or functional groups in the reactions. Although TF-IDF offers a robust means of vectorization, it is essential to note that its efficacy depends on the nature of the data and the application. Although this popular vectorization method is the simplest among all of our tested approaches, the dataset input contains a limited number of distinct terms within esterification reactions; consequently, TF-IDF may already be sufficient for solving our problem or at least serve as an alternative baseline. The vectorization has been performed for the TF-IDF matrix using TfidfVectorizer from the scikit-learn Python module [57], using our prepared dataset. The kNN algorithm is a relatively simple memory-based approach [58], with the training phase involving the straightforward storage of training instances. Specifically, the kNN methodology involves identifying the k (in our case, k = 1) instances in the training set that are closest to the tested one. The tested instance obtains the textual output of the closest one. The "closeness" is determined by distance (Euclidean [59], Levenshtein [60]) or similarity (cosine [61]) metrics. All three of these metrics were experimentally investigated with our dataset, resulting in the best performance with the Levenshtein metric.
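A minimal sketch of this retrieval baseline is given below. It shows character-level, case-sensitive TF-IDF with cosine similarity and, as an alternative, a 1-NN lookup using the Levenshtein distance computed directly on the raw SMILES strings; this is our reading of the setup, and the authors' exact pipeline may differ.

```python
# Retrieval baseline sketch: character-level TF-IDF + cosine, and 1-NN with Levenshtein distance
# computed on the raw SMILES strings (our reading of the setup; the exact pipeline may differ).
import Levenshtein  # pip install python-Levenshtein
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def knn_cosine(train_inputs, train_outputs, query):
    vectorizer = TfidfVectorizer(analyzer="char", lowercase=False)  # case-sensitive, character-level
    train_vecs = vectorizer.fit_transform(train_inputs)
    sims = cosine_similarity(vectorizer.transform([query]), train_vecs)[0]
    return train_outputs[int(sims.argmax())]  # k = 1: copy the closest procedure


def knn_levenshtein(train_inputs, train_outputs, query):
    dists = [Levenshtein.distance(query, reaction) for reaction in train_inputs]
    return train_outputs[dists.index(min(dists))]

# Usage: prediction = knn_levenshtein(train_inputs, train_outputs, test_reaction)
```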

6. Results

The following experiments were performed with all subsets (described in Section 4), using the vectorization methods and supervised machine learning algorithms (in Section 5). The models were tested by comparing the true values from the testing dataset to the generated ones. The BLEU (Bilingual Evaluation Understudy) score metric was used. The BLEU score, originally developed for evaluating machine translation quality, has become a valuable metric in today’s diverse NLP tasks for assessing the quality of generated text [62]. It offers a numerical measure of how closely the generated text matches the reference.
The BLEU score is the geometric mean of the modified n-gram precisions, p_n, up to order N = 4, multiplied by a brevity penalty (BP), which helps to mitigate the issue of shorter generated texts compared to the reference:
BLEU = BP \times \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)
where N is the maximum order of n-grams considered, w_n = 1/N are uniform weights, and the brevity penalty (BP) is defined as:
BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}
where c is the length of the generated text and r is the length of the reference text.
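In practice, the corpus-level score can be computed with an off-the-shelf implementation; the paper does not state which library was used, so the sacrebleu-based sketch below is an assumption and absolute values may differ marginally from the reported ones.

```python
# Corpus-level BLEU sketch; sacrebleu is an assumption (the paper does not name a library).
import sacrebleu


def corpus_bleu(predictions, references):
    """predictions: list of generated procedures; references: list of true procedures."""
    # sacrebleu expects the hypotheses plus a list of reference streams (a single stream here).
    return sacrebleu.corpus_bleu(predictions, [references]).score

# Example:
# score = corpus_bleu(model_predictions, test_procedures)
# print(f"BLEU: {score:.2f}")
```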
The best results for each model and dataset size are illustrated in Figure 1. For each model, different sizes of the training datasets that were used to fine-tune the model (for example—size 50, size 150, etc.) are displayed in different colors. A full table of our results can be found in Appendix A.

7. Discussion

Upon closer observation, one can note the correlation between the data size and performance in all of the tested approaches. The increased performance observed from training on 50 to 1000 data points indicates that all of the tested deep learning models, particularly FLAN-T5, benefit significantly from larger datasets. The only exception is davinci-002 (between size 50 and size 150); however, the difference is minimal. Such an observation suggests that larger datasets are necessary for even better results. The FLAN-T5 model benefits the most from the increased size of the training dataset (a difference in score of 12.48 between size 50 (39.34) and 1000 (51.82)), while GPT-3.5-turbo saw the least amount of improvement (6.45) between size 50 (41.54) and 1000 (47.99). This also suggests that the larger models need even more data to benefit from the size of the training dataset. Considering the nature of chemistry, each input is often unique, because different products are being synthesized; therefore, models must deal with a significant variety of text. It has been shown that few-shot learning can be applied to a variety of tasks; however, contrasting our results with the English language, where recurrent phrases can assist in semantic extraction, the challenges in chemistry text processing are evident [63]. Therefore, the remainder of this discussion considers only the models trained on the largest dataset.
Compared to the random and majority baseline scores (random: 36.56, majority: 24.16), all of the tested methods with all sizes of training datasets demonstrate a reasonable performance and predictive power. Looking at Figure 1, the FLAN-T5 model has emerged as the best among all of the tested approaches, achieving the top BLEU score of 51.82, which can be considered a high-quality translation. One can attribute this to the sequence-to-sequence nature of FLAN-T5, which makes it well-suited for translating chemical reactions from reactants and products to procedures. The OpenAI models (particularly GPT-3.5-turbo) underperform compared to FLAN-T5 but still deliver promising results. With a BLEU score only marginally less than FLAN-T5 (47.99), GPT-3.5-turbo showcases the versatility of transformer architectures, which, although initially designed for language processing, can learn to understand the intrinsic structure and relationships in molecular notation. Of course, structured, but more natural, language in the output makes the task a little easier, as models consider all of their context (including the generated content), but the input remains complicated. Paradoxically, despite starting from a lower position (i.e., lacking the ability to generate any procedures, even very rudimentary ones, and not having demonstrated proficiency in answering questions of a chemical nature), the FLAN-T5-base model (with 250 million parameters) was able to surpass GPT-3.5-turbo (reportedly 154 billion parameters), which was capable of generating very basic and general procedures for synthesis reactions. GPT-3.5-turbo possessed extensive language knowledge, which may have overlapped with its understanding of chemical knowledge, potentially leading to ambiguity. In contrast, FLAN-T5 gathered all of the necessary information primarily from the training dataset. This difference may help explain why FLAN-T5 outperformed GPT-3.5-turbo.
When comparing the GPT models with each other, GPT-3.5-turbo and davinci-002 (with 6 billion parameters) achieved very similar results, especially with the largest dataset, and both outperformed babbage-002 (125 million parameters), which is the smallest of the three. There seems to be a clear correlation between a GPT model’s complexity and its ability to adapt effectively to novel problems it has never encountered before.
Interestingly, the TF-IDF+kNN algorithm is not the worst, i.e., it was able to outperform the generative transformer babbage-002. The highest BLEU score of 46.69 was achieved with the Levenshtein distance metric using the largest training dataset. Its underlying principle of using similar reactions to predict procedures seems aligned with the task at hand. While TF-IDF+kNN is a relatively simple algorithm, its performance, which closely rivals that of some transformer-based methods in this domain, is notable.
The evaluated methods demonstrate the potential of both traditional and state-of-the-art models in processing and understanding chemical reactions. The FLAN-T5-base model stands out among deep learning models with its superior BLEU score, which can be attributed to its sequence-to-sequence architecture. However, the second-best, OpenAI's GPT-3.5-turbo, while not outperforming FLAN-T5-base, also demonstrates its adaptability. One of the key takeaways from the results is that fine-tuned OpenAI models can be outperformed by a significantly smaller model with no prior chemistry knowledge that can be fine-tuned on a single workstation.
A feature of the FLAN-T5-base model is its autonomous nature, serving as an independent, self-contained solution that does not require reliance on third-party services. Furthermore, this model is accessible free of charge, enhancing its appeal for researchers and institutions looking to perform research without incurring additional costs. This aspect reinforces the model's value as a practical tool for scientific inquiry where budget constraints are a consideration. Our results demonstrate that, for specific tasks, models of many different types and sizes need to be tested to discern the top performers. Furthermore, the positive outcomes provide strong impetus for further exploration in this area, including the utilization of larger datasets that encompass a broader range of reaction types and the exploration of various models. Moreover, there exist numerous other large language models, and new ones continue to emerge regularly. In addition, specific findings, such as the superior performance of the FLAN-T5 model (despite its initial limited chemistry knowledge) compared to GPT-3.5-turbo, raise further questions. There might be a need for a model with a clear separation between chemistry-specific input and structured yet natural language text generation in the output. This could also be an interesting direction for future research.
In conducting our error analysis, the performance of the fine-tuned FLAN-T5-base model was evaluated using prediction examples from the testing dataset. The analysis of these errors is crucial, as reliance solely on quantitative metrics can be misleading, particularly when dealing with smaller datasets where nuanced discrepancies may not be adequately captured [64]. The examples obtained in this study, detailed in Table 3, fall into three distinct categories: (1) technically correct, (2) partially correct, and (3) incorrect. It was observed that none of the model's predictions exactly replicated the original sentences, a result that aligns with expectations given the average action count of 12 in the procedures. For a more nuanced understanding, the predictions were segmented into these three classifications. Technically correct predictions are deemed those with minor errors that did not compromise the procedural integrity. An exemplar case is the first example in Table 3, where the action MakeSolution, an action that combines chemical components, is predicted with the compounds in an alternative sequence. Since there is no strict ordering for compound identifiers in such actions, this deviation is inconsequential to the procedural outcome. In similar instances, the repeated use of the action Add instead of MakeSolution yielded technically equivalent procedures, accounting for approximately 10% of the test dataset. Most predictions, ranging between 70 and 80%, were categorized as partially correct. These predictions deviated in certain aspects, like the action sequence or parameters, from the true procedures. The second example in Table 3 shows discrepancies in the temperature and duration of the Stir action, along with differences in the description of adding ethyl acetate, sodium bicarbonate, and water. The model's prediction included assumptions about partitioning and layer collection, along with an extra washing step with brine, diverging from the original procedure. Such errors, while not invalidating the procedures, vary in their impact: some may hinder the synthesis process, while others could enhance clarity or efficiency. Finally, the incorrect category, encompassing 10–20% of the instances and exemplified by the third example in Table 3, includes predictions with significant deviations from the true procedures. However, this does not inherently imply laboratory ineffectiveness, as these predictions often follow a general pattern of adding chemicals, processing (waiting, stirring, heating), and subsequent work-up (extraction, filtration, etc.). In conclusion, while these classifications provide insight, the true measure of a prediction's validity can only be ascertained through empirical laboratory testing.

8. Conclusions

The aim of this paper is to systematically compare machine learning algorithms that can interpret and translate SMILES molecular notation into detailed procedural texts of synthesis reactions. A curated dataset of 1200 esterification reactions was prepared specifically for our task and shared publicly. The approach in this paper encompasses a set of fine-tuned generative transformer models (based on FLAN-T5, GPT-3.5-turbo, davinci-002, and babbage-002) and a traditional memory-based algorithm (TF-IDF+kNN) for the prediction of esterification procedures. The FLAN-T5 model emerged as the top performer (with a BLEU score of 51.82), closely followed by GPT-3.5-turbo (BLEU of 47.99). The TF-IDF+kNN algorithm, especially when using the Levenshtein distance metric, is a good alternative to the generative transformer models for our problem. The results illustrate the capability of fine-tuned LLMs to be used in the field of chemical synthesis procedure planning and optimization. The novelty of this research stems from the diversity of methodologies examined, highlighted by the first instance of the implementation of a fine-tuned GPT-3.5-turbo model to predict chemical synthesis procedures for the esterification reaction, which encompasses 28 distinct actions. The findings of this study are relevant to AI researchers and chemists, who could utilize various datasets and LLM fine-tuning techniques to create specific solutions for particular reaction classes. As the field advances, we anticipate an increase in the availability of services tailored for LLM tuning and creation, making these powerful tools more accessible to scientists without programming expertise. As a natural progression, our future research will focus on: (1) further probing into sequence-centric models, such as GPT transformers and other versions of T5 models, given their demonstrated efficacy. Examining the potential for these models to capture more complex reaction mechanisms and possibly predict multi-step synthesis procedures is essential; (2) augmenting our dataset to encompass a broader range of organic reactions and potentially multiple classes of reactions. Such tests would measure the models' generalizability across various types of chemical reactions; and (3) validating our models in real-world laboratory settings to discern their practical utility and reliability. Through these empirical trials, we hope to refine the models based on real-world feedback.

Author Contributions

Conceptualization, J.K.-D. and M.V.; software, M.V.; writing—original draft preparation, M.V.; writing—review and editing, J.K.-D. and L.Š. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets and models used in this paper can be found at: https://github.com/Mantas-it/LLM_Esterification (accessed on 9 November 2023).

Conflicts of Interest

Authors Mantas Vaškevičius and Liudas Šlepikas were employed by the company JSC SynHet. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Table A1 presents the outcomes of all experiments conducted using various methodologies.
Table A1. Full table of results.

Approach | Data Size | Epochs | BLEU
gpt-3.5-turbo fine-tuning | 50 | 1 | 8.9805
gpt-3.5-turbo fine-tuning | 50 | 2 | 28.0124
gpt-3.5-turbo fine-tuning | 50 | 3 | 41.5366
gpt-3.5-turbo fine-tuning | 50 | 4 | 40.5646
gpt-3.5-turbo fine-tuning | 50 | 5 | 39.7529
gpt-3.5-turbo fine-tuning | 50 | 6 | 41.5301
gpt-3.5-turbo fine-tuning | 50 | 7 | 39.0743
gpt-3.5-turbo fine-tuning | 50 | 9 | 36.7705
gpt-3.5-turbo fine-tuning | 150 | 3 | 34.5218
gpt-3.5-turbo fine-tuning | 150 | 6 | 42.814
gpt-3.5-turbo fine-tuning | 500 | 3 | 44.5179
gpt-3.5-turbo fine-tuning | 500 | 6 | 45.6222
gpt-3.5-turbo fine-tuning | 1000 | 3 | 47.7161
gpt-3.5-turbo fine-tuning | 1000 | 6 | 47.9855
davinci-002 fine-tuning | 50 | 3 | 33.7223
davinci-002 fine-tuning | 50 | 6 | 40.3091
davinci-002 fine-tuning | 150 | 3 | 38.7308
davinci-002 fine-tuning | 150 | 6 | 40.2609
davinci-002 fine-tuning | 500 | 3 | 47.3876
davinci-002 fine-tuning | 500 | 6 | 44.6208
davinci-002 fine-tuning | 1000 | 3 | 46.9853
davinci-002 fine-tuning | 1000 | 6 | 47.6892
babbage-002 fine-tuning | 50 | 3 | 31.2137
babbage-002 fine-tuning | 50 | 6 | 35.2184
babbage-002 fine-tuning | 150 | 3 | 32.4681
babbage-002 fine-tuning | 150 | 6 | 38.0918
babbage-002 fine-tuning | 500 | 3 | 40.8444
babbage-002 fine-tuning | 500 | 6 | 41.4988
babbage-002 fine-tuning | 1000 | 3 | 42.5226
babbage-002 fine-tuning | 1000 | 6 | 42.6769
FLAN-T5 fine-tuning | 50 | 6 | 39.3357
FLAN-T5 fine-tuning | 150 | 8 | 40.3078
FLAN-T5 fine-tuning | 500 | 36 | 49.7352
FLAN-T5 fine-tuning | 1000 | 67 | 51.8222
kNN (Levenshtein) | 50 | - | 37.6851
kNN (Levenshtein) | 150 | - | 40.9809
kNN (Levenshtein) | 500 | - | 44.5811
kNN (Levenshtein) | 1000 | - | 46.6903

References

  1. Khan, Z.; Javed, F.; Shamair, Z.; Hafeez, A.; Fazal, T.; Aslam, A.; Zimmerman, W.B.; Rehman, F. Current Developments in Esterification Reaction: A Review on Process and Parameters. J. Ind. Eng. Chem. 2021, 103, 80–101. [Google Scholar] [CrossRef]
  2. Turhanen, P.A.; Leppänen, J.; Vepsäläinen, J.J. Green and Efficient Esterification Method Using Dried Dowex H+/NaI Approach. ACS Omega 2019, 4, 8974–8984. [Google Scholar] [CrossRef] [PubMed]
  3. Yadav, G.D.; Mujeebur Rahuman, M.S.M. Synthesis of Fragrance and Flavour Grade Esters: Activities of Different Ion Exchange Resins and Kinetic Studies. Clean Technol. Environ. Policy 2003, 5, 128–135. [Google Scholar] [CrossRef]
  4. Yan, S.; Tong, T.; Li, Y.; Khan, S.U.; Zhao, J.; Wang, S.; Wang, X. Production of Biodiesel Through Esterification Reaction Using Choline Exchanging Polytungstoboronic Acids as Temperature-Responsive Catalysts. Catal. Surv. Asia 2017, 21, 151–159. [Google Scholar] [CrossRef]
  5. de Nazaré de Oliveira, A.; Ferreira, I.M.; Jimenez, D.E.Q.; Neves, F.B.; Soares da Silva, L.; Farias da Costa, A.A.; Lima, E.T.L.; de Oliveira Pires, L.H.; Ferreira da Costa, C.E.; Narciso da Rocha Filho, G.; et al. An Efficient Catalyst Prepared from Residual Kaolin for the Esterification of Distillate from the Deodorization of Palm Oil. Catalysts 2021, 11, 604. [Google Scholar] [CrossRef]
  6. Mater, A.C.; Coote, M.L. Deep Learning in Chemistry. J. Chem. Inf. Model. 2019, 59, 2545–2559. [Google Scholar] [CrossRef] [PubMed]
  7. Shilpa, S.; Kashyap, G.; Sunoj, R.B. Recent Applications of Machine Learning in Molecular Property and Chemical Reaction Outcome Predictions. J. Phys. Chem. A 2023, 127, 8253–8271. [Google Scholar] [CrossRef]
  8. Singh, S.; Sunoj, R.B. Molecular Machine Learning for Chemical Catalysis: Prospects and Challenges. Acc. Chem. Res. 2023, 56, 402–412. [Google Scholar] [CrossRef]
  9. Grisoni, F. Chemical Language Models for de Novo Drug Design: Challenges and Opportunities. Curr. Opin. Struct. Biol. 2023, 79, 102527. [Google Scholar] [CrossRef]
  10. Schwaller, P.; Gaudin, T.; Lányi, D.; Bekas, C.; Laino, T. “Found in Translation”: Predicting Outcomes of Complex Organic Chemistry Reactions Using Neural Sequence-to-Sequence Models. Chem. Sci. 2018, 9, 6091–6098. [Google Scholar] [CrossRef]
  11. Jablonka, K.M.; Ai, Q.; Al-Feghali, A.; Badhwar, S.; Bocarsly, J.D.; Bran, A.M.; Bringuier, S.; Brinson, L.C.; Choudhary, K.; Circi, D.; et al. 14 Examples of How LLMs Can Transform Materials Science and Chemistry: A Reflection on a Large Language Model Hackathon. Digit. Discov. 2023, 2, 1233–1250. [Google Scholar] [CrossRef] [PubMed]
  12. Zeng, Z.; Cui, L.; Xue, W.; Chen, J.; Che, Y. Recent Developments on the Mechanism and Kinetics of Esterification Reaction Promoted by Various Catalysts. Chem. Kinet. 2012, 2, 255–282. [Google Scholar] [CrossRef]
  13. Forbes, D. Brønsted Acidic Ionic Liquids: The Dependence on Water of the Fischer Esterification of Acetic Acid and Ethanol. J. Mol. Catal. A Chem. 2004, 214, 129–132. [Google Scholar] [CrossRef]
  14. Mandle, R.J.; Goodby, J.W. Progression from Nano to Macro Science in Soft Matter Systems: Dimers to Trimers and Oligomers in Twist-Bend Liquid Crystals. RSC Adv. 2016, 6, 34885–34893. [Google Scholar] [CrossRef]
  15. But, T.Y.S.; Toy, P.H. The Mitsunobu Reaction: Origin, Mechanism, Improvements, and Applications. Chem. Asian J. 2007, 2, 1340–1355. [Google Scholar] [CrossRef]
  16. Riechert, O.; Husham, M.; Sadowski, G.; Zeiner, T. Solvent Effects on Esterification Equilibria. AIChE J. 2015, 61, 3000–3011. [Google Scholar] [CrossRef]
  17. Camp, D.; Harvey, P.J.; Jenkins, I.D. The Effect of Solvent Polarity on the Rate of the Mitsunobu Esterification Reaction. Tetrahedron 2015, 71, 3932–3938. [Google Scholar] [CrossRef]
  18. Taylor, C.J.; Pomberger, A.; Felton, K.C.; Grainger, R.; Barecka, M.; Chamberlain, T.W.; Bourne, R.A.; Johnson, C.N.; Lapkin, A.A. A Brief Introduction to Chemical Reaction Optimization. Chem. Rev. 2023, 123, 3089–3126. [Google Scholar] [CrossRef]
  19. Schneider, G.; Fechner, U. Computer-Based de Novo Design of Drug-like Molecules. Nat. Rev. Drug Discov. 2005, 4, 649–663. [Google Scholar] [CrossRef]
  20. Vaucher, A.C.; Schwaller, P.; Geluykens, J.; Nair, V.H.; Iuliano, A.; Laino, T. Inferring Experimental Procedures from Text-Based Representations of Chemical Reactions. Nat. Commun. 2021, 12, 2573. [Google Scholar] [CrossRef]
  21. He, C.; Zhang, C.; Bian, T.; Jiao, K.; Su, W.; Wu, K.-J.; Su, A. A Review on Artificial Intelligence Enabled Design, Synthesis, and Process Optimization of Chemical Products for Industry 4.0. Processes 2023, 11, 330. [Google Scholar] [CrossRef]
  22. Fabian, B.; Edlich, T.; Gaspar, H.; Segler, M.; Meyers, J.; Fiscato, M.; Ahmed, M. Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks. arXiv 2020, arXiv:2011.13230. [Google Scholar] [CrossRef]
  23. Frey, N.C.; Soklaski, R.; Axelrod, S.; Samsi, S.; Gómez-Bombarelli, R.; Coley, C.W.; Gadepally, V. Neural Scaling of Deep Chemical Models. Nat. Mach. Intell. 2023, 5, 1297–1305. [Google Scholar] [CrossRef]
  24. Lu, J.; Zhang, Y. Unified Deep Learning Model for Multitask Reaction Predictions with Explanation. J. Chem. Inf. Model. 2022, 62, 1376–1387. [Google Scholar] [CrossRef]
  25. Ross, J.; Belgodere, B.; Chenthamarakshan, V.; Padhi, I.; Mroueh, Y.; Das, P. Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. Nat. Mach. Intell. 2022, 4, 1256–1264. [Google Scholar] [CrossRef]
  26. Irwin, R.; Dimitriadis, S.; He, J.; Bjerrum, E.J. Chemformer: A Pre-Trained Transformer for Computational Chemistry. Mach. Learn. Sci. Technol. 2022, 3, 015022. [Google Scholar] [CrossRef]
  27. Chilingaryan, G.; Tamoyan, H.; Tevosyan, A.; Babayan, N.; Khondkaryan, L.; Hambardzumyan, K.; Navoyan, Z.; Khachatrian, H.; Aghajanyan, A. BARTSmiles: Generative Masked Language Models for Molecular Representations. arXiv 2022, arXiv:2211.16349. [Google Scholar] [CrossRef]
  28. Sterling, T.; Irwin, J.J. ZINC 15—Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324–2337. [Google Scholar] [CrossRef]
  29. Wang, Y.; Pang, C.; Wang, Y.; Jin, J.; Zhang, J.; Zeng, X.; Su, R.; Zou, Q.; Wei, L. Retrosynthesis Prediction with an Interpretable Deep-Learning Framework Based on Molecular Assembly Tasks. Nat. Commun. 2023, 14, 6155. [Google Scholar] [CrossRef]
  30. Taylor, R.; Kardas, M.; Cucurull, G.; Scialom, T.; Hartshorn, A.; Saravia, E.; Poulton, A.; Kerkez, V.; Stojnic, R. Galactica: A Large Language Model for Science. arXiv 2022, arXiv:2211.09085. [Google Scholar] [CrossRef]
  31. Livne, M.; Miftahutdinov, Z.; Tutubalina, E.; Kuznetsov, M.; Polykovskiy, D.; Brundyn, A.; Jhunjhunwala, A.; Costa, A.; Aliper, A.; Zhavoronkov, A. Nach0: Multimodal Natural and Chemical Languages Foundation Model. arXiv 2023, arXiv:2311.12410. [Google Scholar] [CrossRef]
  32. Guo, T.; Guo, K.; Nan, B.; Liang, Z.; Guo, Z.; Chawla, N.V.; Wiest, O.; Zhang, X. What Can Large Language Models Do in Chemistry? A Comprehensive Benchmark on Eight Tasks. arXiv 2023, arXiv:2305.18365. [Google Scholar] [CrossRef]
  33. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar] [CrossRef]
  34. Zhang, W.; Wang, Q.; Kong, X.; Xiong, J.; Ni, S.; Cao, D.; Niu, B.; Chen, M.; Zhang, R.; Wang, Y.; et al. Fine-Tuning ChatGPT Achieves State-of-the-Art Performance for Chemical Text Mining. ChemRxiv 2023. [Google Scholar] [CrossRef]
  35. White, A.D.; Hocky, G.M.; Gandhi, H.A.; Ansari, M.; Cox, S.; Wellawatte, G.P.; Sasmal, S.; Yang, Z.; Liu, K.; Singh, Y.; et al. Assessment of Chemistry Knowledge in Large Language Models That Generate Code. Digit. Discov. 2023, 2, 368–376. [Google Scholar] [CrossRef]
  36. Jablonka, K.M.; Schwaller, P.; Ortega-Guerrero, A.; Smit, B. Is GPT-3 All You Need for Low-Data Discovery in Chemistry? ChemRxiv 2023. [Google Scholar] [CrossRef]
  37. Bran, A.M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A.D.; Schwaller, P. ChemCrow: Augmenting Large-Language Models with Chemistry Tools. arXiv 2023, arXiv:2304.05376. [Google Scholar] [CrossRef]
  38. Boiko, D.A.; MacKnight, R.; Gomes, G. Emergent Autonomous Scientific Research Capabilities of Large Language Models. arXiv 2023, arXiv:2304.05332. [Google Scholar] [CrossRef]
  39. Vaškevičius, M.; Kapočiūtė-Dzikienė, J.; Vaškevičius, A.; Šlepikas, L. Deep Learning-Based Automatic Action Extraction from Structured Chemical Synthesis Procedures. PeerJ Comput. Sci. 2023, 9, e1511. [Google Scholar] [CrossRef]
  40. Schneider, N.; Stiefl, N.; Landrum, G.A. What’s What: The (Nearly) Definitive Guide to Reaction Role Assignment. J. Chem. Inf. Model. 2016, 56, 2336–2346. [Google Scholar] [CrossRef]
  41. Jin, W.; Coley, C.W.; Barzilay, R.; Jaakkola, T. Predicting Organic Reaction Outcomes with Weisfeiler-Lehman Network. arXiv 2017, arXiv:1709.04555. [Google Scholar] [CrossRef]
  42. Sander, T.; Freyss, J.; von Korff, M.; Rufener, C. DataWarrior: An Open-Source Program for Chemistry Aware Data Visualization and Analysis. J. Chem. Inf. Model. 2015, 55, 460–473. [Google Scholar] [CrossRef]
  43. Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36. [Google Scholar] [CrossRef]
  44. Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. arXiv 2022, arXiv:2210.11416. [Google Scholar] [CrossRef]
  45. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv 2019, arXiv:1910.10683. [Google Scholar] [CrossRef]
  46. Kudo, T.; Richardson, J. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. arXiv 2018, arXiv:1808.06226. [Google Scholar] [CrossRef]
  47. Allen, C.; Hospedales, T. Analogies Explained: Towards Understanding Word Embeddings. arXiv 2019, arXiv:1901.09813. [Google Scholar] [CrossRef]
  48. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-Art Natural Language Processing. arXiv 2019, arXiv:1910.03771. [Google Scholar] [CrossRef]
  49. Mastropaolo, A.; Scalabrino, S.; Cooper, N.; Nader Palacio, D.; Poshyvanyk, D.; Oliveto, R.; Bavota, G. Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks. In Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Madrid, Spain, 22–30 May 2021. [Google Scholar] [CrossRef]
  50. Rodriguez-Torrealba, R.; Garcia-Lopez, E.; Garcia-Cabot, A. End-to-End Generation of Multiple-Choice Questions Using Text-to-Text Transfer Transformer Models. Expert Syst. Appl. 2022, 208, 118258. [Google Scholar] [CrossRef]
  51. Zhou, W.; Lee, D.-H.; Selvam, R.K.; Lee, S.; Lin, B.Y.; Ren, X. Pre-Training Text-to-Text Transformers for Concept-Centric Common Sense. arXiv 2020, arXiv:2011.07956. [Google Scholar] [CrossRef]
  52. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar] [CrossRef]
  53. Wang, M.; Xie, P.; Du, Y.; Hu, X. T5-Based Model for Abstractive Summarization: A Semi-Supervised Learning Approach with Consistency Loss Functions. Appl. Sci. 2023, 13, 7111. [Google Scholar] [CrossRef]
  54. Öztürk, H.; Ozkirimli, E.; Özgür, A. A Comparative Study of SMILES-Based Compound Similarity Functions for Drug-Target Interaction Prediction. BMC Bioinform. 2016, 17, 128. [Google Scholar] [CrossRef] [PubMed]
  55. Jabeen, F.; Rehman, Z.U.; Shah, S.; Alharthy, R.D.; Jalil, S.; Khan, I.A.; Iqbal, J.; El-Latif, A.A.A. Deep Learning-Based Prediction of Inhibitors Interaction with Butyrylcholinesterase for the Treatment of Alzheimer’s Disease. Comput. Electr. Eng. 2023, 105, 108475. [Google Scholar] [CrossRef]
  56. Öztürk, H.; Özgür, A.; Schwaller, P.; Laino, T.; Ozkirimli, E. Exploring Chemical Space Using Natural Language Processing Methodologies for Drug Discovery. Drug Discov. Today 2020, 25, 689–705. [Google Scholar] [CrossRef] [PubMed]
  57. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Müller, A.; Nothman, J.; Louppe, G.; et al. Scikit-Learn: Machine Learning in Python. arXiv 2012, arXiv:1201.0490. [Google Scholar] [CrossRef]
  58. Soucy, P.; Mineau, G.W. A Simple KNN Algorithm for Text Categorization. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001. [Google Scholar] [CrossRef]
  59. De Boom, C.; Van Canneyt, S.; Bohez, S.; Demeester, T.; Dhoedt, B. Learning Semantic Similarity for Very Short Texts. In Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic City, NJ, USA, 14–17 November 2015. [Google Scholar] [CrossRef]
  60. Po, D.K. Similarity Based Information Retrieval Using Levenshtein Distance Algorithm. Int. J. Adv. Sci. Res. Eng. 2020, 6, 06–10. [Google Scholar] [CrossRef]
  61. Gunawan, D.; Sembiring, C.A.; Budiman, M.A. The Implementation of Cosine Similarity to Calculate Text Relevance between Two Documents. J. Phys. Conf. Ser. 2018, 978, 012120. [Google Scholar] [CrossRef]
  62. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics—ACL ‘02, Philadelphia, PA, USA, 7–12 July 2002. [Google Scholar] [CrossRef]
  63. Song, Y.; Wang, T.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A Comprehensive Survey of Few-Shot Learning: Evolution, Applications, Challenges, and Opportunities. ACM Comput. Surv. 2023, 55, 1–40. [Google Scholar] [CrossRef]
  64. Qiao, H.; Wu, Y.; Zhang, Y.; Zhang, C.; Wu, X.; Wu, Z.; Zhao, Q.; Wang, X.; Li, H.; Duan, H. Transformer-Based Multitask Learning for Reaction Prediction under Low-Resource Circumstances. RSC Adv. 2022, 12, 32020–32026. [Google Scholar] [CrossRef]
Figure 1. BLEU scores of the testing dataset.
Table 1. An excerpt from the dataset. Two examples of input and output pairs.

Input: CC(=O)O.OC1(COC2CCCCO2)CC1>>CC(=O)OC1(COC2CCCCO2)CC1
Output: MAKESOLUTION with $R2$ and $R1$ and DCM; ADD SLN; ADD DMAP at 25 C; STIR for 30 min; ADD DCC at 0 C; QUENCH with water; WAIT for 4 h; FILTER keep filtrate; CONCENTRATE; PURIFY; YIELD $P1$.

Input: COc1ccnc(C(=O)O)c1O.CCO>>CCOC(=O)c1nccc(OC)c1O
Output: MAKESOLUTION with $R1$ and $R2$; ADD SLN; ADD H2SO4; REFLUX for 3 d 3 h; CONCENTRATE; PH with water to pH neutral; EXTRACT with chloroform; DRYSOLUTION; CONCENTRATE; YIELD $P1$.
Table 2. Statistics of the subsets.

Dataset Name | Average Input Length (Characters) | Average Number of Actions
Training-50 | 82.24 | 11.42
Training-150 | 83.17 | 11.91
Training-500 | 84.92 | 12.34
Training-1000 | 87.09 | 12.26
Validation | 85.83 | 12.07
Testing | 92.38 | 11.68
Table 3. Three examples of input and predicted output pairs.

Technically correct
True: MAKESOLUTION with $R2$ and $R1$ and HOBT and DCC and THF; ADD SLN; STIR for 16 h at 25 C; CONCENTRATE; PARTITION with Ethyl acetate and K2CO3; COLLECTLAYER organic; DRYSOLUTION over Na2SO4; FILTER keep filtrate; CONCENTRATE; PURIFY; YIELD $P1$.
Prediction: MAKESOLUTION with $R2$ and DCC and HOBT and $R1$ and THF; ADD SLN; STIR for 16 h at 25 C; CONCENTRATE; PARTITION with Ethyl acetate and K2CO3; COLLECTLAYER organic; DRYSOLUTION over Na2SO4; FILTER keep filtrate; CONCENTRATE; PURIFY; YIELD $P1$.

Partially correct
True: MAKESOLUTION with $R1$ and $R2$ and H2SO4; ADD SLN; STIR for 24 h at 75 C; CONCENTRATE; ADD Ethyl acetate; ADD NaHCO3; ADD water; DRYSOLUTION over MgSO4; FILTER keep filtrate; CONCENTRATE; DRYSOLID; YIELD $P1$.
Prediction: MAKESOLUTION with $R1$ and $R2$; ADD SLN; ADD H2SO4; STIR for 16 h at 85 C; CONCENTRATE; PARTITION with Ethyl acetate and NaHCO3; COLLECTLAYER organic; WASH with brine; DRYSOLUTION over MgSO4; FILTER keep filtrate; CONCENTRATE; YIELD $P1$.

Incorrect
True: ADD $R2$; ADD $R1$; ADD H2SO4 at 0 C; STIR for 30 min at 80 C; ADD water; FILTER keep precipitate; DRYSOLID under vacuum; YIELD $P1$.
Prediction: ADD $R2$; ADD $R1$; ADD H2SO4; REFLUX for 6 d; CONCENTRATE; ADD NaHCO3; EXTRACT with Ethyl acetate 3 x; COLLECTLAYER organic; WASH with brine; DRYSOLUTION over MgSO4; FILTER keep filtrate; CONCENTRATE; YIELD $P1$.
