Article

Discovering Photoswitchable Molecules for Drug Delivery with Large Language Models and Chemist Instruction Training

by Junjie Hu 1,†, Peng Wu 2,†, Yulin Li 3, Qi Li 1, Shiyi Wang 1, Yang Liu 4, Kun Qian 5,* and Guang Yang 1,6,7,8,*

1 Bioengineering Department and Imperial-X, Imperial College London, London W12 7SL, UK
2 School of Chemistry and Chemical Engineering, Ningxia University, Yinchuan 750014, China
3 Department of Mathematics, The Chinese University of Hong Kong, Shatin, Hong Kong
4 Shanxi Bethune Hospital, Shanxi Academy of Medical Sciences, Third Hospital of Shanxi Medical University, Tongji Shanxi Hospital, Taiyuan 030032, China
5 Department of Information and Intelligence Development, Zhongshan Hospital, Fudan University, 180 Fenglin Road, Shanghai 200032, China
6 National Heart and Lung Institute, Imperial College London, London SW7 2AZ, UK
7 Cardiovascular Research Centre, Royal Brompton Hospital, London SW3 6NP, UK
8 School of Biomedical Engineering & Imaging Sciences, King’s College London, London WC2R 2LS, UK
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Pharmaceuticals 2024, 17(10), 1300; https://doi.org/10.3390/ph17101300
Submission received: 16 August 2024 / Revised: 23 September 2024 / Accepted: 27 September 2024 / Published: 30 September 2024
(This article belongs to the Section Pharmaceutical Technology)

Abstract
Background: As large language models continue to expand in size and diversity, their substantial potential and the relevance of their applications are increasingly being acknowledged. The rapid advancement of these models also holds profound implications for the long-term design of stimulus-responsive materials used in drug delivery. Methods: The models were built on Hugging Face’s Transformers package using the BigBird, Gemma, and GPT NeoX architectures. Pre-training used the PubChem dataset, and fine-tuning used QM7b. Chemist instruction training was based on Direct Preference Optimization. Drug-likeness, synthetic accessibility, and PageRank scores were used to filter molecules. All computational chemistry simulations were performed using ORCA with time-dependent density-functional theory. Results: To optimize large models for extensive dataset processing and for comprehensive learning akin to a chemist’s intuition, the integration of deeper chemical insights is imperative. Our study first compared the performance of BigBird, Gemma, GPT NeoX, and others, focusing specifically on the design of photoresponsive drug delivery molecules. We gathered excitation energy data through computational chemistry tools and further investigated light-driven isomerization reactions as a critical mechanism in drug delivery. Additionally, we explored the effectiveness of incorporating human feedback into reinforcement learning to imbue large models with chemical intuition, enhancing their understanding of relationships involving -N=N- groups in the photoisomerization transitions of photoresponsive molecules. Conclusions: We implemented an efficient design process based on structural knowledge and data, driven by large language model technology, to obtain a candidate dataset of specific photoswitchable molecules. However, the lack of specialized domain datasets remains a challenge for maximizing model performance.

1. Introduction

Biocompatible materials sensitive to external physicochemical stimuli can be used in drug delivery systems [1,2,3]. Certain compounds can absorb light at specific wavelengths and subsequently modify their molecular structure, for example through conjugation, conformational changes, or isomerization [3,4,5,6,7]. Light-responsiveness is gaining prominence because it allows materials to react to harmless electromagnetic radiation, particularly in the UV, visible, and near-infrared spectra [8,9]. Specifically, molecules that undergo isomerization upon excitation hold promise as molecular switches for light-responsive drug delivery [10,11]. Ultraviolet or blue light can serve as a trigger for topical treatments on the skin or mucous membranes, while near-infrared light in the 650–900 nm range offers a broader scope of applications. Nobel Laureate Maria Goeppert-Mayer’s 1931 prediction of two-photon absorption laid the theoretical groundwork for contemporary near-infrared photoresponsive molecules [12]. Aromatic azo compounds bearing the -N=N- group are a class of molecules that can undergo photoresponsive isomerization [12,13].
The rapid progress of large language modeling has spurred extensive discussions among experts regarding its effectiveness in achieving Artificial General Intelligence (AGI) [14,15]. This conversation parallels earlier debates sparked by the success of DeepMind’s AlphaGo models, which explored the relationship between reinforcement learning and AGI [16]. As AI technology matures, scientific and biomedical research has seen a surge in productivity, with researchers increasingly acknowledging AI’s pivotal role in advancing their fields and enabling more in-depth investigations. Nonetheless, the outcomes of such research remain heavily influenced by domain-specific data. The convergence of drug delivery and research on photoresponsive molecular materials with emerging AI technologies offers significant potential and societal value, paving the way for new innovations and discoveries [17,18,19,20,21].
After the Transformer architecture and attention mechanism were proposed, GPT-2 further demonstrated the potential of Transformer models for content generation. Researchers have considered that one possible path to AGI is that intelligence will surge as model parameter size grows. More efficient training is likewise a key trend in model development. BigBird, Llama, GPT NeoX, and Gemma are representative models in the evolving GPT lineage [22,23,24,25]. The growing applicability of language models is also directly related to Reinforcement Learning with Human Feedback (RLHF), a method for aligning models with human preference judgments. Reinforcement learning algorithms like Proximal Policy Optimization (PPO) incorporate a reward model as a critical component [14]. The Direct Preference Optimization (DPO) algorithm instead uses the language model itself as an implicit reward model and has achieved excellent training results [26]. After experimenting with a drug delivery molecule design strategy based on GPT-2 [27], we continued this line of work by introducing new language models and an instruction training method to improve generative performance.
In this study, we employed BigBird, Gemma, and GPT NeoX to generate light-responsive drug delivery molecules. We conducted pre-training on PubChem data and fine-tuning on the QM7b dataset, which contains molecular excitation energies. The generated molecule data were visualized using t-distributed stochastic neighbor embedding (t-SNE) for distribution analysis. The PageRank algorithm guided the selection of molecules for refined quantum chemical simulations with the time-dependent density-functional theory (TDDFT) method. Based on the results for the selected molecules, we collated structural features of photoresponsive molecules from the literature and chemical intuition to construct a preference dataset. To enhance the chemical knowledge in our model, we implemented instruction training using the DPO algorithm and chemist feedback, increasing the number of qualifying molecular structures generated by GPT NeoX from around 132 to over 400 (Table 1). Our findings highlight the need for further integration of comprehensive chemistry knowledge into language models for designing ideal light-responsive drug delivery molecules.

2. Results

2.1. Delivery Large Language Model for Photoresponsive Molecule Discovery

Recently, advances in large language models (LLMs) have extended to biochemical problem solving. However, language models tailored to designing stimulus-responsive materials for drug delivery remain underexplored. In our previous work, we fine-tuned a pre-trained GPT-2 model on the QM7b dataset and simulated the first excitation energy of molecules using the TDDFT method. As shown in Figure 1, in this study we trained and fine-tuned additional large language models for the generation of photoresponsive molecules. To better explore the mechanisms of photoresponsive drug delivery, we used TDDFT to analyze the photocatalytic isomerization of molecules. We also converted chemical knowledge related to isomerization into a preference dataset and used it for instruction training. Additionally, graph networks and the PageRank algorithm were applied for the first time in our work to recommend molecular content.
Specifically, we pre-trained and fine-tuned BigBird, Gemma, and GPT NeoX. The results of their generation can be found in Figure 2. A total of 3905 candidate molecules were produced using these LLMs, with BigBird contributing 230, Gemma 1640, and GPT NeoX 2035 molecules, respectively. In the distribution plot of Figure 2, green dots represent molecules generated by BigBird, indigo crosses denote Gemma-generated molecules, and dark blue triangles signify molecules generated by GPT NeoX. The t-SNE algorithm was used here for the visualization of over 3000 molecules, where the molecular vector descriptors were based on Morgan fingerprints and physicochemical properties.
Molecules generated by the same model tend to exhibit greater similarity, with distinct disjoint segments observed between molecules generated by Gemma and GPT NeoX. Additionally, molecules produced by Gemma or GPT NeoX often cluster near those generated by BigBird.
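The fingerprint-and-similarity computation behind this analysis can be sketched in pure Python. As a stand-in for the Morgan fingerprints and physicochemical descriptors used in the study, the toy descriptor below simply counts SMILES symbols; it is an illustrative simplification, not the actual featurization.

```python
import re
from collections import Counter

def toy_descriptor(smiles, alphabet=("C", "N", "O", "S", "=", "#", "(", ")")):
    """Very simplified stand-in for a Morgan fingerprint: counts of selected
    atom and bond symbols in the SMILES string, as a fixed-length vector."""
    counts = Counter(re.findall(r"Cl|Br|[A-Za-z=#()]", smiles))
    return [counts.get(symbol, 0) for symbol in alphabet]

def tanimoto(a, b):
    """Tanimoto similarity generalized to count vectors (min/max form)."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 0.0
```

Vectors of this kind (in the study, real fingerprints plus physicochemical properties) are what t-SNE projects into the 2D distribution plot of Figure 2.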

2.2. Screening Molecules with QED, SA, and PageRank Score

In the de novo molecular design process, generative models must be complemented by property-based screening to arrive at recommended molecules. Before considering the photoresponsive behavior of drug delivery molecules, the QED and SA scores play a useful role in narrowing down the range of candidates.
We conducted a statistical analysis of the QED values of the molecules generated by the three models and depicted the results in Figure 3a. The figure shows that the overall distribution of values is more favorable in purple and indigo than in tangerine. GPT NeoX (purple) and Gemma (indigo) demonstrate similar QED performance, with GPT NeoX slightly higher in quantity while Gemma shows slightly better average values.
By combining the content of Figure 3c and Table 2, we can obtain the chemical structures of the top 10 molecules ranked by QED and their corresponding models. Out of the ten molecules chosen based on QED scores, BigBird contributed seven and Gemma contributed three. Higher QED scores generally indicate better drug-like properties.
Similarly, we analyzed the SA results. Under the SA score’s convention, lower scores indicate molecules that are easier to synthesize. Combined with the distribution graph in Figure 3b, it is evident that GPT NeoX significantly outperforms both BigBird and Gemma in this regard. Among the top ten ranked molecules in Figure 3d and Table 2, eight were contributed by GPT NeoX and two by BigBird.
To further illustrate the relationships among these molecules, we employed a knowledge graph approach. Each molecule is represented as a vector, and the similarity between every pair of molecules is computed to build the adjacency matrix of the knowledge graph network. The edge weights between nodes range from −1 to 1, with 1 indicating the highest similarity and −1 the lowest. The resulting molecule associations yield over 10 million matrix elements, making direct data interpretation challenging. We therefore applied the graph network-based PageRank algorithm to rank nodes comprehensively by their associations. In Figure 4a, we visualize the edge matrix of the graph network as a heatmap; in Figure 4b, we present the top 20 molecules from the PageRank analysis.
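The ranking step can be illustrated with a plain PageRank power iteration over such an adjacency matrix. This is a minimal sketch: it assumes negative similarity weights have already been clipped to zero (PageRank requires non-negative edges) and it simply skips dangling nodes, whereas the study used networkx on the full molecule graph.

```python
def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """PageRank by power iteration. adj[i][j] >= 0 is the edge weight
    from node i to node j; returns one importance score per node."""
    n = len(adj)
    out = [sum(row) for row in adj]      # total outgoing weight per node
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new = []
        for j in range(n):
            # Weighted rank flowing into node j from nodes with out-links.
            flow = sum(rank[i] * adj[i][j] / out[i] for i in range(n) if out[i] > 0)
            new.append((1 - damping) / n + damping * flow)
        if max(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank
```

On a fully symmetric similarity matrix every molecule receives the same score; asymmetries in the similarity structure are what produce the top-20 ranking shown in Figure 4b.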

2.3. First Excitation Energy and Photo-Isomerisation Mechanisms

To address the physicochemical-property and synthesis requirements of photoresponsive molecules for drug delivery systems, we incorporated the QED and SA scores to prioritize all generated molecules. To better reflect the application potential of molecules generated by the CLMs, we calculated their first excitation energies in gas-phase, water, and organic-solvent environments using TDDFT.
In Table 2, we present detailed computational chemical analysis and excitation energy calculations for the top 20 molecules ranked by PageRank and the top 10 molecules based on QED and SA scores. Out of all the recommended molecules, one molecule can be excited by visible light, while the excitation energies of the other molecules correspond to wavelengths in the UV spectrum. In particular, GPT NeoX-generated molecules like NC1=CSN=C=C1 exhibit excitation wavelengths near the near-infrared spectrum. This is the fifth-ranked molecule based on SA values. These molecules demonstrate an increased depth of transmission compared to others, suggesting greater potential for practical applications.
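The assignment of excitation energies to spectral regions follows directly from λ (nm) ≈ 1239.84 / E (eV). A small helper makes the screening logic explicit; the 650–900 nm near-infrared window matches the Introduction, while the 380 nm UV/visible boundary is a common convention assumed here, not a value from this study.

```python
PLANCK_EV_NM = 1239.84  # h*c expressed in eV·nm

def excitation_wavelength_nm(energy_ev):
    """Convert a vertical excitation energy (eV) to its photon wavelength (nm)."""
    return PLANCK_EV_NM / energy_ev

def spectral_region(energy_ev):
    """Rough spectral classification used when screening photoswitch candidates."""
    wl = excitation_wavelength_nm(energy_ev)
    if wl < 380:
        return "UV"
    if wl <= 650:
        return "visible"
    if wl <= 900:
        return "near-infrared"
    return "infrared"
```

For example, the 4.05 eV transition discussed below corresponds to roughly 306 nm, i.e., excitation in the UV.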
To delve deeper into the mechanisms underlying photoresponsive reactions, we conducted calculations on the photo-isomerisation of an azo compound in water, as azobenzene is a well-known molecule that exhibits photo-isomerisation between its cis and trans conformations. As illustrated in Figure 5, the azo compound (referred to as A) undergoes isomerisation between cis and trans conformations through rotation about the N=N bond.
Ground-state optimized geometries reveal that the cis isomer is higher in energy than the trans isomer by 0.56 eV. Additionally, TDDFT calculations yield transition energies of 4.05 eV and 4.56 eV for the S1 ← S0 transition in the cis and trans conformations, respectively. The calculated transition energy to the S2 state is around 6 eV along the isomerization pathway, indicating that the S2 state can be ruled out in the reaction.
The potential energy surface calculated for A is depicted in Figure 5. In the ground state, the barrier height for the cis to trans transition is measured at 1.32 eV, closely matching the highest points on the energy surface along this pathway (1.35 eV relative to the cis conformation). Conversely, in the first excited state, there exists essentially no energy barrier along the rotation pathway. Furthermore, no curve crossing between the S0 and S1 states is observed, suggesting that photo-isomerisation is less likely to occur irrespective of the excitation wavelength.

2.4. Instruction Training

Scientific discoveries and technological advances evolve through iterative trial-and-error processes rooted in scientific hypotheses and experimental validation [28]. We explored the mechanism of photocatalytic isomerisation through computational simulation experiments, beginning with the intuitive selection of generated molecules and the subsequent structural modification of one of them.
Based on the literature [12,13] and the TDDFT analysis of light-driven isomerization, we recognize that the energy difference between the S0 and S1 levels directly affects whether isomerization occurs. The inclusion of the -N=N- functional group remains a viable approach to serve as a switch for light-responsive molecules on drug delivery nanoparticles.
Screening the generated molecules against this criterion yielded only a few satisfying examples. We therefore adopted a less stringent condition, requiring two chemically bonded N atoms, at least one of which lies outside a ring. Under this relaxed criterion, we validated the transfer of chemical domain knowledge to the CLMs through instruction training.
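String-level checks of this kind can be sketched as follows. The strict pattern looks for an explicit -N=N- in the SMILES, and the relaxed one for any two directly bonded N atoms. Note that a substring test misses azo bonds written across ring-closure digits, so the study's actual screening presumably used a proper cheminformatics toolkit; the first candidate below is compound A from Appendix A.2, the others are hypothetical examples.

```python
import re

def has_azo_group(smiles):
    """Strict criterion: an explicit -N=N- double bond in the SMILES string."""
    return "N=N" in smiles

def has_adjacent_nitrogens(smiles):
    """Relaxed criterion: two directly bonded N atoms (single or double bond)."""
    return re.search(r"N=?N", smiles) is not None

candidates = ["NNCCN=NNC(C)C=C", "CN=NC1=CC=CC=C1", "CCO"]
azo_hits = [s for s in candidates if has_azo_group(s)]
relaxed_hits = [s for s in candidates if has_adjacent_nitrogens(s)]
```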
Additionally, we fine-tuned the GPT NeoX model on the preference datasets with the DPO algorithm, incorporating the Low-Rank Adaptation (LoRA) method [29]. In this process, certain layers were frozen (shown in grey in Figure 6), with the LoRA layers focused primarily on the attention mechanism module for training. After convergence under the DPO algorithm, 439 molecules generated by this model met the relaxed chemical criterion (Table 1). This comparison validates the effectiveness of chemist instruction training.
Further enhancements in molecule generation also require specialized chemical knowledge. RLHF algorithms such as DPO provide a framework for incorporating diverse knowledge into language models. Our DPO-based chemist instruction training notably increased the number of molecules satisfying the criterion.

3. Discussion

Open-source efforts on large language models for text have laid a crucial foundation for drug delivery applications. However, developing practical light-responsive molecular materials still demands sustained interdisciplinary research. Among the three models utilized in our study, GPT NeoX demonstrates outstanding performance in generating light-responsive drug delivery molecules. We also considered more refined photocatalytic isomerisation mechanisms to assess the molecular generation effect and to provide an important reference for subsequent model design.
Furthermore, we identified the need for a recommendation algorithm aligned with drug delivery design metrics to enhance the overall intelligent design process.
Research on light-responsive drug delivery molecular materials driven by GPT technology has greatly improved the efficiency of discovering potential molecules. However, these candidates require further analysis and validation before they can be used to prepare nanocarriers for drug delivery systems. Subsequent steps include designing the molecular synthesis route, preparing samples, and experimentally determining the photochemical properties of the molecules in solution and in crystal form. Deep learning and computational chemistry tools can also enhance the ability of experienced synthetic chemists to find synthesis routes. Ultimately, the molecules designed by GPT will also be experimentally validated in intelligent drug delivery systems.
The traditional trial-and-error model, combined with data- and knowledge-driven LLM technology, can significantly improve R&D efficiency, representing an enhancement of the fourth-generation paradigm for new material discovery [30].

4. Materials and Methods

Large Language Models: GPT-2 [31] is a causal language model (CLM) that we used in our previous research on photoresponsive drug delivery molecules [27]. CLMs such as BigBird, Gemma, and GPT NeoX exhibited superior performance and were employed in this study. A CLM predicts the next token in a sequence of tokens. Here, the Byte-Pair Encoding tokenizer for SMILES data was trained on the PubChem datasets [32]. We initially employed a vocabulary of 72 characters from the SMILES alphabet, resulting in a tokenizer size of 1072. For optimization, we used AdamW with cosine annealing for learning-rate scheduling, setting the initial learning rate to 5 × 10⁻⁴ and the final learning rate to 5 × 10⁻⁸. The BigBird, Gemma, and GPT NeoX models were all implemented using the Transformers package from Hugging Face, with default configuration parameters. The pre-trained generative models were then fine-tuned on the QM7b datasets, with the SMILES data provided by Prof. Alexandre Tkatchenko’s group [33,34].
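The tokenization step can be illustrated with a common regex-based SMILES splitter. This is only a stand-in for the trained Byte-Pair Encoding tokenizer described above, intended to show how SMILES strings decompose into tokens for a causal language model.

```python
import re

# Splits a SMILES string into chemically meaningful units: bracket atoms,
# two-letter elements, bonds, branches, ring-closure labels, and single atoms.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@?|=|#|\(|\)|\.|/|\\|\+|-|%\d{2}|\d|[A-Za-z])"
)

def tokenize_smiles(smiles):
    """Split a SMILES string into tokens for a causal language model."""
    return SMILES_TOKEN.findall(smiles)
```

A CLM is then trained to predict each token from the ones preceding it, so that sampling from the model emits novel SMILES strings token by token.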
Instruction Training: We utilized reinforcement learning with human feedback to refine the language models, employing the transformer reinforcement learning package. The preference datasets used here are structured as dictionaries containing prompts, chosen responses, and rejected responses. In the training process, we employed direct preference optimization (DPO) with chemical-knowledge-based feedback.
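For a single preference pair, the DPO objective reduces to a logistic loss on the difference between the policy's and the reference model's log-probability ratios of the chosen and rejected responses. A minimal sketch, assuming the summed token log-probabilities have already been computed by the language model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    logp_* are summed token log-probabilities of each response under the
    policy being trained; ref_logp_* are the same under the frozen reference
    model. Lower loss means the policy prefers the chosen response more
    strongly than the reference model does.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; raising the likelihood of chosen (here, azo-containing) responses relative to rejected ones drives the loss down, which is the sense in which the language model acts as its own implicit reward model.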
Drug Likeness Score: Drug likeness is a consideration when evaluating generative molecules for photoresponsive drug delivery. We utilized the quantitative estimate of drug-likeness (QED) as a metric, as introduced in the referenced work [35,36]. The QED metric yields a numerical score within the range of 0 to 1, where elevated scores correspond to an increased probability of drug likeness.
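Under the hood, QED combines several per-property desirability functions d_i ∈ (0, 1] by a weighted geometric mean; that aggregation step can be sketched as follows. The fitted desirability curves themselves come from the cited work and from toolkits such as RDKit, so the input values here are placeholders.

```python
import math

def qed_like(desirabilities, weights=None):
    """QED-style aggregation: weighted geometric mean of per-property
    desirability values d_i in (0, 1], yielding a score in (0, 1]."""
    if weights is None:
        weights = [1.0] * len(desirabilities)
    s = sum(w * math.log(d) for w, d in zip(weights, desirabilities))
    return math.exp(s / sum(weights))
```

The geometric mean ensures that one very poor property (d_i near 0) drags the whole score down, which is why higher QED values indicate more uniformly drug-like molecules.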
Synthetic Accessibility Score: The SA score [36,37], used to assess the ease of synthesizing drug-like molecules, rates molecules from 1 to 10 based on historical synthetic data and molecular complexity. Fragment contributions and a complexity penalty, derived from PubChem’s extensive molecule database, form the foundation of this approach. Validation against estimations by expert chemists shows strong agreement (r² = 0.89). This method leverages big data to streamline and improve synthesis evaluation in molecular design.
PageRank of Knowledge Graph Network: We employed networkx to construct a knowledge graph network for the generated molecular information, then used PageRank for scoring and molecular recommendation. The molecular features primarily encompass molecular fingerprints, drug-likeness characteristics, and synthetic accessibility (SA) scores. Additionally, we preprocessed these features using principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) to build the adjacency matrix of the graph network.
Computational Methods: All calculations were conducted using ORCA (version 5.0.4) [38]. Ground-state geometries were optimized with the PBE0 functional [39] and the 6-311G* basis set, both in the gas phase and with the CPCM model [40,41,42] for solvation in water and chloroform. This method has previously been validated for its ability to reproduce experimental findings accurately [43]. For excited-state calculations, time-dependent density-functional theory (TDDFT) was employed with the PBE0 functional [39] and the TZVP basis set [44]. Vertical excitation energies were computed at each point on the ground-state potential energy surfaces, yielding the excited-state potential along the isomerization pathway.
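For concreteness, a TDDFT input of the kind described might look as follows in ORCA 5 syntax. This is an illustrative sketch assembled from the stated settings (PBE0, 6-311G*, CPCM water solvation); the number of excited states and the coordinate file name are assumptions, not values from the study.

```text
! PBE0 6-311G* TightSCF CPCM(Water)

%tddft
  NRoots 5       # number of excited states to solve for (assumed value)
  TDA    false   # full TDDFT rather than the Tamm-Dancoff approximation
end

* xyzfile 0 1 cis_azo.xyz
```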

5. Conclusions

Our work provides a novel data-centric workflow in which advanced LLMs, such as BigBird, Gemma, and GPT NeoX, are combined with first-principles computational simulations. Our goal is to expedite the scientific discovery of sophisticated light-responsive drug delivery molecules. Among the selected molecules, the more tissue-friendly NIR-responsive molecules occupy only a small percentage. In future work, these limitations can be addressed by collecting training data from existing studies or by designing models that learn strategies to reduce excitation energy from existing data.
Our computational simulation experiments reveal differences in the structural stability of model-generated molecules compared to reported molecules, highlighting an additional limitation of the models discussed in our study. These models employ tokenizers that consider atomic and bonding features of chemical structures, limiting the inclusion of more physicochemical information in the generated results. One potential solution to address this limitation is to incorporate first-principles-based features for the essential functional groups of photoresponsive drug delivery molecules and to develop new language models with innovative attention mechanisms inspired by quantum computing to align with these features.

Author Contributions

Conceptualization, J.H. and G.Y.; Methodology, J.H., P.W., Y.L. (Yulin Li), Y.L. (Yang Liu), Q.L., S.W., K.Q. and G.Y.; Software, J.H., Y.L. (Yulin Li) and P.W.; Validation, J.H.; Formal analysis, J.H. and P.W.; Investigation, J.H.; Data curation, J.H. and P.W.; Writing—original draft, J.H.; Writing—review and editing, J.H. and G.Y.; Visualization, J.H.; Project administration, J.H., K.Q. and G.Y.; Funding acquisition, G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported in part by the ERC IMI (101005122), the H2020 (952172), the MRC (MC/PC/21013), the Royal Society (IECNSFC211235), the NVIDIA Academic Hardware Grant Program, the SABER project supported by Boehringer Ingelheim Ltd., NIHR Imperial Biomedical Research Centre (RDA01), Wellcome Leap Dynamic Resilience, UKRI guarantee funding for Horizon Europe MSCA Postdoctoral Fellowships (EP/Z002206/1), and the UKRI Future Leaders Fellowship (MR/V023799/1).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All files for the LLMs and RLHF can be obtained via our link (accessed on 26 September 2024): https://github.com/jhu22/Pharmaceuticals2024. The data and code can also be obtained from the authors by email.

Acknowledgments

Many thanks to Lulu Cai for the help provided during the manuscript review stage and for the meaningful discussions on the potential of large models and computational simulation methods in the field of drug delivery (https://faculty.uestc.edu.cn/cailulu/en/index.htm, accessed on 26 September 2024). Many thanks to Alexandre Tkatchenko and Leonardo Medrano Sandonas for their important help in understanding and using QM7b (http://quantum-machine.org).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Details of LLMs

In this subsection, we provide the algorithms for the attention mechanisms employed in the LLMs, with schematic references to the original literature.
The pre-training of the causal language models was conducted using the PubChem dataset. Additionally, the Byte-Pair Encoding Tokenizer employed here encodes the atomic and bond information from the SMILES representation of a molecule byte-by-byte, which transforms the SMILES data into input for our three pre-trained language models: BigBird, Gemma, and GPT NeoX.
The LLMs based on BigBird utilize an encoder–decoder architecture with sparse attention, an enhanced attention mechanism derived from BERT. The attention mechanism in BigBird combines random attention, window attention, and global attention (Figure A2). Figure A1a illustrates the details of BigBird, showing the encoder, which comprises the attention mechanism layer and the following three layers. The decoder consists of the last two layers, which include the Masked Language Model heads.
In contrast, both Gemma and GPT NeoX employ a decoder-only architecture. Gemma utilizes a Multi-Query attention mechanism, illustrated in Figure A3, which streamlines the Key-Query-Value computation compared to Multi-Head and Grouped Query approaches. Gemma also employs rotary embedding [24] (Figure A5). GPT NeoX utilizes flash attention. Figure A4 demonstrates the storage strategy for Query, Key, Value, and Scores during the computation of attention. These various attention mechanism algorithms are designed to minimize the computational load of LLMs. The QM7b dataset contains excitation energy data for molecules, providing important material for the study of light-responsive molecules. These molecular data from QM7b were also used to fine-tune our LLMs.
Figure A1b illustrates the processing and analysis flow of the BigBird, Gemma, and GPT NeoX generation results in our study. Initially, the generated molecule data underwent data cleaning and a chemical structure reasonableness check. For the screening results, we extracted the molecules’ Morgan fingerprints, QED scores, and SA values. The feature descriptors derived from these molecules were then analyzed using PCA and TSNE algorithms. The TSNE results were visualized to observe differences in model performance. Additionally, the results from PCA and TSNE analyses were used to calculate intermolecular relationships and generate a neighborhood matrix. This matrix facilitated the construction of a graph network and ranking of all molecules using the PageRank algorithm. Furthermore, the QED and SA values were utilized to determine the ordering of the molecules.
Figure A1. The workflow of intelligent photoresponsive molecules. (a) Description of the layers in Large Language Models. (b) The post-processing of molecules generated by Large Language Models.
Figure A2. The Block sparse attention of BigBird.
Figure A3. Multi-query attention.
Figure A4. Flash attention.
Figure A5. Rotary embedding.

Appendix A.2. Optimized Coordinates for ’NNCCN=NNC(C)C=C’ Isomerization

Optimized coordinates for A isomerization through rotation about the N=N bond in water.
Cis
  • C 3.414338000 1.148000000 −2.948062000
    C 2.453240000 1.810094000 −2.304456000
    C 1.030152000 1.340845000 −2.171861000
    N 0.565434000 1.436787000 −0.765514000
    N −0.018235000 0.420347000 −0.057741000
    N 0.468095000 −0.721198000 0.045662000
    C 1.764858000 −1.106221000 −0.513332000
    C 1.600521000 −2.202974000 −1.562107000
    N 2.895426000 −2.782005000 −1.893707000
    N 2.820639000 −3.776054000 −2.909832000
    C 0.103066000 2.185822000 −3.046965000
    H 4.419490000 1.550929000 −3.043161000
    H 3.226481000 0.179493000 −3.410081000
    H 2.662630000 2.782510000 −1.856343000
    H 0.947292000 0.298006000 −2.491054000
    H −0.052585000 2.232296000 −0.652310000
    H 2.360560000 −0.269106000 −0.892697000
    H 2.310142000 −1.545726000 0.332253000
    H 0.966821000 −2.997542000 −1.148498000
    H 1.078735000 −1.810570000 −2.455011000
    H 3.498891000 −2.049785000 −2.260349000
    H 2.452002000 −4.616168000 −2.468425000
    H 2.122588000 −3.506686000 −3.610575000
    H −0.930944000 1.833977000 −2.962802000
    H 0.412339000 2.123030000 −4.094814000
    H 0.140158000 3.239718000 −2.744598000
Trans
  • C −3.646803000 1.358683000 1.418816000
    C −2.518253000 2.060561000 1.331076000
    C −1.153511000 1.451954000 1.183465000
    N −0.573440000 1.914885000 −0.079189000
    N 0.312755000 1.210229000 −0.777589000
    N 0.844090000 0.253945000 −0.162598000
    C 1.805638000 −0.449724000 −0.996872000
    C 1.360325000 −1.895054000 −1.192361000
    N 2.379562000 −2.647597000 −1.910045000
    N 1.995615000 −3.989780000 −2.189774000
    C −0.250743000 1.816593000 2.364112000
    H −4.610300000 1.845410000 1.548604000
    H −3.646050000 0.271044000 1.371167000
    H −2.547843000 3.150532000 1.387654000
    H −1.238777000 0.357561000 1.121782000
    H −1.146120000 2.488188000 −0.684023000
    H 1.928427000 0.050917000 −1.968502000
    H 2.775745000 −0.448599000 −0.482651000
    H 1.213669000 −2.366055000 −0.211765000
    H 0.382437000 −1.911181000 −1.709156000
    H 2.535479000 −2.207066000 −2.813728000
    H 2.099137000 −4.509156000 −1.320241000
    H 0.995211000 −4.025956000 −2.410894000
    H 0.751887000 1.405866000 2.216206000
    H −0.664234000 1.409150000 3.292863000
    H −0.173773000 2.904888000 2.465215000
Transition state
    C −4.213817000 −0.450788000 0.968925000
    C −3.557760000 0.108145000 −0.042077000
    C −2.086138000 0.393378000 −0.029112000
    N −1.436466000 −0.430277000 −1.049211000
    N −0.340585000 −1.129848000 −0.901860000
    N 0.214890000 −1.130837000 0.246591000
    C 1.252794000 −0.276054000 0.733660000
    C 2.579703000 −0.424941000 −0.005454000
    N 3.581024000 0.471451000 0.554781000
    N 4.815804000 0.439462000 −0.159520000
    C −1.793535000 1.872716000 −0.249617000
    H −5.285333000 −0.623124000 0.923392000
    H −3.704543000 −0.750367000 1.882011000
    H −4.093875000 0.405769000 −0.944011000
    H −1.639253000 0.059269000 0.912941000
    H −1.924269000 −0.616816000 −1.916507000
    H 0.950461000 0.787712000 0.705283000
    H 1.411056000 −0.515103000 1.788259000
    H 2.937138000 −1.454512000 0.097691000
    H 2.423677000 −0.240723000 −1.082872000
    H 3.236228000 1.421063000 0.467384000
    H 5.300848000 −0.402504000 0.131311000
    H 4.627335000 0.314620000 −1.155589000
    H −0.717104000 2.059337000 −0.260440000
    H −2.237603000 2.462970000 0.556141000
    H −2.214929000 2.218022000 −1.198421000
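The cis/trans assignment of these geometries can be checked directly from the Cartesian coordinates by computing the N1-N2-N3-C4 dihedral discussed in Figure 5. Below is a minimal, dependency-free sketch using the N1, N2, N3, and C4 coordinates from the trans geometry above (the atom selection is our reading of the coordinate list, not an assignment stated in the text):

```python
import math

def dihedral(p1, p2, p3, p4):
    """Dihedral angle in degrees for four points (standard atan2 formulation)."""
    b1 = [p2[i] - p1[i] for i in range(3)]
    b2 = [p3[i] - p2[i] for i in range(3)]
    b3 = [p4[i] - p3[i] for i in range(3)]

    def cross(u, v):
        return [u[1]*v[2] - u[2]*v[1],
                u[2]*v[0] - u[0]*v[2],
                u[0]*v[1] - u[1]*v[0]]

    def dot(u, v):
        return sum(a*b for a, b in zip(u, v))

    n1, n2 = cross(b1, b2), cross(b2, b3)
    b2_norm = math.sqrt(dot(b2, b2))
    m = cross(n1, [c / b2_norm for c in b2])
    return math.degrees(math.atan2(dot(m, n2), dot(n1, n2)))

# N1, N2, N3, C4 taken from the trans geometry listed above
N1 = (-0.573440, 1.914885, -0.079189)
N2 = (0.312755, 1.210229, -0.777589)
N3 = (0.844090, 0.253945, -0.162598)
C4 = (1.805638, -0.449724, -0.996872)
angle = dihedral(N1, N2, N3, C4)
print(round(angle, 1))  # near ±180° for the planar trans isomer
```

For the cis geometry the same calculation yields a much smaller absolute dihedral, which is what distinguishes the two conformers along the Figure 5 rotation coordinate.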

Figure 1. Workflow for large language models on photoresponsive isomer molecules: Pretraining and fine-tuning of the large language model, screening of generated content, quantum chemical simulation of molecular properties and mechanisms, and reinforcement learning with human feedback.
Figure 2. Molecules generated by the pre-trained language models (PLMs), visualized with t-SNE. The PLMs used here are BigBird, Gemma, and GPT NeoX, represented by green dots (BigBird), indigo crosses (Gemma), and dark blue triangles (GPT NeoX).
Figure 3. Evaluation of the generated molecules. (a,b) QED and SA scores of the molecules generated by BigBird, Gemma, and GPT NeoX, shown in orange-red (BigBird), indigo blue (Gemma), and purple (GPT NeoX). (c,d) Chemical structures of the molecules ranked by SA and QED score, respectively.
Figure 4. Molecule recommendation based on PageRank. (a) Adjacency matrix of molecular features, which also serves as the knowledge-graph network on which PageRank is run. (b) Chemical structures of the top 20 molecules ranked by PageRank score.
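The recommendation step in Figure 4 ranks molecules by running PageRank over the feature adjacency matrix. A minimal power-iteration sketch on a toy 3-node adjacency matrix follows (illustrative only; the damping factor 0.85 is the conventional default, not a value reported here):

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank on an adjacency matrix (list of 0/1 rows)."""
    n = len(adj)
    rank = [1.0 / n] * n
    out_degree = [sum(row) for row in adj]
    for _ in range(iters):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_degree[i] == 0:
                # Dangling node: redistribute its rank uniformly
                for j in range(n):
                    new[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    if adj[i][j]:
                        new[j] += damping * rank[i] / out_degree[i]
        rank = new
    return rank

# Toy graph: edges 0->1, 0->2, 1->2, 2->0; node 2 has the most in-link weight
adj = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
scores = pagerank(adj)
print(max(range(3), key=lambda i: scores[i]))  # → 2
```

In the paper's setting, nodes would be molecular features or candidate molecules and the adjacency matrix in panel (a) would replace this toy graph.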
Figure 5. Potential energy diagrams along the isomerisation between the cis and trans conformations via rotation of the dihedral (N1-N2-N3-C4) on the S0, S1, and S2 states in water solvation. The numerals 1, 2, 3, and 4 denote the four atoms defining the dihedral angle. For the photocatalytic isomerization of the molecule shown in this Figure, the initial state (cis), transition state (ΔE), and final state (trans) are illustrated. The geometric coordinates corresponding to these states are given in Appendix A.
Figure 6. The workflow of pre-trained GPT NeoX with the DPO trainer.
Table 1. Number of molecules meeting the chemical requirement: two nitrogen atoms bonded directly to each other in an acyclic arrangement. Here, the requirement of an explicit -N=N- functional group was relaxed to any pair of bonded nitrogen atoms that are not part of a polycyclic ring.
Models | Number of Molecules Meeting the Chemical Requirements
BigBird | 17
Gemma | 79
GPT NeoX | 132
Chemist Instruction Training | 439
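The criterion counted in Table 1 (a bonded nitrogen pair whose bond is not part of a ring) can be tested on a simple bond graph. In practice a cheminformatics toolkit would be used (e.g., an RDKit SMARTS pattern such as `[N]!@[N]`); the dependency-free sketch below instead checks whether an N-N bond is a bridge, with hand-built toy molecules as input (both examples are ours, not molecules from the paper):

```python
from collections import deque

def has_acyclic_nn_bond(elements, bonds):
    """True if some N-N bond is a bridge, i.e., not part of any ring.
    elements: list of atomic symbols; bonds: list of (i, j) atom-index pairs."""
    adj = {i: set() for i in range(len(elements))}
    for u, v in bonds:
        adj[u].add(v)
        adj[v].add(u)

    def bond_in_ring(u, v):
        # u and v stay connected after deleting the u-v bond <=> bond lies on a ring
        seen, queue = {u}, deque([u])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if {x, y} == {u, v} or y in seen:
                    continue
                seen.add(y)
                queue.append(y)
        return v in seen

    return any(elements[u] == "N" and elements[v] == "N" and not bond_in_ring(u, v)
               for u, v in bonds)

# Toy inputs: an acyclic C-N=N-C azo bridge vs. a pyrazole-like ring with adjacent N atoms
azo_like = has_acyclic_nn_bond(["C", "N", "N", "C"], [(0, 1), (1, 2), (2, 3)])
ring_nn = has_acyclic_nn_bond(["N", "N", "C", "C", "C"],
                              [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
print(azo_like, ring_nn)  # → True False
```

The bridge test captures exactly the relaxed Table 1 criterion: the azo bridge qualifies, while the ring-embedded N-N pair does not.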
Table 2. Details of the top recommended molecules. The organic solvent used here is chlorobenzene.
Method | SMILES | Gas Phase (eV) | Water (eV) | Organic Solvent (eV) | Language Model
PageRank | OC1CC1=CO | 5.923 | 6.444 | 6.4 | BigBird
  | CCNC1CC1=C | 5.066 | 5.142 | 5.121 | GPT NeoX
  | C1C2C=CC=CC12 | 4.346 | 4.275 | 4.281 | GPT NeoX
  | NC1CCCC=C1 | 5.873 | 6.081 | 6.03 | GPT NeoX
  | OC1CCOC1=C | 6.448 | 6.525 | 6.477 | GPT NeoX
  | OC1CC1C | 7.118 | 7.426 | 7.364 | GPT NeoX
  | CCC1CC=C1C | 6.942 | 7.022 | 6.987 | Gemma
  | OCC1=NCCN1 | 6.945 | 7.372 | 7.277 | GPT NeoX
  | OCC1=NCCN1 | 5.922 | 6.16 | 6.113 | Gemma
  | N=CN1CCCN1 | 6.112 | 6.419 | 6.386 | GPT NeoX
  | C1C=CC2NC12 | 5.768 | 5.852 | 5.831 | GPT NeoX
  | OC12CC1=CS2 | 3.817 | 3.6 | 3.632 | Gemma
  | OC1CNC=NC1 | 5.965 | 6.183 | 6.13 | GPT NeoX
  | C1C2CC=CC=CC12 | 4.910 | 4.801 | 4.795 | GPT NeoX
  | OC1=CCN=C1 | 5.097 | 5.251 | 5.207 | Gemma
  | CC1C(O)C1C | 7.019 | 7.394 | 7.313 | GPT NeoX
  | NC1CCC=C1 | 5.949 | 6.202 | 6.165 | GPT NeoX
  | CC1NCC1O | 6.310 | 6.871 | 6.753 | Gemma
  | CC1=CCC2CC12 | 6.612 | 6.493 | 6.473 | GPT NeoX
  | CC1C(O)C1O | 6.648 | 6.909 | 6.853 | BigBird
QED | CCC1=COCC=CC(=CC(C)CCC1=CN) | 3.927 | 3.847 | 3.862 | BigBird
  | CN1CNC=C1CC1CC1C | 4.807 | 5.099 | 5.016 | BigBird
  | CNCC1COCN1CC1C=CC1 | 5.883 | 6.008 | 5.99 | BigBird
  | CNCCC1CCCC1C | 6.055 | 6.319 | 6.255 | BigBird
  | CC1CC1CC(C)CC1CC1=O | 3.783 | 3.848 | 3.83 | BigBird
  | CCC(C)CC(C)(N)CN | 6.240 | 6.616 | 6.521 | Gemma
  | C1CC1CC1CC1CN | 6.493 | 6.918 | 6.812 | BigBird
  | CCC1=NC=NC=C1Cl | 4.516 | 4.639 | 4.606 | Gemma
  | CCCC(C)C(C)O | 7.084 | 7.485 | 7.394 | Gemma
  | CC1CNCC1CC(C) | 6.239 | 6.575 | 6.484 | BigBird
SA | C1C2C3OC1CN23 | 7.331 | 7.704 | 7.62 | GPT NeoX
  | C1C2NC1C1NC21 | 6.555 | 6.969 | 6.869 | GPT NeoX
  | CC1=C=CC1C=NN=C | 3.604 | 3.783 | 3.731 | BigBird
  | NC1CN2CC1N2 | 6.317 | 6.994 | 6.854 | GPT NeoX
  | NC1=CSN=C=C1 | 2.173 | 2.091 | 2.041 | GPT NeoX
  | C1NC2CCC1O2 | 6.036 | 6.530 | 6.416 | GPT NeoX
  | C1CC=CC=C=C=CC=CCC1N | 3.613 | 3.481 | 3.465 | BigBird
  | NC1C2NC=CN12 | 5.258 | 5.391 | 5.347 | GPT NeoX
  | OC1C2NC1C=C2 | 5.146 | 5.45 | 5.365 | GPT NeoX
  | CC1C=CON=N1 | 3.501 | 3.594 | 3.568 | GPT NeoX
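Comparing the three environments in Table 2 gives each candidate's solvent shift (water minus gas phase; negative values indicate a red-shift in water). A small sketch over a few rows transcribed from the PageRank block of the table (values in eV):

```python
# (SMILES, gas-phase eV, water eV) — rows transcribed from Table 2
rows = [
    ("OC1CC1=CO", 5.923, 6.444),
    ("CCNC1CC1=C", 5.066, 5.142),
    ("C1C2C=CC=CC12", 4.346, 4.275),
    ("OC12CC1=CS2", 3.817, 3.600),
]
# Solvent shift: positive = blue-shift in water, negative = red-shift
shifts = {smi: round(water - gas, 3) for smi, gas, water in rows}
most_red_shifted = min(shifts, key=shifts.get)
print(most_red_shifted, shifts[most_red_shifted])  # → OC12CC1=CS2 -0.217
```

Such shifts are one simple way to screen Table 2 for candidates whose absorption moves toward longer wavelengths in aqueous conditions.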