1. Introduction
Discovering new molecules for drugs and materials can lead to enormous societal and technological advances [1]. However, comprehensive exploration of the vast space of potential drug molecules is computationally difficult; estimates place the number of pharmacologically relevant molecules at around
to
compounds [2,3]. Often, such searches are constrained by the number of known structures and by desired attributes such as solubility or toxicity. There are many approaches to exploring chemical space in silico and in vitro, including high-throughput screening, combinatorial libraries, and evolutionary algorithms [4,5,6,7].
In the past few years, the rapid development of deep learning and GPU (graphics processing unit) [8] technologies, together with the emergence of large molecular datasets (e.g., ZINC [9] and ChEMBL [10]), has driven the design of new computational systems for modeling increasingly complex phenomena. Deep generative models have proven effective in modeling molecular data. They have been widely used in a variety of fields, from generating synthetic images [11] and natural language text [12] to applications in biomedicine, including DNA sequence design [13], aging research [14], target identification [15], antimicrobial drug discovery [16], and drug repurposing [17,18]. An important application area of deep generative models is the reverse design of drug compounds with given attributes (solubility, ease of synthesis, etc.) [19].
Deep generative methods for the reverse design of drug molecules can be divided into two categories: those based on the VAE (Variational Autoencoder) [20] and those based on GANs (Generative Adversarial Networks) [21]. The VAE [20] applies variational inference to generative modeling. A variety of VAE models have been applied to molecular feature extraction [22,23,24,25,26,27]. These works emphasize the accuracy of molecular reconstruction, the ability to explore and propose new molecules in a continuous latent space, and the optimization of specific molecular attributes through Bayesian optimization. Kusner et al. [23] proposed the GVAE, an autoencoder based on a context-free grammar of molecular SMILES strings. Jin et al. [26] proposed the JT-VAE network, which achieves high-precision molecular reconstruction by combining molecular graph encoding with auxiliary junction-tree encoding. Jin et al. [27] later proposed the Hier-VAE model, which converts SMILES [28] into molecular graphs as input, learns substructure features through motif pre-extraction, simplifies the structure of the molecular generation model through three-level encoding and decoding steps, and generates molecules of larger molecular weight particularly well. However, the VAE lacks an accurate evaluation of the probability distribution, and the limitations of its loss function often lead to unrealistic and ambiguous samples. In contrast, GANs [21] have proven to be excellent generative models in computer vision, and initial explorations have applied them to molecular design [29,30,31,32]. Maziarka et al. [31] proposed a Cycle-GAN for molecular optimization tasks, which learns molecular transformation rules from two disjoint sets of molecules. Liu et al. [33] proposed MolFilterGAN, a model combining the VAE and GAN, which aims to distinguish whether the molecules generated by the VAE network are biologically active drug molecules or molecules from the generated chemical space. Kadurin et al. [32] proposed an Adversarial Autoencoder (AAE) derived from the GAN architecture to identify and generate new compounds with predefined attributes. However, the input of this model is the MACCS fingerprint [34], and the network consists only of fully connected layers, so it cannot effectively extract the complex features of a molecule.
Furthermore, none of the above methods can perform interval-specific generation for given attributes (solubility, ease of synthesis, etc.) [22,23,24,25,26,27,29,30,31,32,33]. Taking logP as an example, existing methods cannot generate molecules whose logP lies in a specific interval such as (0.0, 1.0]. To address this problem, this paper makes the following contributions. First, we propose a novel model, AC-ModNet, which combines a VAE [20] with an AC-GAN [35] to generate molecular structures whose attributes fall in specified intervals. Second, the model addresses the weak generalization ability of VAE-like methods and their need for extensive parameter tuning: in two training runs, we reach the molecular validity that the hierarchical VAE model [27] attains with 20 training parameter adjustments. Third, we use convolutional neural networks as the backbone of the GAN, which can effectively learn molecular graph structures and features. Finally, we overcome the difficulty of GAN training by introducing new training loss functions and a balance factor, which shortens the training time by an order of magnitude and mitigates the mode collapse and weak generalization of the GAN method. Our method uses three attributes, logP [36], QED [37], and SAScore [38], which are divided into the classes logp1–logp6, QED1–QED6, and SA1–SA7. By selecting an interval class, molecules that meet the corresponding condition are generated.
We trained AC-ModNet on the open ZINC dataset [9], and the resulting generative model generated molecules with 100% validity, 98.64% average novelty, and 62.47% average uniqueness. We assess the model by comparing it with the VAE [20], AAE [39], JT-VAE [26], hierarchical VAE [27], and LatentGAN [40] models using evaluation criteria such as the FCD [41,42], IntDiv1 [43], IntDiv2 [43], SNN [44], Scaf [45], and Frag [46]. Our method performed best in both the FCD and Frag. In addition, by comparing highly credible generated molecules with existing molecules in the PubChem dataset [47], we show that the molecules created by our model have potential for use in drug design. The results of this study provide new avenues for machine-learning-based reverse drug design and further reference methods for drug synthesis experiments.
3. Experiment
In the implementation of the AC-ModNet model illustrated in Figure 1, we chose the Hier-VAE [27] as the molecular encoder–decoder model. The generator and discriminator of the AC-ModNet were both based on convolutional neural networks.
Generator network details: We embedded the category information as a vector of length 50 and fed this vector into the generator together with noise. The input conversion interface of the generator consists of a two-layer FCN (fully connected network) with the Tanh activation function, where the hidden layer has a length of 128. It maps the molecular feature vectors into a size image matrix, which facilitates subsequent convolutional learning. The convolutional unit of the generator consists of a convolutional layer, a LeakyReLU activation function, and a BatchNorm layer. The generator’s convolutional network consists of four convolutional units, with the first two each having a downsample layer, finally outputting a feature matrix. The feature matrix was flattened and mapped into a generated vector with a length of 250 through a two-layer FCN, where the middle layer has a length of 500.
Discriminator network details: The molecular feature vector with a length of 250, obtained from the encoder or the generator, was input into the discriminator. We used a two-layer FCN to map the molecular feature vectors into a size image matrix, where the hidden layer has a length of 512. Compared to the generator, the discriminator’s convolutional unit incorporates an additional dropout layer (dropout ratio of 0.25). The discriminator’s convolutional network consists of four convolutional units and a residual network, thus outputting an matrix with a depth of 128. This matrix was flattened into a vector, and the output part was composed of two different FCNs. One two-layer FCN maps the vector to a confidence in [0, 1] through the sigmoid activation function. The other two-layer FCN maps the vector to a vector of discriminant values, one per class, using the softmax activation function.
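The two output heads of the discriminator can be illustrated with a minimal, framework-free sketch. It only shows how a flattened feature vector is turned into a real/fake confidence (sigmoid) and a class distribution (softmax). The single linear layer per head, the toy weights, and all function names are illustrative assumptions, not the trained two-layer FCNs of the AC-ModNet.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    """Squash a real-valued logit into a confidence between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: List[float]) -> List[float]:
    """Turn class logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(features: List[float], weights: List[List[float]]) -> List[float]:
    """Hypothetical stand-in for an FCN: one weight row per output unit."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def discriminator_heads(
    features: List[float],
    w_real: List[List[float]],
    w_class: List[List[float]],
) -> Tuple[float, List[float]]:
    """Return (real/fake confidence, class probabilities) for one sample."""
    confidence = sigmoid(linear(features, w_real)[0])
    class_probs = softmax(linear(features, w_class))
    return confidence, class_probs


# Toy example: 3 flattened features and 6 attribute classes (made-up numbers).
features = [0.2, -0.1, 0.4]
w_real = [[0.5, 0.5, 0.5]]
w_class = [[0.1 * (i + 1), -0.2, 0.3] for i in range(6)]
confidence, class_probs = discriminator_heads(features, w_real, w_class)
```

Keeping the two heads separate mirrors the AC-GAN idea of judging authenticity and attribute class from the same shared features.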
We used Python 3.8 to build the AC-ModNet, accelerated computation with CUDA 11.6, and performed molecular fingerprint calculation and molecular graph analysis with the RDKit [49] chemical toolkit, version 2020.09.1.0.
3.1. Data Processing
The AC-ModNet is evaluated using the open-source ZINC-250K dataset [9]. We used the molecular SMILES codes and three common, important attributes (logP, QED, and SAscore), which were input into the network after data cleaning and data balancing.
To obtain a rich set of recognizable motifs, we used an encoder–decoder model pretrained on the ChEMBL dataset [10], which contains approximately 1800 K molecules. During data cleaning, we removed 1936 molecules in the ZINC dataset that the encoder could not process. These molecules account for only 0.78% of the total data.
logP [36]: The lipid–water partition coefficient was proposed by Scott and Gordon to measure the lipophilicity (degree of fat solubility) of molecular compounds. Generally, for oral drugs absorbed through passive diffusion, a logP of 0–3 is considered the best range for absorption by the human body. Compounds with a high logP have poor water solubility, and compounds with a low logP have poor lipid permeability. Therefore, we classified the low-value interval
into one category and the high-value interval
into one category. The range of the logP with high bioavailability was more finely divided into six types of lipid–water partition coefficients.
Quantitative estimate of drug-likeness (QED) [37]: Tian, Sheng, et al. proposed using important molecular attributes to evaluate the drug-likeness of molecules. This property ranges over (0, 1); a higher value indicates that the molecule, evaluated by combining multiple molecular descriptors, fits the drug-like profile more closely. Based on the need for high drug-likeness and the distribution of this property in the ZINC-250K dataset, we grouped low drug-likeness values in (0, 0.5) into one category and partitioned the high drug-likeness range with an interval width of 0.1, giving six categories in total.
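The QED partition described above can be written as a small lookup. This is a sketch under stated assumptions: the zero-based class indices, the function name `qed_class`, and the choice to place a boundary value such as 0.5 in the upper bin are ours, since the text does not specify how boundaries are handled.

```python
from bisect import bisect_right

# Lower edges of the five high drug-likeness bins of width 0.1.
QED_EDGES = [0.5, 0.6, 0.7, 0.8, 0.9]


def qed_class(qed: float) -> int:
    """Map a QED value in (0, 1) to one of six classes.

    Class 0 covers the low drug-likeness range (0, 0.5); classes 1 to 5
    cover 0.5-0.6, 0.6-0.7, 0.7-0.8, 0.8-0.9, and 0.9-1.0.
    """
    if not 0.0 < qed < 1.0:
        raise ValueError("QED is expected to lie in (0, 1)")
    # bisect_right counts how many bin edges are <= qed.
    return bisect_right(QED_EDGES, qed)


example_classes = [qed_class(q) for q in (0.31, 0.55, 0.72, 0.95)]
```

The resulting class index is the kind of category label that is embedded and passed to the generator.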
SAscore (SAS) [38]: This method, proposed by Ertl and Schuffenhauer [38], is composed of the fragmentScore and complexityPenalty. It characterizes synthetic accessibility as a score between 1 (easy to synthesize) and 10 (difficult to synthesize). Although the SAscore is widely distributed over the range [2, 8], the synthetic accessibility of biomolecules and recorded molecules is mainly concentrated in the range [2, 4.5]. Therefore, we classified the extremely easy synthesis interval [1, 2] into one category and the difficult synthesis interval [4.5, 10] into one category, and we set the partition interval of the easier synthesis range to
. This attribute was divided into seven categories, and the data were balanced using the oversampling method. See Table 1 for more details.
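The text does not state which oversampling variant was used. As one common possibility, the sketch below performs random oversampling with replacement until every class is as large as the largest one. The function name `oversample`, the fixed seed, and the toy SMILES strings are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple

Sample = Tuple[str, Hashable]  # (SMILES string, class label)


def oversample(samples: Sequence[Sample], seed: int = 0) -> List[Sample]:
    """Duplicate minority-class samples at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for smiles, label in samples:
        by_class[label].append((smiles, label))
    target = max(len(group) for group in by_class.values())
    balanced: List[Sample] = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Toy data: class 0 has three molecules, class 1 has one.
toy = [("CCO", 0), ("CCN", 0), ("CCC", 0), ("c1ccccc1", 1)]
balanced = oversample(toy)
label_counts = Counter(label for _, label in balanced)
```

Oversampling only repeats existing molecules, so it changes the class frequencies seen during training without adding new structures.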
3.2. Analysis of Backbone Network
The network structure has a significant impact on molecular generation: different networks trained on the same data can yield very different results.
We constructed two types of kernels—a fully connected network (model 1) and a convolutional network (model 2)—and we compared the convergence results of the model after training with the same data for equal iterations. Model 2 is the network described at the beginning of this section, and model 1 is the network that replaces the convolutional network and the first FCN connected to the start and end of the convolutional network with a
FCN, which also has a similar residual and dropout structure. For a given category, the accuracy is the ratio of the number of generated molecules that fall in the category to the total number requested. We generated 10k molecules for each category, and the per-category accuracy of the two generative models is reported in Table 2.
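The accuracy defined above is a simple ratio, sketched here for a single category. The function name, the interval test, and the toy property values are illustrative assumptions; the interval (0.0, 1.0] is the logP example used in the Introduction. Because the denominator is the number of molecules requested, any molecule that is missing or falls outside the interval counts as a miss.

```python
from typing import Callable, Iterable


def category_accuracy(
    generated_values: Iterable[float],
    in_category: Callable[[float], bool],
    n_requested: int,
) -> float:
    """Fraction of the requested molecules whose property falls in the target category."""
    if n_requested <= 0:
        raise ValueError("n_requested must be positive")
    hits = sum(1 for value in generated_values if in_category(value))
    return hits / n_requested


# Toy example: four molecules requested for the logP interval (0.0, 1.0].
logp_values = [0.4, 0.9, 1.3, 0.1]
accuracy = category_accuracy(logp_values, lambda v: 0.0 < v <= 1.0, n_requested=4)
```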
As Table 2 shows, the accuracy of the molecules generated by the FCN is very low and close to the distribution of randomly generated molecules over the corresponding attribute categories. In contrast, the accuracy of the generative model built on the CNN is significantly higher. The reason is that molecules are graph-like structures, whose features can be analyzed and learned well by a convolutional neural network. Although the fully connected network is more general than the CNN, it is more difficult to train under the same conditions.
Furthermore, the relative accuracy of the generated molecules across the categories of model 2 tilts toward the distribution of the training data shown in Table 1. This makes sense, since the model can learn more about a category when more data belonging to it are provided, which supports the learning effectiveness of our model with the convolutional backbone.
3.3. Evaluation of Loss Function Adjustment
In the experiment, we first trained with the original loss function and observed that the discriminator loss converged quickly while the generator loss showed no sign of convergence, which ultimately led to training failure. Hence, we made some improvements to the generator and discriminator objective functions, which are described in Section 2.3.1.
Based on the adjusted loss functions, we trained the model 10 times for each of the three molecular attributes and calculated the average loss curve of the discriminator, as shown in Figure 2. The losses converged effectively during training, indicating that, with the adjusted loss functions, the discriminator could extract the associations between molecular structure and attributes.
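Averaging the loss curves of repeated runs amounts to a step-wise mean. The sketch below assumes that every run logs the loss at the same training steps; the function name and the toy values are illustrative and are not the losses plotted in Figure 2.

```python
from typing import List, Sequence


def mean_curve(runs: Sequence[Sequence[float]]) -> List[float]:
    """Step-wise mean of several loss curves of equal length."""
    if not runs:
        raise ValueError("at least one run is required")
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must log the same number of steps")
    # zip(*runs) groups the losses of all runs at the same step.
    return [sum(step_losses) / len(runs) for step_losses in zip(*runs)]


# Toy example with two short runs (the paper averages 10 runs per attribute).
average = mean_curve([[1.0, 0.5, 0.25], [0.5, 0.25, 0.25]])
```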
3.4. Evaluation of Adaptive Balance Factor
We introduced an adaptive “balance factor” that automatically adjusts the number of iterations for the discriminator and generator in a single loop based on the loss, as described in Section 2.3.2.
Figure 3 shows how the accuracy of the generator in producing molecules of the specified category changes during training, without and with the balance factor. Without it, the accuracy oscillated severely and then dropped sharply as the model collapsed; with it, the accuracy improved steadily.
The objective function of the whole network is the sum of two cross-entropy losses in the range
, so we set the loop stop condition to a loss value of 0.5, which is small enough to treat the objective function as converged. The AC-ModNet needed only one to two training epochs to achieve the desired effect, an order of magnitude fewer than the epochs required for VAEs (as shown in Table 3). This shows that the “balance factor” significantly improved the training speed of the generative model.
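The exact rule for the balance factor is defined in Section 2.3.2 and is not reproduced here. The sketch below shows only one plausible reading of a loss-driven iteration count: within a single outer loop, each network keeps taking update steps until its loss drops to the stop value of 0.5 or a step cap is reached. The function name, the step cap, the order of the two inner loops, and the stubbed loss sequences are all assumptions for illustration.

```python
from typing import Callable, Tuple


def balanced_round(
    d_step: Callable[[], float],
    g_step: Callable[[], float],
    stop_loss: float = 0.5,
    max_steps: int = 10,
) -> Tuple[int, int]:
    """Run one outer loop and return (discriminator steps, generator steps).

    d_step and g_step are hypothetical callables that perform one update
    and return the resulting loss.
    """
    d_iters = g_iters = 0
    d_loss = g_loss = float("inf")
    # Keep updating the discriminator while its loss is above the stop value.
    while d_loss > stop_loss and d_iters < max_steps:
        d_loss = d_step()
        d_iters += 1
    # Then give the generator as many steps as it needs, up to the cap.
    while g_loss > stop_loss and g_iters < max_steps:
        g_loss = g_step()
        g_iters += 1
    return d_iters, g_iters


# Stubbed losses stand in for real training updates.
d_losses = iter([0.9, 0.6, 0.4])
g_losses = iter([1.2, 0.45])
steps = balanced_round(lambda: next(d_losses), lambda: next(g_losses))
```

In this toy run the discriminator takes three steps and the generator two, which illustrates how the number of updates per loop can adapt to the current losses instead of being fixed in advance.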
5. Conclusions and Future Work
This article has proposed a molecular inverse design method, AC-ModNet, which effectively combines the VAE with the GAN. Evaluations on the ZINC dataset show that the AC-ModNet converges efficiently and that the molecules generated for the specified categories are valid, novel, and relatively unique. Comparisons between the molecules created by our model and the PubChem database show the application value of the AC-ModNet in drug design.
In future work, we will focus on extending our model to 3D molecular datasets and advance our work in the field of molecular optimization. We will also try newer generative models, such as diffusion models, for molecular design.