1. Introduction
Drug development is crucial for the treatment of diseases [1, 2]. Traditional drug development is divided into three stages: the discovery stage, preclinical stage, and clinical stage. Developing a new drug typically requires 10–20 years and costs billions of dollars, which poses considerable challenges. To address these issues, drug repositioning offers an alternative approach by identifying new therapeutic uses for existing approved drugs. This strategy significantly reduces the drug development time and lowers the costs [3, 4, 5]. Consequently, drug repositioning is widely applied by research-based pharmaceutical companies in their drug-discovery efforts.
Drug-repositioning algorithms are typically categorized into feature-based, matrix-factorization-based, and network-based methods for predicting the associations between drugs and diseases [6, 7]. (1) Feature-based methods analyze the chemical and biological properties of drugs, as well as the phenotypic characteristics of diseases, and use data-driven machine learning models to predict potential connections between drugs and diseases [8]. (2) Matrix-factorization-based methods decompose the interaction matrix between drugs and diseases into latent feature vectors and use these vectors to compute drug–disease similarity, thereby predicting new indications for drugs. This approach can handle large-scale datasets, flexibly integrate more prior information, identify potential connections between drugs and diseases, and aid in the rapid discovery of new therapeutic approaches [9, 10]. (3) Network-based drug-repositioning methods use the internal association matrices (e.g., the drug–drug or disease–disease matrix) to predict the external associations between drugs and diseases, which can be regarded as a binary classification task for each drug–disease association [11].
With the development of neural networks, network-based algorithms have gradually become the mainstream for drug-repositioning tasks. As a typical approach, Xuan et al. developed a drug-repositioning approach based on convolutional neural networks (CNNs) and bidirectional long short-term memory (BiLSTM) networks, with the BiLSTM module using an attention mechanism to learn path representations of drug–disease pairs by balancing contributions from different paths [12]. Graph convolutional networks (GCNs) are also widely used in this task because association matrices can be naturally represented as graphs, which capture the features of drug–drug or disease–disease associations. For example, Wang et al. utilized bipartite graph convolution operations to model macroscopic and microscopic information exchange between drugs and diseases through protein nodes, thus effectively leveraging interaction relationships to predict potential diseases that drugs may treat [13]. Yu et al. further introduced a hierarchical-attention-based graph convolutional network for drug repositioning by utilizing relationships at different graph convolution layers to enhance the predictive accuracy [14].
Since the drug-repositioning model learns from a small-scale internal association matrix, it struggles to acquire sufficient knowledge for effective drug repositioning [15]. However, the aforementioned network-based methods rarely introduce additional information for model training. Meanwhile, we observe that the associations between drugs and diseases may be diverse, and it is possible to further explore subcategories of these associations to introduce more information for model learning. Therefore, the main challenge of this work is how to uncover this potential diversity or subcategory knowledge to improve the classification performance for drug repositioning.
In this paper, we propose a prototype-based subcategory exploration (PSCE) model to introduce the potential knowledge of subcategories for model training for drug repositioning. First, we propose a prototype-based feature-enhancement mechanism (PFEM) that employs the K-means method [16, 17] to obtain the clustering subcategories for each sample, and the clustering centroids are regarded as the class-relevant prototypes [18, 19, 20]. In the proposed PFEM, prototypes are used to apply attention to the original graph features to obtain the enhanced features. Second, we introduce a drug–disease dual-task classification head (D3TC) of the model, which consists of a traditional binary classification head and a subcategory-classification head to learn with subcategory exploration. It leverages finer-grained pseudo-labels of subcategories to introduce additional knowledge for precise drug–disease association classification. We conducted experiments on four public datasets to compare with several existing drug-repositioning methods. In these experiments, the PSCE achieved a state-of-the-art performance. Finally, we conducted ablation studies to demonstrate the effectiveness of the proposed PFEM and D3TC. The contributions of this paper are summarized as follows:
This paper presents a prototype-based feature-enhancement mechanism (PFEM) by making full use of the potential knowledge of subcategories for model training, based on which the classification performance in the drug-repositioning task can be significantly improved.
Building on the proposed PFEM, we propose a drug–disease dual-task classification head (D3TC) for subcategory exploration, which learns the potential feature representation of subcategories by building additional constraints to improve the performance of the drug–disease association predictions.
Experimental comparisons showed that the PSCE could achieve state-of-the-art performance compared with the best existing drug-repositioning methods on four datasets.
2. Materials and Methods
As shown in Figure 1, in this section, we systematically introduce the PSCE method proposed for the drug-repositioning task. We first introduce the datasets we used, and then we show the overall framework of our model and provide detailed introductions to the two main modules of our model: the PFEM and D3TC modules. Finally, we present the implementation details of our approach.
2.1. Datasets
We used four datasets to demonstrate the effectiveness and evaluate the performance of our method: Gdataset [21], Cdataset [22], Ldataset [14], and LRSSL [23]. These datasets are widely used in the drug-repositioning task. Among them, the Gdataset includes 1933 confirmed drug–disease associations, including 593 drugs from the DrugBank database and 313 diseases from the OMIM database. The Cdataset contains 663 drugs, 409 diseases, and 2352 drug–disease interaction pairs. The Ldataset was compiled from the CTD dataset, which includes 18,416 associations between 269 drugs and 598 diseases. The last dataset, namely, LRSSL, contains 3051 validated drug–disease associations involving 763 drugs and 681 diseases. The specific statistical information of these datasets is shown in Table 1.
In our method, by observing the relationship between the disease and drug features in the feature space, we propose a novel feature that combines clustering features to calculate the similarity between drugs and diseases. To better interpret the features, we also propose a method that divides the binary classification task into more subtasks through unsupervised clustering so that the model can better distinguish hard samples.
2.2. Overview
We used the drug–disease association matrix, drug–drug similarity matrix, and disease–disease similarity matrix to construct a graph network structure and obtain potential drug–disease relationships. The drug–disease association matrix $X$ represents the known associations between drugs and diseases and is a binary $p \times q$ matrix, where $p$ and $q$ represent the numbers of drug and disease types, respectively. Each element $x_{ij}$ in $X$ indicates the association between drug $i$ and disease $j$, where if there is an association, $x_{ij} = 1$, and otherwise, $x_{ij} = 0$:
$$x_{ij} = \begin{cases} 1, & \text{if drug } i \text{ is associated with disease } j, \\ 0, & \text{otherwise.} \end{cases}$$
The drug–drug similarity matrix $R$ represents the similarity between drugs and is a $p \times p$ matrix, where $p$ is the number of drug types. Each element $r_{ij}$ in $R$ represents the degree of similarity between the $i$-th drug and the $j$-th drug, which is specifically defined as
$$R = [r_{ij}] \in \mathbb{R}^{p \times p}.$$
Similarly, the disease–disease similarity matrix $D$ represents the similarity between diseases and is defined as
$$D = [d_{ij}] \in \mathbb{R}^{q \times q}.$$
The purpose of the drug-repositioning task is to predict unknown potential associations between drugs and diseases by studying the similarity between drugs, the similarity between diseases, and the known associations between drugs and diseases.
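As a minimal illustration of the association matrix described in this section, the following Python sketch builds a binary $p \times q$ matrix from a list of known drug–disease pairs. The function name `build_association_matrix` and the toy indices are hypothetical and are not taken from any of the datasets.

```python
def build_association_matrix(pairs, p, q):
    """Return a p x q binary matrix with 1 at every known (drug, disease) pair."""
    X = [[0] * q for _ in range(p)]
    for i, j in pairs:
        X[i][j] = 1
    return X

# Hypothetical toy data: 3 drugs, 4 diseases, 3 known associations.
known_pairs = [(0, 1), (1, 3), (2, 0)]
X = build_association_matrix(known_pairs, p=3, q=4)

# Every zero entry is an unknown pair that a repositioning model would score.
positives = sum(map(sum, X))
unknown_pairs = [(i, j) for i in range(3) for j in range(4) if X[i][j] == 0]
```

In this toy case, 3 of the 12 entries are positive, which mirrors on a small scale the class imbalance of the real association matrices.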
2.3. Model Architecture
Existing methods have already shown the effectiveness of graph neural networks (GNNs) in constructing associations between drugs and diseases. Our method takes the drug similarity matrix $R$ and the disease similarity matrix $D$ as the input of the network to construct the corresponding graph structures $G^{r}$ and $G^{d}$, respectively, according to the element adjacency relationship. The obtained graph structures are then fed into the graph neural network for preliminary feature extraction, which results in drug features $F^{r}$ and disease features $F^{d}$. To better represent the features of similar drugs/diseases, we used a clustering feature-enhancement method (PFEM) to strengthen the expression ability of the features, thus obtaining the enhanced drug features $\hat{F}^{r}$ and disease features $\hat{F}^{d}$. We obtained the drug–disease similarity features by unfolding the obtained drug features and disease features in the form of a tensor product, which was then used for the predictions:
$$F_{ij} = \hat{F}^{r}_{i} \oplus \hat{F}^{d}_{j}, \quad i = 1, \dots, p, \quad j = 1, \dots, q,$$
where $p$ is the number of drug types, $q$ is the number of disease types, and $\oplus$ represents the concatenation operation.
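The pairing step above can be sketched in a few lines of Python. This is only an illustration under the assumption that each drug and each disease is represented by a plain feature vector; the function name `pair_features` and the toy values are hypothetical.

```python
def pair_features(drug_feats, disease_feats):
    """Concatenate every drug vector with every disease vector.

    Returns a p x q grid whose (i, j) cell is the drug-i vector
    followed by the disease-j vector.
    """
    return [[d + s for s in disease_feats] for d in drug_feats]

# Hypothetical toy features: p = 3 drugs, q = 2 diseases, dimension 2 each.
drug_feats = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
disease_feats = [[1.0, 2.0], [3.0, 4.0]]

F = pair_features(drug_feats, disease_feats)
pair_dim = len(F[0][0])  # each pair vector has twice the single-entity dimension
```

Each of the $p \times q$ cells can then be scored by a classification head, so the output has the same shape as the association matrix.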
We then used the drug–disease association matrix as the label to supervise the learning of these features. In previous methods, a simple decoder was used to parse the features to achieve classification, but we believe that simple binary classification cannot distinguish some difficult samples, and thus, we propose a new classification head (D3TC) to improve the classification performance and obtain the final prediction probability matrix.
2.4. Prototype-Based Feature-Enhancement Mechanism
In order to obtain the underlying associations between drugs and between diseases, previous methods often relied on the k-nearest neighbor graph of the similarity matrix to construct stronger similarity. However, in this paper, we believe that features with closer clustering in the feature space have stronger similarity. To enhance this similarity, and thus, obtain more subtle associations between diseases and drugs, we propose a prototype-based feature-enhancement method (PFEM).
We used the features extracted by the graph neural network as the initial features for enhancement. For the drug features $F^{r} \in \mathbb{R}^{p \times s}$, where $p$ is the number of drug types and $s$ is the feature dimension, we performed k-means clustering on the features to group the $p$ drugs into $k$ clusters and obtained the feature of each cluster center $C^{r} \in \mathbb{R}^{k \times s}$. We then fused each drug’s own feature with the feature of the cluster center it belonged to in order to obtain the enhanced features $\hat{F}^{r}$. Specifically, we used an attention mechanism to acquire more representative features. Similarly, for the disease features $F^{d}$, we adopted the same method to obtain the enhanced disease features $\hat{F}^{d}$.
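The following Python sketch illustrates the idea of prototype-based enhancement on toy data. It is not the implementation used in this work: the text does not specify the exact form of the attention, so the sketch assumes a simple scaled dot-product attention over two candidates (the feature itself and its cluster prototype). The function names and the deterministic centroid initialization are likewise assumptions made for illustration.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; centroids start at the first k points (assumption)."""
    centroids = [list(pt) for pt in points[:k]]
    for _ in range(iters):
        assign = [nearest(pt, centroids) for pt in points]
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(pt, centroids) for pt in points]

def enhance(feature, prototype):
    """Fuse a feature with its prototype via softmax attention (assumed form)."""
    scale = math.sqrt(len(feature))
    scores = [sum(f * v for f, v in zip(feature, vec)) / scale
              for vec in (feature, prototype)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    w_self, w_proto = (e / sum(exps) for e in exps)
    return [w_self * f + w_proto * c for f, c in zip(feature, prototype)]

# Hypothetical toy drug features: p = 4 drugs, dimension s = 2, k = p // 2.
features = [[0.0, 0.0], [4.0, 4.0], [0.0, 1.0], [4.0, 5.0]]
centroids, assign = kmeans(features, k=len(features) // 2)
enhanced = [enhance(f, centroids[a]) for f, a in zip(features, assign)]
```

Because the attention weights sum to one, each enhanced feature lies between the original feature and its prototype, which pulls members of the same cluster closer together.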
2.5. Drug–Disease Dual-Task Classification Head
Although we obtained representative features, predicting the potential association probability between drugs and diseases is still a challenging task, as there are still some difficult samples. The traditional decoder treats this prediction task as a binary classification problem, which yields classification results with high inter-class similarity and hinders the formation of diverse features. To obtain a better prediction performance, we propose a drug–disease dual-task classification head (D3TC).
In addition to the binary classification task of predicting whether there is an association between a drug and a disease, we further extended each class into T sub-classes that represent different degrees of relevance and irrelevance (e.g., extremely irrelevant, possibly irrelevant, possibly relevant, extremely relevant). This encourages the model to not only focus on the differences in binary classification but also on the differences in different degrees, thus ultimately obtaining a more subtle feature representation:
$$y \rightarrow y^{s},$$
where $y$ is the binary classification label, $y^{s}$ is the one-hot pseudo-label for the molecular sub-classes, and $\rightarrow$ represents the process of using the original labels to generate a subcategory label.
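One possible way to realize this label expansion is sketched below in Python. It assumes that the features of each binary class are clustered separately into T groups and that the sub-class index is formed as `y * T + cluster_id`; both choices, as well as the function names and the toy one-dimensional features, are assumptions for illustration rather than the exact procedure of this work.

```python
def kmeans_labels(points, k, iters=20):
    """Cluster ids from plain Lloyd's k-means; needs at least k points."""
    centroids = [list(pt) for pt in points[:k]]

    def nearest(pt):
        return min(range(k),
                   key=lambda c: sum((x - y) ** 2 for x, y in zip(pt, centroids[c])))

    for _ in range(iters):
        labels = [nearest(pt) for pt in points]
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(pt) for pt in points]

def subclass_pseudo_labels(features, y, T):
    """Map each binary label to a sub-class id in [0, 2T) by per-class clustering."""
    sub = [None] * len(y)
    for cls in (0, 1):
        idx = [n for n, label in enumerate(y) if label == cls]
        local = kmeans_labels([features[n] for n in idx], T)
        for n, l in zip(idx, local):
            sub[n] = cls * T + l
    return sub

def one_hot(label, size):
    return [1 if i == label else 0 for i in range(size)]

# Hypothetical toy deep features (dimension 1) and binary labels.
features = [[0.0], [5.0], [0.1], [5.1], [1.0], [9.0], [1.1], [9.2]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
T = 2
sub = subclass_pseudo_labels(features, y, T)
sub_one_hot = [one_hot(s, 2 * T) for s in sub]
```

With this indexing, the binary label can always be recovered from the pseudo-label as `sub[n] // T`, so the two tasks stay consistent.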
First, we trained the binary classification model until it converged. Then, we extracted deep features for each sample and obtained the pseudo-labels for the sub-classes through unsupervised clustering. Finally, we jointly trained the network using both the binary classification labels and the sub-class pseudo-labels. To better train the network, we used a weighted binary cross-entropy loss to supervise the binary classification task:
At the same time, we introduced focal loss and center loss to learn the knowledge of the pseudo-labels. This allowed us to bring the samples of the same class closer in the feature space and push the samples of different classes farther apart. By introducing focal loss, we reduced the weight of the easy samples and focused more on the difficult samples, which helped to push the different classes apart in the feature space:
Center loss was used to minimize the intra-class variability by encouraging the feature vectors of the same class to be close to their corresponding class centers. The center loss was defined as
By combining the loss of the binary classification and the sub-class pseudo-labels, we optimized the classification model:
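For orientation, the three loss terms named above can be sketched in their commonly used textbook forms. The Python code below is an assumption-laden illustration, not the exact formulation of this work: the single `pos_weight` factor, the focusing parameter `gamma = 2.0`, the factor of one half in the center loss, and the use of 0.005 as the weight `lam` on the center loss are all assumptions.

```python
import math

EPS = 1e-12  # avoids log(0)

def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross-entropy with positive samples scaled by pos_weight."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= pos_weight * y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS)
    return total / len(probs)

def focal_loss(class_probs, targets, gamma=2.0):
    """Mean multi-class focal loss: -(1 - p_t)^gamma * log(p_t)."""
    losses = [-((1 - probs[t]) ** gamma) * math.log(probs[t] + EPS)
              for probs, t in zip(class_probs, targets)]
    return sum(losses) / len(losses)

def center_loss(features, targets, centers):
    """Half the mean squared distance between each feature and its class center."""
    total = sum(sum((f - c) ** 2 for f, c in zip(feat, centers[t]))
                for feat, t in zip(features, targets))
    return 0.5 * total / len(features)

def total_loss(l_bce, l_focal, l_center, lam=0.005):
    """Assumed combination: binary loss + focal loss + lam * center loss."""
    return l_bce + l_focal + lam * l_center

# Hypothetical toy values.
l_bce = weighted_bce([0.5], [1], pos_weight=2.0)
l_easy = focal_loss([[0.1, 0.9]], [1])   # confident, correct prediction
l_hard = focal_loss([[0.5, 0.5]], [0])   # uncertain prediction
l_center = center_loss([[1.0, 2.0]], [0], [[0.0, 0.0]])
loss = total_loss(l_bce, l_hard, l_center)
```

The comparison of `l_easy` and `l_hard` shows the intended effect of the focal term: well-classified samples contribute far less to the loss than uncertain ones.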
3. Results
In this section, we first give the implementation details of the proposed PSCE in Section 3.1 and describe the evaluation metrics in Section 3.2. Then, we give the results of the local leave-one-out 10-time 10-fold cross-validation in Section 3.3 and the ablation study in Section 3.4.
3.1. Implementation Details
During the training process, we divided the training samples and validation samples based on the drug–disease association matrix. For each element in the matrix, we could treat it as a sample. We randomly split these samples into a training set and a validation set at a ratio of 9:1, and adopted a 5-fold cross-validation experiment to obtain the model’s performance.
Our model used the Adam optimizer with a learning rate of 0.01. The mini-batch size for the model training was set to 2000, and a 5-fold cross-validation was adopted. Our experiments were conducted with PyTorch 1.13.1 on a workstation equipped with a 24 GB NVIDIA RTX 3090 GPU. In the PFEM, the number of clustering centers was set to half the number of samples, and in the D3TC, the number of sub-classes T was set to five. In the loss function, the weights in the weighted binary cross-entropy loss were set according to the ratio of the number of positive samples to negative samples in the training set. The remaining loss hyperparameter was set to 0.005.
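The sample splitting and class weighting described in this section can be illustrated with the following Python sketch. It treats every matrix element as a sample, deals shuffled indices into folds, and derives a positive-class weight from the training split. The fold count of 10 (which gives the 9:1 ratio), the fixed seed, and the choice `pos_weight = n_neg / n_pos` are assumptions for illustration; the toy matrix is hypothetical.

```python
import random

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def split(folds, val_fold):
    """Use one fold for validation and the remaining folds for training."""
    train = [i for f, fold in enumerate(folds) if f != val_fold for i in fold]
    return train, list(folds[val_fold])

# Hypothetical toy association matrix: 4 drugs x 5 diseases, 3 positives.
X = [[1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0]]
samples = [(i, j) for i in range(len(X)) for j in range(len(X[0]))]
labels = [X[i][j] for i, j in samples]

folds = kfold_indices(len(samples), n_folds=10)
train_idx, val_idx = split(folds, val_fold=0)

# Positive-class weight from the training split only (assumed form).
n_pos = sum(labels[i] for i in train_idx)
n_neg = len(train_idx) - n_pos
pos_weight = n_neg / n_pos if n_pos else 1.0
```

Computing the weight from the training split alone keeps the validation labels out of the training configuration.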
3.2. Evaluation Metrics
We used two metrics, namely, the area under the receiver operating characteristic curve (AUROC) [29] and the area under the precision–recall curve (AUPRC) [30], to evaluate the performance of our model. These two metrics are widely used for evaluating the performance of binary classification models. The AUROC measures the trade-off between the true positive rate (TPR) and the false positive rate (FPR) across different classification thresholds. It represents the probability that a randomly selected positive sample will be ranked higher than a randomly selected negative sample by the classifier. In contrast, the AUPRC evaluates the trade-off between the precision and recall across different classification thresholds. It provides a more comprehensive assessment of the classifier’s performance, especially when dealing with imbalanced datasets where the positive and negative classes are significantly unequal.
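Both metrics can be computed directly from a list of prediction scores and binary labels. The Python sketch below uses the ranking interpretation of the AUROC given above and the step-wise average precision as an estimate of the AUPRC. The average-precision estimator, the assumption of distinct scores, and the toy values are choices made for this illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve; assumes distinct scores."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos

# Hypothetical toy predictions for six drug-disease pairs.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 0, 1, 1, 0, 0]

roc = auroc(scores, labels)              # 7 of 9 positive-negative pairs ranked correctly
ap = average_precision(scores, labels)   # mean of 1/1, 2/3, and 3/4
```

The pairwise AUROC computation is quadratic in the number of samples, which is fine for illustration but would be replaced by a rank-based formula for full association matrices.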
3.3. Comparison with Existing Methods
In this section, we present the results of the local leave-one-out 10-time 10-fold cross-validation, which compares the proposed PSCE method with six representative methods on the four datasets described in Section 2.1. The comparison examines the robustness and effectiveness of our PSCE for discovering novel drug candidates for new diseases without any treatment information. The six representative methods were MBiRW [22], BNNR [24], iDrug [25], NIMCGCN [26], DRHGCN [27], and DRWBNCF [28]. In this experiment, we used the AUROC and AUPRC metrics to evaluate the performance of the methods.
Table 1 presents the quantitative results of our PSCE method compared with six existing methods. In this table, we highlighted the best and second-best performances in red and blue, respectively. The results demonstrate that our method consistently achieved the best performance across the Gdataset, Cdataset, and Ldataset. In the LRSSL dataset experiment, although our method attained the second-best performance for the AUROC metric, it still achieved the best performance for the AUPRC metric. The last column of the table displays the mean performances across the four datasets, which shows that our method performed well on all datasets and achieved the best overall performance.
In Figure 2, we visualized the mean performance of this experiment on four datasets using a bar chart. This figure demonstrates that our PSCE method outperformed the others and achieved a state-of-the-art performance. To intuitively demonstrate the robustness and effectiveness, we visualized the performance of each repetition of the 10-time 10-fold cross-validation in Figure 3. We can observe that our method, like other methods, demonstrated consistent results across repeated experiments, where the outcomes remained within a certain range and exhibited no significant random fluctuations. This indicates that the quantitative results of our method are robust. The performance stability of our PSCE method was evident, where it consistently maintained a high performance. This visual confirmation aligned with the quantitative results presented in our table, which further verified the effectiveness of our method. Additionally, this stability across various datasets underscored the reliability of our approach in different experimental conditions. The robustness of our method suggests that it can be applied in practical scenarios while maintaining accuracy and efficiency. Overall, these observations highlight the strength and dependability of our PSCE method in achieving superior quantitative results. Compared with the existing methods, especially NIMCGCN and DRHGCN, which are also GCN-based methods, the proposed PSCE learned additional potential knowledge with subcategory pseudo-labels, and the experimental results demonstrated that our method could indeed achieve better and more robust performance than the existing ones.
3.4. Ablation Study on the Proposed PFEM and D3TC
This work presents a novel drug-repositioning model (PSCE) that incorporates two modules, PFEM and D3TC, which were designed for subcategory exploration. To investigate the effectiveness of these two modules, we conducted the ablation study detailed in this section. In these experiments, we compared the impacts of different combinations of the two modules, with the quantitative results reported in Table 2.
In Table 2 and Table 3, we see that when using each module individually, only a comparable performance could be achieved. However, combining both modules yielded the best performance; this even led to significant improvements, such as an increase of about 0.1–0.2 on the Cdataset. This not only indicates that both modules are effective but also that they are complementary. When integrated, the two modules leverage each other’s strengths, thus resulting in superior performance.
Figure 4 visually illustrates the quantitative results described above with a line chart. We can observe that combining the two modules achieved significant and stable improvements over using them individually. This visual representation further validated the effectiveness of our method.
4. Conclusions
In conclusion, our proposed PSCE model represents a significant advancement in the field of drug repositioning by effectively incorporating subcategory information into the prediction process. Through the innovative use of a prototype-based feature-enhancement mechanism (PFEM) and a dual-task classification head (D3TC), we demonstrated that it is possible to achieve more precise and reliable drug–disease association predictions. The PFEM’s clustering centroids and the D3TC’s subcategory exploration enable our model to leverage finer-grained pseudo-labels, thus providing a richer source of information compared with traditional binary classification methods. Experimental results on four public datasets confirmed that our PSCE model outperformed the current state-of-the-art approaches, which underscored the potential of our method to improve the accuracy and efficiency of drug-repositioning tasks. The effectiveness of both PFEM and D3TC was further validated through comprehensive ablation studies, which highlighted the robustness and applicability of our approach.