1. Introduction
Computational biology, a multidisciplinary field at the intersection of biology and computational science, has revolutionized our understanding of biological systems by employing computational techniques to analyze complex biological data [1,2]. In recent years, the integration of machine learning (ML) methods within computational biology has emerged as a powerful approach to unraveling biological complexities and making predictions about biological phenomena [3,4,5,6,7,8]. This integration lays the foundation for leveraging computational techniques to address the challenges associated with predicting drug synergy, a critical aspect of advancing combination therapy in drug discovery and personalized medicine.
ML techniques for drug synergy prediction fall into two main categories: classic ML-based methods [9,10,11,12,13] and deep learning (DL)-driven approaches [14,15,16,17,18]. Among the latter, the most advanced predictive models for drug synergy utilize graph convolutional layers, leveraging the effectiveness of graph-structured data in capturing intricate relationships among biological entities. GraphSynergy [19] employs a spatial-based graph convolutional network (GCN) component to embed the topological relationships of drug-targeted protein modules and cancer cell line-associated protein modules within the protein–protein interaction (PPI) network. An attention mechanism is utilized in GraphSynergy to identify crucial proteins involved in biomolecular interactions between drug combinations and cancer cell lines. DeepDDS [20] is another graph neural network (GNN)-based DL approach for drug synergy prediction. It encodes the gene expression profiles of cancer cell lines through a multilayer perceptron and embeds the simplified molecular-input line-entry system (SMILES) representations of drugs using a GCN or a graph attention network based on molecular graphs. The embedded drug and cell line features are then concatenated and fed into a fully connected network for drug synergy prediction. AttenSyn [21] adopts a graph-based drug-embedding module, comprising multiple GCN and long short-term memory (LSTM) layers, alongside an attention-based pooling module, to capture drug and cell line features. These embeddings are then concatenated, and a prediction module is employed to make final predictions regarding synergistic drug combinations.
Despite these encouraging reports, some shortcomings are evident. First, drug information and cell line features undergo independent processing, followed by their additive combination. This methodology fails to adequately capture the complex biological relationships between molecular and cellular entities. Second, SMILES strings are employed to represent drug features, focusing on chemical structure rather than the intricate interactions between drugs and proteins. While models built on such data may achieve high prediction accuracy, they risk lacking generalizability to unseen data, potentially memorizing existing patterns rather than capturing the underlying dynamics of drug–cell interactions. In response to these challenges, we recently developed SynerGNet [22], a GNN-based DL model designed to predict the synergistic effect of drug combinations. Distinguishing itself from other methods, SynerGNet constructs feature-rich graphs by directly integrating heterogeneous molecular and cellular attributes into the human PPI network. Furthermore, it utilizes drug–protein association scores as drug features, rather than relying solely on SMILES representations. This approach enables SynerGNet to capture the intricate biological relationships between drugs and cell lines, moving beyond simple memorization of drug properties. Indeed, comprehensive benchmarking calculations [22] indicated that SynerGNet outperforms a random forest model by 13.5% in balanced accuracy (BAC) and 11.2% in area under the curve (AUC), while reducing the false positive rate (FPR) by 40.4% in 5-fold cross-validation. Moreover, on an independent validation dataset from DrugCombDB [23], SynerGNet exhibited superior performance over PRODeepSyn [24], achieving a 19.6% higher BAC and reducing the FPR by 22.4%. These findings underscore the high predictive capability of SynerGNet in drug synergy prediction and its robust generalization to unseen data.
Data scarcity poses a significant challenge in developing robust ML models within computational biology, particularly in drug discovery, where model performance depends heavily on the quality, quantity, and contextual relevance of training data. In the realm of drug synergy prediction, addressing this challenge is vital. Data augmentation techniques, such as SMILES enumeration [25], reversing the order of drugs in drug pairs [26], and data up-sampling [27], are widely employed to expand the volume of available synergy data. Recently, an advanced synergy data augmentation protocol has been proposed, which replaces drugs in a pair with chemically and pharmacologically similar compounds [28]. With advanced models and ample training data available, it becomes imperative to examine the dynamics of model training when faced with substantial volumes of data. This exploration allows us to understand the nuances of model behavior and performance under varying data conditions, shedding light on the underlying mechanisms driving model learning and adaptation.
In the landscape of machine learning, where predictive accuracy often stands as the primary goal, a counterintuitive phenomenon can emerge during model training: integrating more data into the training set can increase the loss, seemingly contradicting the expectation of improved performance, while paradoxically enhancing predictive accuracy. This signifies a deeper complexity within the dynamics of model learning. Regularization mechanisms, on the other hand, have proven instrumental in mitigating overfitting [29,30,31]. By stabilizing training and preventing the model from learning spurious patterns in the data, regularization plays a crucial role in ensuring the robustness and generalizability of predictive models. With sufficient data at hand, gradually including more training data, instead of incorporating all available data at once, can unveil insights into various facets of model behavior and performance. First, it provides an opportunity to observe the model's adaptability to incremental changes in the training data, shedding light on its capacity to learn and generalize from diverse examples over time. Second, it allows for the systematic exploration of model convergence and stability, unveiling patterns and trends in the learning dynamics as the model encounters increasing volumes of data. Moreover, the incremental introduction of data enables detailed scrutiny of the interplay between model complexity, data quantity, and computational resources, offering valuable insights into the trade-offs and optimizations involved in model training. Additionally, this approach facilitates the identification and diagnosis of potential issues such as overfitting or underfitting, revealing nuances in model performance that may not be apparent when training with all data at once.
In this study, we delve into the detailed construction process of SynerGNet, exploring its architectural design and the selection of graph convolutional layers. Furthermore, we investigate the counterintuitive phenomenon observed during model training, evaluate the effectiveness of strong regularization techniques, and analyze the impact of augmented data integration on predictive performance. Through these investigations, we aim to provide insights into the nuanced dynamics of model learning and offer valuable guidance for optimizing predictive modeling in biomedical research.
3. Results
3.1. Feature Test with a Baseline Model
Performing drug synergy prediction against cancer cell lines requires molecular and cellular features. The drug affinity score (1-dimensional, 1D) was selected as the molecular feature, while gene expression (1D), copy number variation (3D), mutation (13D), and GO terms (200D) were chosen as the cellular features. To verify that the selected features improve model performance, reduce overfitting, and enhance interpretability, they were tested with a baseline model as described below.
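To make the feature dimensions concrete, the sketch below shows one way such heterogeneous attributes could be concatenated into a per-protein node vector; the lookup tables and function name are illustrative placeholders, not the exact SynerGNet implementation.

```python
import numpy as np

def build_node_features(pid, affinity, expression, cnv, mutation, go_terms):
    """Concatenate molecular and cellular attributes for one PPI node.

    Dimensions follow the text: drug affinity (1D), gene expression (1D),
    copy number variation (3D), mutation (13D), GO terms (200D) -> 218D total.
    How the two drugs of a pair contribute affinity values is a design choice
    not shown here; the lookup tables are hypothetical.
    """
    return np.concatenate([
        np.atleast_1d(affinity[pid]),    # 1D drug-protein affinity score
        np.atleast_1d(expression[pid]),  # 1D gene expression
        cnv[pid],                        # 3D copy number variation
        mutation[pid],                   # 13D mutation encoding
        go_terms[pid],                   # 200D GO-term encoding
    ])
```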
Initially, the model was trained using drug affinity scores and GO terms as input features. A five-fold cross-validation was conducted on the original synergy dataset to evaluate model performance. Subsequently, the remaining cellular features were integrated into the input to analyze their effect on model performance as additional features were introduced. GO terms were chosen as the initial cellular feature because of their higher dimensionality compared to the other cellular features, enabling the exploration of a broader feature space from the outset. The first two rows of Table 1 present the results of this test. It is evident that with the inclusion of more features, the model exhibits improved performance across all evaluation metrics. Notably, there is a 1.4% enhancement in the BAC, accompanied by a 3.3% reduction in the FPR. This improvement suggests that by incorporating all cellular features, the model enhances its predictive accuracy, achieving a better equilibrium between sensitivity and specificity, and demonstrating greater precision in identifying negative instances.
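For reference, the BAC and FPR quoted throughout follow their standard confusion-matrix definitions, which can be computed as in this minimal scikit-learn sketch (variable names are illustrative):

```python
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

def bac_and_fpr(y_true, y_pred):
    """Balanced accuracy and false positive rate for binary labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    bac = balanced_accuracy_score(y_true, y_pred)  # (TPR + TNR) / 2
    fpr = fp / (fp + tn)                           # FP / (FP + TN)
    return bac, fpr

# Example: bac_and_fpr([1, 0, 1, 1], [1, 0, 0, 1]) -> (0.833..., 0.0)
```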
3.2. Augmented Data Quality Test with a Baseline Model
To assess the effectiveness of the augmented synergy data, we employed a practical evaluation approach, i.e., training the baseline model exclusively using the augmented data and subsequently testing the model against the original data. This evaluation process was conducted through a five-fold cross-validation. Specifically, the original dataset was partitioned into five approximately equal subsets. In each iteration of the cross-validation, four subsets were designated as the “originating training set”, while the remaining one served as the test set. Augmented data generated from the “originating training set” was sampled and exclusively used in model training. The trained model was then evaluated on the original test set. This five-fold cross-validation procedure was repeated three times, with each round employing different samples of augmented data.
It should be noted that the “originating training set” was employed solely for augmented data sampling, and extra care was taken to avoid including augmented data generated from the test subset in the training process. The quantity of sampled augmented data in the training set was matched to the amount of original data, ensuring an equivalent representation. This involved replacing the original data in the training set with an equivalent volume of sampled augmented data, and exclusively using this augmented set to train the model.
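A minimal sketch of this leakage-aware protocol is given below, assuming a hypothetical `augmented_from` mapping from each original instance to its augmented derivatives; only augmented data derived from the originating training subset enters training, and evaluation uses original data only.

```python
import random
from sklearn.model_selection import StratifiedKFold

def augmented_only_folds(original, labels, augmented_from, seed=0):
    """Yield (train, test) splits: training uses only augmented data derived
    from the originating training subset, never from the test subset."""
    rng = random.Random(seed)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(original, labels):
        # Pool of augmented instances derived from the originating training set
        pool = [a for i in train_idx for a in augmented_from[original[i]]]
        # Match the augmented sample size to the original training-set size
        train_aug = rng.sample(pool, k=len(train_idx))
        test = [original[i] for i in test_idx]  # original instances only
        yield train_aug, test
```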
Table 1 presents the performance of the model trained with original data alongside that of the model trained with augmented data. The model trained on augmented data demonstrates comparable performance to the model trained on original data across all evaluation metrics in all three samples. This indicates the effectiveness and reliability of the augmentation procedure, affirming the quality of the augmented dataset.
It is also important to note that the performance comparison presented in Table 1 was designed to assess the quality and effectiveness of the augmented dataset independently of the original data. Importantly, the purpose was not to demonstrate performance improvement over the original dataset but rather to validate the augmentation procedure itself. The decision to train a separate model solely on the augmented dataset was deliberate, aiming to isolate the impact of the augmentation process on model training. This approach ensures a focused evaluation of the utility of augmented data in predicting drug synergy. To mitigate the risk of creating inaccurate labels within the augmented dataset, stringent similarity thresholds and quality checks guided the augmentation procedure [28], minimizing the inclusion of irrelevant instances and emphasizing biologically relevant data points. This evaluation strategy underscores a commitment to rigorously validating the augmented dataset and affirming its role in enhancing model training for drug synergy prediction. By demonstrating comparable performance metrics between models trained on the augmented and the original data, we establish the effectiveness and reliability of the augmented dataset in this experimental framework.
3.3. Performance of GNN with Different Convolutional Layers
Building upon the insights gained from the baseline model, we developed the final model, SynerGNet, to enhance prediction accuracy and handle complex data patterns more effectively. The final model extends the foundational elements of the baseline model with more advanced convolutional layers and structural additions. The training data comprised two parts: the original synergy data and an equal amount of sampled augmented synergy data. The inclusion of augmented data in this experiment was aimed at validating the model's ability to effectively process and learn from these data. This step ensures that model performance is robust not just on the original instances, but also on varied, augmented data. To ensure that differences in model performance were solely due to the convolutional layers and not influenced by dataset variability, all models were trained using the same set of sampled augmented data. This methodological choice provided a controlled environment for a fair and accurate comparison of layer efficacy.
Five candidate convolutional layers listed in Table 2 were evaluated with a five-fold cross-validation protocol. The original dataset was divided into five equal parts. In each iteration of the validation process, four subsets, along with their augmented counterparts, were used for training the model. The remaining subset, which had not been seen by the model during training, was then used for validation. From the performance reported in Table 2, it is evident that, within the scope of the five convolutional layers examined, the model achieves the best performance with GENConv. This configuration demonstrates superior performance relative to the other four across a range of metrics, with an AUC of 0.747, a BAC of 0.696, a precision (PPV) of 0.877, and a Matthews correlation coefficient (MCC) of 0.348. Concurrently, it yields the lowest FPR of 0.338.
In addition to prediction accuracy, time efficiency is also a critical factor in evaluating model performance. Utilizing an NVIDIA V100 GPU, the training durations per epoch for SAGEConv, GATv2Conv, GINConv, TransformerConv, and GENConv are 8.6, 18.8, 7.9, 37.0, and 15.8 s, respectively. These results indicate that the GNN model utilizing GENConv strikes an effective balance between high accuracy and time efficiency, making it a compelling choice for applications where both factors are important. Based on the above results, the model employing GENConv was chosen and designated as SynerGNet, dedicated to executing synergy prediction and examining the impact of augmented data integration during the training phase. The final architectural layout of SynerGNet is presented in Figure 1.
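The layer names in Table 2 correspond to classes in PyTorch Geometric; the sketch below shows a simplified GENConv-based network in that spirit. It is an illustrative approximation, not the exact architecture of Figure 1, and the hidden size and layer count are assumptions.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GENConv, global_mean_pool

class GENConvNet(torch.nn.Module):
    """Two GENConv message-passing layers followed by a graph-level readout."""
    def __init__(self, in_dim, hidden=128, n_classes=2, p_drop=0.65):
        super().__init__()
        self.conv1 = GENConv(in_dim, hidden, aggr='softmax', learn_t=True)
        self.conv2 = GENConv(hidden, hidden, aggr='softmax', learn_t=True)
        self.bn1 = torch.nn.BatchNorm1d(hidden)
        self.bn2 = torch.nn.BatchNorm1d(hidden)
        self.dropout = torch.nn.Dropout(p_drop)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = global_mean_pool(x, batch)      # per-graph embedding
        return self.head(self.dropout(x))   # logits: synergistic/antagonistic
```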
3.4. Initial Benchmarks of SynerGNet with Augmented Data
Initial benchmarking calculations were conducted using a training set that included 16 times the amount of augmented data as the original data, without changing the model parameters or regularization strength. The results in terms of the loss and predictive accuracy are presented in Figure 2. Interestingly, the testing loss began to rise after a certain number of epochs (Figure 2A). On the other hand, despite this increase in testing loss, both testing accuracy and balanced accuracy continued to improve (Figure 2B). Typically, a decrease in training loss accompanied by an increase in testing loss indicates that the model is overfitting the training set and performs poorly on unseen data. However, this unexpected outcome suggests that the model is effectively learning more robust and discriminative features from the data, despite the apparent increase in testing loss. It is commonly assumed that accuracy and loss exhibit an inverse relationship, as improved predictions typically lead to lower loss and higher accuracy. Our observations suggest that this relationship is not always strictly inverse. One reason could be that while the loss quantifies the disparity between raw model outputs (logits) and class labels (0 or 1), the accuracy evaluates the alignment between thresholded outputs (0 or 1) and the true labels. Variations in raw outputs directly impact the loss function, reflecting deviations from the true labels. In contrast, accuracy is more resilient: outputs must shift substantially, surpassing the decision threshold, before classification outcomes change.
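A toy example illustrates this decoupling: the per-instance cross-entropy can grow substantially while the thresholded predictions, and hence the accuracy, remain unchanged.

```python
import math

def nll(p_true):
    """Cross-entropy contribution of one instance, given the probability
    assigned to the true class."""
    return -math.log(p_true)

# Two correctly labeled instances (threshold 0.5) whose confidence drifts:
loss_a = (nll(0.90) + nll(0.80)) / 2   # ~0.164, accuracy 2/2
loss_b = (nll(0.60) + nll(0.55)) / 2   # ~0.554, accuracy still 2/2
```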
To delve deeper into this observation, we conducted a thorough examination of the softmax output for each instance at epochs 97 and 150. Our analysis revealed that while the test loss increased from 0.605 at epoch 97 to 0.614 at epoch 150, the testing accuracy also showed an improvement from 0.679 to 0.702, and the balanced accuracy increased from 0.659 to 0.687. Out of the 654 testing instances (comprising 501 synergistic and 153 antagonistic instances, representing 20% of the entire original dataset), 444 instances (349 synergistic and 95 antagonistic) were correctly labeled at epoch 97, and 459 instances (358 synergistic and 101 antagonistic) were correctly labeled at epoch 150. Notably, 425 instances were correctly labeled at both epochs.
Table 3 outlines six different scenarios for the loss change from epoch 97 to 150. Scenario 1 comprises instances correctly predicted at both epochs with improved prediction probability at epoch 150 compared to epoch 97, indicating greater prediction confidence for these instances. There are a total of 257 such instances, resulting in an average loss decrease of 0.072. Scenario 2 represents instances correctly labeled at both epochs but exhibiting degraded prediction probability at epoch 150 compared to epoch 97, implying decreased prediction confidence for these instances. In total, there are 168 such instances, resulting in an average loss increase of 0.081. Instances belonging to Scenario 3 were correctly labeled at epoch 97 but incorrectly labeled at epoch 150. A total of 19 such instances resulted in an average loss increase of 0.283. Scenario 4 includes instances mislabeled at epoch 97 but correctly labeled at epoch 150. A total of 34 instances exhibit this pattern, resulting in an average loss decrease of 0.290. Scenario 5 contains instances mislabeled at both epochs but displaying improved prediction probability at epoch 150 compared to epoch 97. A total of 59 instances demonstrate this trend, resulting in an average loss decrease of 0.169. Finally, Scenario 6 represents instances mislabeled at both epochs and exhibiting worse prediction probability at epoch 150 compared to epoch 97. There are 117 such instances, resulting in an average loss increase of 0.292.
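Given per-instance true-class probabilities at the two epochs, this six-way breakdown can be reproduced mechanically, as in the following sketch (a 0.5 decision threshold is assumed; names are illustrative):

```python
def scenario(p97, p150, threshold=0.5):
    """Classify one instance into the six scenarios of Table 3.

    p97, p150: predicted probability of the true class at epochs 97 and 150.
    """
    ok97, ok150 = p97 > threshold, p150 > threshold
    if ok97 and ok150:
        return 1 if p150 > p97 else 2   # correct at both epochs
    if ok97 and not ok150:
        return 3                        # correct -> incorrect
    if not ok97 and ok150:
        return 4                        # incorrect -> correct
    return 5 if p150 > p97 else 6       # incorrect at both epochs
```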
When good predictions degrade slightly (Scenario 2) and poor predictions deteriorate further (Scenario 6), the loss increases while the accuracy remains the same. Because the cross-entropy loss employs the logarithm of the output, erroneous predictions are penalized much more severely than correct predictions are rewarded. This is evident from Scenarios 3 and 4, where 34 accurate predictions yield a loss decrease of 0.290, while 19 incorrect predictions result in a loss increase of 0.283. Scenario 5 yields some loss reduction but does not impact accuracy, as the prediction probability fails to surpass the threshold. The combination of these scenarios results in an increase in accuracy and balanced accuracy but also an increase in the loss. When both accuracy and loss increase simultaneously, the model starts to overfit and some instances from the test set are incorrectly classified (as observed in Scenarios 3 and 6), with the effect amplified by this “loss asymmetry”, illustrated below. However, the model persists in learning useful patterns for generalization, exemplified by Scenario 4, where more instances are correctly classified. This raises the question of whether to stop the learning process upon the emergence of spurious patterns, given the concurrent acquisition of useful insights that contribute to improved performance [44].
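The loss asymmetry follows directly from the logarithm in the cross-entropy: for a symmetric shift of ±0.2 in the true-class probability p, the per-instance loss −log p changes very differently on the two sides of the decision boundary.

```latex
% Per-instance cross-entropy: L(p) = -\ln p, with p the true-class probability.
L(p) = -\ln p, \qquad
\Delta L_{\,0.7 \to 0.9} = \ln\tfrac{0.7}{0.9} \approx -0.25, \qquad
\Delta L_{\,0.3 \to 0.1} = \ln\tfrac{0.3}{0.1} \approx +1.10
% A correct prediction gaining 0.2 in confidence lowers the loss by only 0.25,
% whereas a wrong prediction losing 0.2 raises it by 1.10.
```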
3.5. Strong Regularization for Robust Learning
While the overfitting phenomenon discussed above may not inherently denote a negative outcome, it is essential to prevent the model from learning training-specific patterns and maintain its ability to generalize to unseen data. To achieve this goal, batch normalization, L2 regularization, and dropout layers were employed. Batch normalization addresses the issue of internal covariate shift, stabilizing the training process by normalizing the activations of each layer. It not only accelerates convergence but also improves the model generalization performance by reducing the sensitivity to the scale and distribution of input features. L2 regularization adds a penalty term to the loss function proportional to the sum of the squared weights of the model. By penalizing large weights, L2 regularization discourages the model from fitting noise in training instances and promotes a smoother decision boundary, leading to improved generalization to unseen data. Dropout mitigates overfitting by introducing stochasticity into the training process. By randomly deactivating a fraction of units during each training iteration, dropout prevents co-adaptation among neurons and encourages the model to learn more robust features. This technique effectively reduces the model reliance on specific activations, thereby enhancing its generalization ability.
L2 regularization is implemented in SynerGNet by adjusting the weight decay parameter within the Adam optimization algorithm. Initially, when the amount of augmented data added to the training set is less than 16 times the original training data, this parameter is set to 10^−6. However, once the augmented data surpasses 16 times the original training data, the parameter is increased to 0.01, thereby increasing the regularization strength and inducing greater weight shrinkage. In the dropout layer, the dropout probability is set to 0.65 when the augmented data remains less than 16 times the original training data. This implies that 65% of the input units are randomly zeroed out during training. Subsequently, when the augmented data reaches 16 times the original training data, the dropout probability is raised to 0.75. By increasing the number of input units randomly zeroed out, the model is encouraged to rely on a wider array of features during training, thereby enhancing the regularization effect and mitigating overfitting.
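In PyTorch terms, this schedule amounts to switching the optimizer's weight_decay and the dropout probability once the augmented-to-original ratio reaches 16. The sketch below encodes the settings described above; the learning rate, input dimensionality, and model constructor are illustrative assumptions.

```python
import torch

def regularization_settings(aug_ratio):
    """Weight decay and dropout probability as described in the text.

    aug_ratio: amount of augmented data as a multiple of the original
    training data.
    """
    if aug_ratio < 16:
        return 1e-6, 0.65   # weaker L2, 65% of units zeroed out
    return 1e-2, 0.75       # stronger L2, 75% of units zeroed out

weight_decay, p_drop = regularization_settings(aug_ratio=16)
# GENConvNet from the earlier sketch; any nn.Module would do here.
model = GENConvNet(in_dim=218, p_drop=p_drop)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,                  # illustrative value
                             weight_decay=weight_decay)
```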
Figure 2C,D demonstrate that this strong regularization mechanism effectively eliminated the upward trend observed in the testing loss curve in Figure 2A, while maintaining the consistent improvement in predictive accuracy (Figure 2B,D).
It is noteworthy that these regularization techniques were not merely employed as routine practices but were strategically integrated and implemented to enhance model robustness and generalization capabilities. Importantly, our adaptation of these techniques in SynerGNet includes nuanced adjustments, such as dynamically tuning the L2 regularization strength and dropout probability based on the scale of augmented data, which significantly contributed to maintaining model performance and preventing overfitting throughout training. This rigorous regularization approach is integral to our methodology and underscores its role in achieving superior predictive accuracy and generalization, as demonstrated in our experimental results.
3.6. Gradual Integration of Augmented Data
Rather than utilizing all 2,315,325 augmented synergy instances at once, we explored the impact of gradually integrating augmented data during the training phase on the model's predictive performance. We sought to understand whether the incremental addition of augmented data consistently improves the model's generalization ability. In these benchmarks, a five-fold cross-validation was employed. The original synergy dataset was first partitioned into five equal subsets while maintaining the same synergistic/antagonistic ratio as the entire original dataset. Next, for each cross-validation cycle, four subsets were allocated for training, while the remaining subset served as the test set. To gradually incorporate augmented data during the training phase, we first included an amount of augmented data equivalent to the original dataset. We then conducted five additional rounds of five-fold cross-validation, each time doubling the amount of augmented data in the training set. Throughout all experiments, the test set exclusively comprised original instances with experimental labels.
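A compact sketch of this doubling schedule, reusing the hypothetical `augmented_from` mapping introduced earlier (the split and sampling details are illustrative):

```python
import random
from sklearn.model_selection import StratifiedKFold

def gradual_augmentation(original, labels, augmented_from, seed=0):
    """Five-fold CV rounds with 1x, 2x, 4x, 8x, 16x, 32x augmented data.

    Each training fold combines the original subset with a sampled multiple of
    augmented instances derived only from that subset; test folds stay
    original-only, as in the text.
    """
    rng = random.Random(seed)
    for ratio in (1, 2, 4, 8, 16, 32):
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        for tr, te in skf.split(original, labels):
            pool = [a for i in tr for a in augmented_from[original[i]]]
            aug = rng.sample(pool, k=ratio * len(tr))
            train = [original[i] for i in tr] + aug
            test = [original[i] for i in te]
            yield ratio, train, test
```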
With the gradual integration of augmented data into the training set and an appropriate regularization strength, the model consistently exhibited enhanced performance. A detailed performance evaluation is reported in Table 4. To establish a benchmark for comparison, we first conducted a five-fold cross-validation exclusively on the original data. When trained exclusively on the original data, the model yields an AUC of 0.721, a BAC of 0.676, and a PPV of 0.863. The FPR is 0.380, and the MCC between the predicted and true labels is 0.313. We then added an amount of augmented data equal to the size of the original data for training in each fold of the five-fold cross-validation and tested the trained model on the original test set. The model showed improved performance across all metrics. Specifically, the AUC increased by 1.8%, the BAC by 1.6%, and the PPV by 0.9%, while the FPR decreased by 2.1% and the MCC increased by 3.1%.
Table 4 and Figure 3 show that with the gradual inclusion of more augmented data into the training set, the model consistently demonstrated improved performance. Figure 3A illustrates the increasing trend of the BAC as a greater amount of augmented data was incorporated into the training process. Figure 3B shows that the testing loss kept decreasing as the model was exposed to a greater variety of training instances, demonstrating that the inclusion of augmented data enhances the model's ability to generalize. With the quantity of augmented data at 32 times the size of the original training data, notable enhancements were observed across performance metrics: the AUC increased by 5.8%, the BAC by 5.5%, and the PPV by 3.0%, while the FPR decreased by 7.8% and the MCC between the predicted and true labels increased by 10%. It is important to note that, in each case, five distinct samples of augmented data were employed. Therefore, Table 4 and Figure 3 report the performance as the average ± standard deviation across these five samples. The use of multiple samples of augmented data ensures the robustness of the results, accurately representing the dataset and instilling statistical confidence in the findings.
4. Discussion
The development of SynerGNet for predicting drug synergy against cancer cell lines provides a compelling tool in the field of computational biology and pharmacology. By leveraging a combination of advanced ML techniques, comprehensive feature selection, and careful evaluation methodologies, we have demonstrated the potential for robust and accurate prediction of the synergistic effects of drug combinations. One of the key findings of our study is the effectiveness of data augmentation in addressing the challenge of limited data availability. The augmentation process, which generates additional instances by replacing drugs in drug pairs based on similarity scores, significantly expanded the dataset and improved the generalization ability of the model. The comparable performance observed between models trained on the original data and those trained on an equivalent amount of augmented data underscores the quality and dependability of the augmented dataset. Furthermore, our analysis of feature importance underscores the significance of comprehensive feature selection in enhancing prediction performance. By incorporating molecular and cellular features into our model, we were able to capture a diverse range of biological factors influencing drug synergy. The incremental addition of features led to improvements in model performance, indicating the importance of considering multiple dimensions of biological data in predictive modeling.
The detailed analysis of the interplay between testing loss and model performance underscores the nuanced dynamics within the model. While an increase in testing loss usually signals overfitting and reduced generalization, our findings indicate a more complex relationship. Throughout our investigation, we identified scenarios where model performance improved despite a rise in testing loss. This intriguing observation suggests that an elevation in testing loss does not necessarily warrant the termination of model training, especially if the model continues to learn more resilient and discriminative patterns. It underscores the importance of conducting in-depth analyses to elucidate the underlying mechanisms that shape model performance. Additionally, our study demonstrates the importance of strong regularization techniques in preventing overfitting and maintaining model generalization. By employing batch normalization, L2 regularization, and dropout layers, we were able to stabilize training, reduce overfitting, and improve model performance. The careful tuning of regularization parameters played a crucial role in ensuring the model's robustness and reliability.
Our investigation into the impact of incremental augmented data inclusion provides valuable insights into the scalability and effectiveness of our approach. By gradually increasing the amount of augmented data in the training set, we observed consistent improvements across various performance metrics. This highlights the potential for further enhancements through the continued integration of augmented data, as well as the importance of computational resources in supporting large-scale training efforts. The accuracy of SynerGNet can further be improved by integrating advanced graph convolutional layers, such as SuperGATConv [45], EGConv [46], and SSGConv [47], along with developing a more sophisticated model structure. This includes multi-scale feature integration, where different branches or layers focus on various scales of input data; residual connections to alleviate the vanishing gradient problem and facilitate deeper network training; and attention mechanisms to dynamically weigh the importance of protein interactions. These enhancements are designed to better process complex biological data, enabling more accurate and nuanced predictions, thus improving the overall performance of SynerGNet.
In conclusion, SynerGNet is a powerful tool for predicting drug synergy against cancer cell lines. Through a combination of data augmentation, comprehensive feature selection, model construction, and regularization techniques, we have demonstrated the potential for accurate and reliable prediction of drug interactions. Our findings contribute to the ongoing efforts to harness the power of computational approaches in drug discovery and personalized medicine, offering new opportunities for accelerating the development of effective cancer treatments.