1. Introduction
Computational biology, a multidisciplinary field at the intersection of biology and computational science, has revolutionized our understanding of biological systems by employing computational techniques to analyze complex biological data [1,2]. In recent years, the integration of machine learning (ML) methods within computational biology has emerged as a powerful approach to unraveling biological complexities and making predictions about biological phenomena [3,4,5,6,7,8]. This integration lays the foundation for leveraging computational techniques to address the challenges associated with predicting drug synergy, a critical aspect of advancing combination therapy in drug discovery and personalized medicine.
ML techniques for drug synergy prediction fall into two main categories: classic ML-based methods [9,10,11,12,13] and deep learning (DL)-driven approaches [14,15,16,17,18]. Among the latter, the most advanced predictive models for drug synergy utilize graph convolutional layers, leveraging the effectiveness of graph-structured data in capturing intricate relationships among biological entities. GraphSynergy [19] employs a spatial-based graph convolutional network (GCN) component to embed the topological relationships of drug-targeted protein modules and cancer cell line-associated protein modules within the protein–protein interaction (PPI) network. An attention mechanism is utilized in GraphSynergy to identify crucial proteins involved in biomolecular interactions between drug combinations and cancer cell lines. DeepDDS [20] is another graph neural network (GNN)-based DL approach for drug synergy prediction. It encodes the gene expression profiles of cancer cell lines through a multilayer perceptron and embeds the simplified molecular-input line-entry system (SMILES) representations of drugs using a GCN or a graph attention network based on molecular graphs. The embedded drug and cell line features are then concatenated and fed into a fully connected network for drug synergy prediction. AttenSyn [21] adopts a graph-based drug-embedding module, comprising multiple GCN and long short-term memory (LSTM) layers, alongside an attention-based pooling module, to capture drug and cell line features. These embeddings are then concatenated, and a prediction module is employed to make final predictions regarding synergistic drug combinations.
Despite these encouraging reports, some shortcomings are evident. First, drug information and cell line features undergo independent processing, followed by their additive combination. This methodology fails to adequately capture the complex biological relationships between molecular and cellular entities. Second, SMILES strings are employed to represent drug features, focusing on chemical structure rather than the intricate interactions between drugs and proteins. While models built on such data may achieve high prediction accuracy, they risk lacking generalizability to unseen data, potentially memorizing existing patterns rather than capturing the underlying dynamics of drug–cell interactions. In response to these challenges, we recently developed SynerGNet [22], a GNN-based DL model designed to predict the synergistic effect of drug combinations. Distinguishing itself from other methods, SynerGNet constructs feature-rich graphs by directly integrating heterogeneous molecular and cellular attributes into the human PPI network. Furthermore, it utilizes drug–protein association scores as drug features, rather than relying solely on SMILES representations. This approach enables SynerGNet to capture the intricate biological relationships between drugs and cell lines, moving beyond simple memorization of drug properties. Indeed, comprehensive benchmarking calculations [22] indicated that SynerGNet outperforms a random forest model by 13.5% in balanced accuracy (BAC) and 11.2% in area under the curve (AUC), while reducing the false positive rate (FPR) by 40.4% in 5-fold cross-validation. Moreover, on an independent validation dataset from DrugCombDB [23], SynerGNet exhibited superior performance over PRODeepSyn [24], achieving a 19.6% higher BAC and reducing the FPR by 22.4%. These findings underscore the high predictive capability of SynerGNet in drug synergy prediction and its robust generalization to unseen data.
Data scarcity poses a significant challenge in developing robust ML models within computational biology, particularly in drug discovery, where model performance depends heavily on the quality, quantity, and contextual relevance of training data. In the realm of drug synergy prediction, addressing this challenge is vital. Data augmentation techniques, such as SMILES enumeration [25], reversing the order of drugs in drug pairs [26], and data up-sampling [27], are widely employed to expand the volume of available synergy data. Recently, an advanced synergy data augmentation protocol has been proposed, which replaces drugs in a pair with chemically and pharmacologically similar compounds [28]. With advanced models and ample training data available, it becomes imperative to examine the dynamics of model training when faced with substantial volumes of data. This exploration allows us to understand the nuances of model behavior and performance under varying data conditions, shedding light on the underlying mechanisms driving model learning and adaptation.
In the landscape of machine learning, where predictive accuracy often stands as the primary goal, a counterintuitive phenomenon can emerge during model training: integrating more data into the training set can increase the loss, seemingly contradicting the expectation of improved performance, while paradoxically enhancing predictive accuracy. This signifies a deeper complexity within the dynamics of model learning. Regularization mechanisms, on the other hand, have proven instrumental in mitigating overfitting [29,30,31]. By stabilizing training and preventing the model from learning spurious patterns in the data, regularization plays a crucial role in ensuring the robustness and generalizability of predictive models. With sufficient data at hand, gradually including more training data, instead of incorporating all available data at once, can unveil insights into various facets of model behavior and performance. First, it provides an opportunity to observe the model's adaptability to incremental changes in the training data, shedding light on its capacity to learn and generalize from diverse examples over time. Second, it allows for the systematic exploration of model convergence and stability, unveiling patterns and trends in the learning dynamics as the model encounters increasing volumes of data. Moreover, the incremental introduction of data enables detailed scrutiny of the interplay between model complexity, data quantity, and computational resources, offering valuable insights into the trade-offs and optimizations involved in model training. Additionally, this approach facilitates the identification and diagnosis of potential issues such as overfitting or underfitting, revealing nuances in model performance that may not be apparent when training with all data at once.
In this study, we delve into the detailed construction process of SynerGNet, exploring its architectural design and the selection of graph convolutional layers. Furthermore, we investigate the counterintuitive phenomenon observed during model training, evaluate the effectiveness of strong regularization techniques, and analyze the impact of augmented data integration on predictive performance. Through these investigations, we aim to provide insights into the nuanced dynamics of model learning and offer valuable guidance for optimizing predictive modeling in biomedical research.
3. Results
3.1. Feature Test with a Baseline Model
Performing drug synergy prediction against cancer cell lines requires molecular and cellular features. The drug affinity score (1-dimensional, 1D) was selected as the molecular feature, while gene expression (1D), copy number variation (3D), mutation (13D), and GO terms (200D) were chosen as the cellular features. To verify that the selected features improve model performance, reduce overfitting, and enhance interpretability, they were tested with a baseline model as described below.
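To make the feature dimensions concrete, the sketch below shows one way such heterogeneous attributes could be concatenated into a per-protein node vector; the lookup tables and function name are illustrative placeholders, not the exact SynerGNet implementation.

```python
import numpy as np

def build_node_features(pid, affinity, expression, cnv, mutation, go_terms):
    """Concatenate molecular and cellular attributes for one PPI node.

    Dimensions follow the text: drug affinity (1D), gene expression (1D),
    copy number variation (3D), mutation (13D), GO terms (200D) -> 218D total.
    How the two drugs of a pair contribute affinity values is a design choice
    not shown here; the lookup tables are hypothetical.
    """
    return np.concatenate([
        np.atleast_1d(affinity[pid]),    # 1D drug-protein affinity score
        np.atleast_1d(expression[pid]),  # 1D gene expression
        cnv[pid],                        # 3D copy number variation
        mutation[pid],                   # 13D mutation encoding
        go_terms[pid],                   # 200D GO-term encoding
    ])
```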
Initially, the model was trained using drug affinity scores and GO terms as input features. A five-fold cross-validation was conducted on the original synergy dataset to evaluate model performance. Subsequently, the remaining cellular features were integrated into the input to analyze their effect on model performance as additional features were introduced. GO terms were chosen as the initial cellular feature because of their higher dimensionality compared to the other cellular features, enabling the exploration of a broader feature space from the outset. The first two rows of Table 1 present the results of this test. It is evident that with the inclusion of more features, the model exhibits improved performance across all evaluation metrics. Notably, there is a 1.4% enhancement in the BAC, accompanied by a 3.3% reduction in the FPR. This improvement suggests that by incorporating all cellular features, the model enhances its predictive accuracy, achieving a better equilibrium between sensitivity and specificity, and demonstrating greater precision in identifying negative instances.
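For reference, the BAC and FPR quoted throughout follow their standard confusion-matrix definitions, which can be computed as in this minimal scikit-learn sketch (variable names are illustrative):

```python
from sklearn.metrics import confusion_matrix, balanced_accuracy_score

def bac_and_fpr(y_true, y_pred):
    """Balanced accuracy and false positive rate for binary labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    bac = balanced_accuracy_score(y_true, y_pred)  # (TPR + TNR) / 2
    fpr = fp / (fp + tn)                           # FP / (FP + TN)
    return bac, fpr

# Example: bac_and_fpr([1, 0, 1, 1], [1, 0, 0, 1]) -> (0.833..., 0.0)
```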
3.2. Augmented Data Quality Test with a Baseline Model
To assess the effectiveness of the augmented synergy data, we employed a practical evaluation approach, i.e., training the baseline model exclusively using the augmented data and subsequently testing the model against the original data. This evaluation process was conducted through a five-fold cross-validation. Specifically, the original dataset was partitioned into five approximately equal subsets. In each iteration of the cross-validation, four subsets were designated as the “originating training set”, while the remaining one served as the test set. Augmented data generated from the “originating training set” was sampled and exclusively used in model training. The trained model was then evaluated on the original test set. This five-fold cross-validation procedure was repeated three times, with each round employing different samples of augmented data.
It should be noted that the “originating training set” was employed solely for augmented data sampling, and extra care was taken to avoid including augmented data generated from the test subset in the training process. The quantity of sampled augmented data in the training set was matched to the amount of original data, ensuring an equivalent representation. This involved replacing the original data in the training set with an equivalent volume of sampled augmented data, and exclusively using this augmented set to train the model.
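A minimal sketch of this leakage-aware protocol is given below, assuming a hypothetical `augmented_from` mapping from each original instance to its augmented derivatives; only augmented data derived from the originating training subset enters training, and evaluation uses original data only.

```python
import random
from sklearn.model_selection import StratifiedKFold

def augmented_only_folds(original, labels, augmented_from, seed=0):
    """Yield (train, test) splits: training uses only augmented data derived
    from the originating training subset, never from the test subset."""
    rng = random.Random(seed)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(original, labels):
        # Pool of augmented instances derived from the originating training set
        pool = [a for i in train_idx for a in augmented_from[original[i]]]
        # Match the augmented sample size to the original training-set size
        train_aug = rng.sample(pool, k=len(train_idx))
        test = [original[i] for i in test_idx]  # original instances only
        yield train_aug, test
```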
Table 1 presents the performance of the model trained with original data alongside that of the model trained with augmented data. The model trained on augmented data demonstrates comparable performance to the model trained on original data across all evaluation metrics in all three samples. This indicates the effectiveness and reliability of the augmentation procedure, affirming the quality of the augmented dataset.
It is also important to note that the performance comparison presented in Table 1 was designed to assess the quality and effectiveness of the augmented dataset independently of the original data. Importantly, the purpose was not to demonstrate performance improvement over the original dataset but rather to validate the augmentation procedure itself. The decision to train a separate model solely on the augmented dataset was deliberate, aiming to isolate the impact of the augmentation process on model training. This approach ensures a focused evaluation of the utility of augmented data in predicting drug synergy. To mitigate the risk of creating inaccurate labels within the augmented dataset, stringent similarity thresholds and quality checks guided the augmentation procedure [28], minimizing the inclusion of irrelevant instances and emphasizing biologically relevant data points. This evaluation strategy underscores a commitment to rigorously validating the augmented dataset and affirming its role in enhancing model training for drug synergy prediction. By demonstrating comparable performance metrics between models trained on the augmented and the original data, we establish the effectiveness and reliability of the augmented dataset in this experimental framework.
3.3. Performance of GNN with Different Convolutional Layers
Building upon the insights gained from the baseline model, we developed the final model, SynerGNet, to enhance prediction accuracy and handle complex data patterns more effectively. The final model extends the foundational elements of the baseline model with more advanced convolutional layers and structural additions. The training data comprised two parts: the original synergy data and an equal amount of sampled augmented synergy data. The inclusion of augmented data in this experiment was aimed at validating the model's ability to effectively process and learn from these data. This step ensures that model performance is robust not just on the original instances, but also on varied, augmented data. To ensure that differences in model performance were solely due to the convolutional layers and not influenced by dataset variability, all models were trained using the same set of sampled augmented data. This methodological choice provided a controlled environment for a fair and accurate comparison of layer efficacy.
Five candidate convolutional layers listed in Table 2 were evaluated with a five-fold cross-validation protocol. The original dataset was divided into five equal parts. In each iteration of the validation process, four subsets, along with their augmented counterparts, were used for training the model. The remaining subset, which had not been seen by the model during training, was then used for validation. From the performance reported in Table 2, it is evident that, within the scope of the five convolutional layers examined, the model achieves the best performance with GENConv. This configuration demonstrates superior performance relative to the other four across a range of metrics, with an AUC of 0.747, a BAC of 0.696, a precision (PPV) of 0.877, and a Matthews correlation coefficient (MCC) of 0.348. Concurrently, it yields the lowest FPR of 0.338.
In addition to prediction accuracy, time efficiency is also a critical factor in evaluating model performance. Utilizing an NVIDIA V100 GPU, the training durations per epoch for SAGEConv, GATv2Conv, GINConv, TransformerConv, and GENConv are 8.6, 18.8, 7.9, 37.0, and 15.8 s, respectively. These results indicate that the GNN model utilizing GENConv strikes an effective balance between high accuracy and time efficiency, making it a compelling choice for applications where both factors are important. Based on the above results, the model employing GENConv was chosen and designated as SynerGNet, dedicated to executing synergy prediction and examining the impact of augmented data integration during the training phase. The final architectural layout of SynerGNet is presented in Figure 1.
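The layer names in Table 2 correspond to classes in PyTorch Geometric; the sketch below shows a simplified GENConv-based network in that spirit. It is an illustrative approximation, not the exact architecture of Figure 1, and the hidden size and layer count are assumptions.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GENConv, global_mean_pool

class GENConvNet(torch.nn.Module):
    """Two GENConv message-passing layers followed by a graph-level readout."""
    def __init__(self, in_dim, hidden=128, n_classes=2, p_drop=0.65):
        super().__init__()
        self.conv1 = GENConv(in_dim, hidden, aggr='softmax', learn_t=True)
        self.conv2 = GENConv(hidden, hidden, aggr='softmax', learn_t=True)
        self.bn1 = torch.nn.BatchNorm1d(hidden)
        self.bn2 = torch.nn.BatchNorm1d(hidden)
        self.dropout = torch.nn.Dropout(p_drop)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.bn1(self.conv1(x, edge_index)))
        x = F.relu(self.bn2(self.conv2(x, edge_index)))
        x = global_mean_pool(x, batch)      # per-graph embedding
        return self.head(self.dropout(x))   # logits: synergistic/antagonistic
```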
3.4. Initial Benchmarks of SynerGNet with Augmented Data
Initial benchmarking calculations were conducted using a training set that included 16 times the amount of augmented data as the original data, without changing the model parameters or regularization strength. The results in terms of the loss and predictive accuracy are presented in Figure 2. Interestingly, the testing loss began to rise after a certain number of epochs (Figure 2A). On the other hand, despite this increase in testing loss, both testing accuracy and balanced accuracy continued to improve (Figure 2B). Typically, a decrease in training loss accompanied by an increase in testing loss indicates that the model is overfitting the training set and performs poorly on unseen data. However, this unexpected outcome suggests that the model is effectively learning more robust and discriminative features from the data, despite the apparent increase in testing loss. It is commonly assumed that accuracy and loss exhibit an inverse relationship, as improved predictions typically lead to lower loss and higher accuracy. Our observations suggest that this relationship is not always strictly inverse. One reason could be that while the loss quantifies the disparity between raw model outputs (logits) and class labels (0 or 1), the accuracy evaluates the alignment between thresholded outputs (0 or 1) and the true labels. Variations in raw outputs directly impact the loss function, reflecting deviations from the true labels. In contrast, accuracy is more resilient: outputs must shift substantially, surpassing the decision threshold, before classification outcomes change.
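A toy example illustrates this decoupling: the per-instance cross-entropy can grow substantially while the thresholded predictions, and hence the accuracy, remain unchanged.

```python
import math

def nll(p_true):
    """Cross-entropy contribution of one instance, given the probability
    assigned to the true class."""
    return -math.log(p_true)

# Two correctly labeled instances (threshold 0.5) whose confidence drifts:
loss_a = (nll(0.90) + nll(0.80)) / 2   # ~0.164, accuracy 2/2
loss_b = (nll(0.60) + nll(0.55)) / 2   # ~0.554, accuracy still 2/2
```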
To delve deeper into this observation, we conducted a thorough examination of the softmax output for each instance at epochs 97 and 150. Our analysis revealed that while the test loss increased from 0.605 at epoch 97 to 0.614 at epoch 150, the testing accuracy also showed an improvement from 0.679 to 0.702, and the balanced accuracy increased from 0.659 to 0.687. Out of the 654 testing instances (comprising 501 synergistic and 153 antagonistic instances, representing 20% of the entire original dataset), 444 instances (349 synergistic and 95 antagonistic) were correctly labeled at epoch 97, and 459 instances (358 synergistic and 101 antagonistic) were correctly labeled at epoch 150. Notably, 425 instances were correctly labeled at both epochs.
Table 3 outlines six different scenarios for the loss change from epoch 97 to 150. Scenario 1 comprises instances correctly predicted at both epochs with improved prediction probability at epoch 150 compared to epoch 97, indicating greater prediction confidence for these instances. There are a total of 257 such instances, resulting in an average loss decrease of 0.072. Scenario 2 represents instances correctly labeled at both epochs but exhibiting degraded prediction probability at epoch 150 compared to epoch 97, implying decreased prediction confidence for these instances. In total, there are 168 such instances, resulting in an average loss increase of 0.081. Instances belonging to Scenario 3 were correctly labeled at epoch 97 but incorrectly labeled at epoch 150. A total of 19 such instances resulted in an average loss increase of 0.283. Scenario 4 includes instances mislabeled at epoch 97 but correctly labeled at epoch 150. A total of 34 instances exhibit this pattern, resulting in an average loss decrease of 0.290. Scenario 5 contains instances mislabeled at both epochs but displaying improved prediction probability at epoch 150 compared to epoch 97. A total of 59 instances demonstrate this trend, resulting in an average loss decrease of 0.169. Finally, Scenario 6 represents instances mislabeled at both epochs and exhibiting worse prediction probability at epoch 150 compared to epoch 97. There are 117 such instances, resulting in an average loss increase of 0.292.
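Given per-instance true-class probabilities at the two epochs, this six-way breakdown can be reproduced mechanically, as in the following sketch (a 0.5 decision threshold is assumed; names are illustrative):

```python
def scenario(p97, p150, threshold=0.5):
    """Classify one instance into the six scenarios of Table 3.

    p97, p150: predicted probability of the true class at epochs 97 and 150.
    """
    ok97, ok150 = p97 > threshold, p150 > threshold
    if ok97 and ok150:
        return 1 if p150 > p97 else 2   # correct at both epochs
    if ok97 and not ok150:
        return 3                        # correct -> incorrect
    if not ok97 and ok150:
        return 4                        # incorrect -> correct
    return 5 if p150 > p97 else 6       # incorrect at both epochs
```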
When good predictions degrade slightly (Scenario 2) and poor predictions deteriorate further (Scenario 6), the loss increases while the accuracy remains the same. Because the cross-entropy loss employs the logarithm of the output, erroneous predictions are penalized much more severely than correct predictions are rewarded. This is evident from Scenarios 3 and 4, where 34 accurate predictions yield a loss decrease of 0.290, while 19 incorrect predictions result in a loss increase of 0.283. Scenario 5 yields some loss reduction but does not impact accuracy, as the prediction probability fails to surpass the threshold. The combination of these scenarios results in an increase in accuracy and balanced accuracy but also an increase in the loss. When both accuracy and loss increase simultaneously, the model starts to overfit and some instances from the test set are incorrectly classified (as observed in Scenarios 3 and 6), with the effect amplified by this “loss asymmetry”, illustrated below. However, the model persists in learning useful patterns for generalization, exemplified by Scenario 4, where more instances are correctly classified. This raises the question of whether to stop the learning process upon the emergence of spurious patterns, given the concurrent acquisition of useful insights that contribute to improved performance [44].
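The loss asymmetry follows directly from the logarithm in the cross-entropy: for a symmetric shift of ±0.2 in the true-class probability p, the per-instance loss −log p changes very differently on the two sides of the decision boundary.

```latex
% Per-instance cross-entropy: L(p) = -\ln p, with p the true-class probability.
L(p) = -\ln p, \qquad
\Delta L_{\,0.7 \to 0.9} = \ln\tfrac{0.7}{0.9} \approx -0.25, \qquad
\Delta L_{\,0.3 \to 0.1} = \ln\tfrac{0.3}{0.1} \approx +1.10
% A correct prediction gaining 0.2 in confidence lowers the loss by only 0.25,
% whereas a wrong prediction losing 0.2 raises it by 1.10.
```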
3.5. Strong Regularization for Robust Learning
While the overfitting phenomenon discussed above may not inherently denote a negative outcome, it is essential to prevent the model from learning training-specific patterns and maintain its ability to generalize to unseen data. To achieve this goal, batch normalization, L2 regularization, and dropout layers were employed. Batch normalization addresses the issue of internal covariate shift, stabilizing the training process by normalizing the activations of each layer. It not only accelerates convergence but also improves the model generalization performance by reducing the sensitivity to the scale and distribution of input features. L2 regularization adds a penalty term to the loss function proportional to the sum of the squared weights of the model. By penalizing large weights, L2 regularization discourages the model from fitting noise in training instances and promotes a smoother decision boundary, leading to improved generalization to unseen data. Dropout mitigates overfitting by introducing stochasticity into the training process. By randomly deactivating a fraction of units during each training iteration, dropout prevents co-adaptation among neurons and encourages the model to learn more robust features. This technique effectively reduces the model reliance on specific activations, thereby enhancing its generalization ability.
L2 regularization is implemented in SynerGNet by adjusting the weight decay parameter within the Adam optimization algorithm. Initially, when the amount of augmented data added to the training set is less than 16 times the original training data, this parameter is set to 10^−6. However, once the augmented data surpasses 16 times the original training data, the parameter is increased to 0.01, thereby increasing the regularization strength and inducing greater weight shrinkage. In the dropout layer, the dropout probability is set to 0.65 when the augmented data remains less than 16 times the original training data. This implies that 65% of the input units are randomly zeroed out during training. Subsequently, when the augmented data reaches 16 times the original training data, the dropout probability is raised to 0.75. By increasing the number of input units randomly zeroed out, the model is encouraged to rely on a wider array of features during training, thereby enhancing the regularization effect and mitigating overfitting.
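In PyTorch terms, this schedule amounts to switching the optimizer's weight_decay and the dropout probability once the augmented-to-original ratio reaches 16. The sketch below encodes the settings described above; the learning rate, input dimensionality, and model constructor are illustrative assumptions.

```python
import torch

def regularization_settings(aug_ratio):
    """Weight decay and dropout probability as described in the text.

    aug_ratio: amount of augmented data as a multiple of the original
    training data.
    """
    if aug_ratio < 16:
        return 1e-6, 0.65   # weaker L2, 65% of units zeroed out
    return 1e-2, 0.75       # stronger L2, 75% of units zeroed out

weight_decay, p_drop = regularization_settings(aug_ratio=16)
# GENConvNet from the earlier sketch; any nn.Module would do here.
model = GENConvNet(in_dim=218, p_drop=p_drop)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4,                  # illustrative value
                             weight_decay=weight_decay)
```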
Figure 2C,D demonstrate that this strong regularization mechanism effectively eliminated the upward trend observed in the testing loss curve in Figure 2A, while maintaining the consistent improvement in predictive accuracy (Figure 2B,D).
It is noteworthy that these regularization techniques were not merely employed as routine practices but were strategically integrated and implemented to enhance model robustness and generalization capabilities. Importantly, our adaptation of these techniques in SynerGNet includes nuanced adjustments, such as dynamically tuning the L2 regularization strength and dropout probability based on the scale of augmented data, which significantly contributed to maintaining model performance and preventing overfitting throughout training. This rigorous regularization approach is integral to our methodology and underscores its role in achieving superior predictive accuracy and generalization, as demonstrated in our experimental results.
3.6. Gradual Integration of Augmented Data
Rather than utilizing all 2,315,325 augmented synergy instances at once, we explored the impact of gradually integrating augmented data during the training phase on the model's predictive performance. We sought to understand whether the incremental addition of augmented data consistently improves the model's generalization ability. In these benchmarks, a five-fold cross-validation was employed. The original synergy dataset was first partitioned into five equal subsets while maintaining the same synergistic/antagonistic ratio as the entire original dataset. Next, for each cross-validation cycle, four subsets were allocated for training, while the remaining subset served as the test set. To gradually incorporate augmented data during the training phase, we first included an amount of augmented data equivalent to the original dataset. We then conducted five additional rounds of five-fold cross-validation, each time doubling the amount of augmented data in the training set. Throughout all experiments, the test set exclusively comprised original instances with experimental labels.
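A compact sketch of this doubling schedule, reusing the hypothetical `augmented_from` mapping introduced earlier (the split and sampling details are illustrative):

```python
import random
from sklearn.model_selection import StratifiedKFold

def gradual_augmentation(original, labels, augmented_from, seed=0):
    """Five-fold CV rounds with 1x, 2x, 4x, 8x, 16x, 32x augmented data.

    Each training fold combines the original subset with a sampled multiple of
    augmented instances derived only from that subset; test folds stay
    original-only, as in the text.
    """
    rng = random.Random(seed)
    for ratio in (1, 2, 4, 8, 16, 32):
        skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        for tr, te in skf.split(original, labels):
            pool = [a for i in tr for a in augmented_from[original[i]]]
            aug = rng.sample(pool, k=ratio * len(tr))
            train = [original[i] for i in tr] + aug
            test = [original[i] for i in te]
            yield ratio, train, test
```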
With the gradual integration of augmented data into the training set and an appropriate regularization strength, the model consistently exhibited enhanced performance. A detailed performance evaluation is reported in Table 4. To establish a benchmark for comparison, we first conducted a five-fold cross-validation exclusively on the original data. When trained exclusively on the original data, the model yields an AUC of 0.721, a BAC of 0.676, and a PPV of 0.863. The FPR is 0.380, and the MCC between the predicted and true labels is 0.313. We then added an amount of augmented data equal to the size of the original data for training in each fold of the five-fold cross-validation and tested the trained model on the original test set. The model showed improved performance across all metrics. Specifically, the AUC increased by 1.8%, the BAC by 1.6%, and the PPV by 0.9%, while the FPR decreased by 2.1% and the MCC increased by 3.1%.
Table 4 and Figure 3 show that with the gradual inclusion of more augmented data into the training set, the model consistently demonstrated improved performance. Figure 3A illustrates the increasing trend of the BAC as a greater amount of augmented data was incorporated into the training process. Figure 3B shows that the testing loss kept decreasing as the model was exposed to a greater variety of training instances, demonstrating that the inclusion of augmented data enhances the model's ability to generalize. With the quantity of augmented data at 32 times the size of the original training data, notable enhancements were observed across performance metrics: the AUC increased by 5.8%, the BAC by 5.5%, and the PPV by 3.0%, while the FPR decreased by 7.8% and the MCC between the predicted and true labels increased by 10%. It is important to note that, in each case, five distinct samples of augmented data were employed. Therefore, Table 4 and Figure 3 report the performance as the average ± standard deviation across these five samples. The use of multiple samples of augmented data ensures the robustness of the results, accurately representing the dataset and instilling statistical confidence in the findings.
4. Discussion
The development of SynerGNet for predicting drug synergy against cancer cell lines provides a compelling tool in the field of computational biology and pharmacology. By leveraging a combination of advanced ML techniques, comprehensive feature selection, and careful evaluation methodologies, we have demonstrated the potential for robust and accurate prediction of the synergistic effects of drug combinations. One of the key findings of our study is the effectiveness of data augmentation in addressing the challenge of limited data availability. The augmentation process, which generates additional instances by replacing drugs in drug pairs based on similarity scores, significantly expanded the dataset and improved the generalization ability of the model. The comparable performance observed between models trained on the original data and those trained on an equivalent amount of augmented data underscores the quality and dependability of the augmented dataset. Furthermore, our analysis of feature importance underscores the significance of comprehensive feature selection in enhancing prediction performance. By incorporating molecular and cellular features into our model, we were able to capture a diverse range of biological factors influencing drug synergy. The incremental addition of features led to improvements in model performance, indicating the importance of considering multiple dimensions of biological data in predictive modeling.
The detailed analysis of the interplay between testing loss and model performance underscores the nuanced dynamics within the model. While an increase in testing loss usually signals overfitting and reduced generalization, our findings indicate a more complex relationship. Throughout our investigation, we identified scenarios where model performance improved despite a rise in testing loss. This intriguing observation suggests that an elevation in testing loss does not necessarily warrant the termination of model training, especially if the model continues to learn more resilient and discriminative patterns. It underscores the importance of conducting in-depth analyses to elucidate the underlying mechanisms that shape model performance. Additionally, our study demonstrates the importance of strong regularization techniques in preventing overfitting and maintaining model generalization. By employing batch normalization, L2 regularization, and dropout layers, we were able to stabilize training, reduce overfitting, and improve model performance. The careful tuning of regularization parameters played a crucial role in ensuring the model's robustness and reliability.
Our investigation into the impact of incremental augmented data inclusion provides valuable insights into the scalability and effectiveness of our approach. By gradually increasing the amount of augmented data in the training set, we observed consistent improvements across various performance metrics. This highlights the potential for further enhancements through the continued integration of augmented data, as well as the importance of computational resources in supporting large-scale training efforts. The accuracy of SynerGNet can further be improved by integrating advanced graph convolutional layers, such as SuperGATConv [45], EGConv [46], and SSGConv [47], along with developing a more sophisticated model structure. This includes multi-scale feature integration, where different branches or layers focus on various scales of input data; residual connections to alleviate the vanishing gradient problem and facilitate deeper network training; and attention mechanisms to dynamically weigh the importance of protein interactions. These enhancements are designed to better process complex biological data, enabling more accurate and nuanced predictions, thus improving the overall performance of SynerGNet.
In conclusion, SynerGNet is a powerful tool for predicting drug synergy against cancer cell lines. Through a combination of data augmentation, comprehensive feature selection, model construction, and regularization techniques, we have demonstrated the potential for accurate and reliable prediction of drug interactions. Our findings contribute to the ongoing efforts to harness the power of computational approaches in drug discovery and personalized medicine, offering new opportunities for accelerating the development of effective cancer treatments.