1. Introduction
Multiple myeloma (MM) is characterized by the clonal proliferation of immunoglobulin-secreting malignant plasma cells in the bone marrow. Despite high-dose chemotherapy, autologous stem cell transplants, and novel agents, the 5-year survival for MM remains below 50%, and the median survival for those with stage III disease is less than 30 months. A hallmark of the disease is the subsequent development of drug-resistant phenotypes, which may be present initially or emerge during the course of treatment and which reflect the intra-tumour and inter-patient heterogeneity of this cancer [1]. Most MM cells are initially sensitive to proteasome inhibitors (PIs), which have become the standard of care in the treatment of newly diagnosed and relapsed MM. Nevertheless, resistance (intrinsic or acquired) develops to PIs and other forms of MM therapy.
Protein biomarkers associated with sensitivity/resistance to therapy are biological measures that likely reflect the specific phenotype of the patient’s disease. For these protein biomarkers to be used to direct patient care, they must have clinical utility with very high levels of evidence, demonstrating the ability to improve clinical decision making and patient outcomes. Biomarker research in MM continues to advance in many areas and is imperative for aiding the risk stratification of patients, examining tumour evolution during progression from indolent to aggressive disease, facilitating the commencement of therapy at the most treatable stage, selecting therapeutic agents, and predicting the effect of therapeutic intervention with respect to sensitivity or resistance to specific agents. Next-generation proteomics is highly sensitive, less costly, and requires reduced input material; thus, it will likely assist in clinical decision making in the near future. Ho et al. [2] reported that although proteomic technologies are not currently approved for clinical use in MM, an increasing number of studies show great promise. Sasser et al. [3] developed a serum biomarker panel that predicts the imminent risk of multiple myeloma progression from premalignancy, whereas Bai et al. [4] identified a four-peptide panel (dihydropyrimidinase-like 2, fibrinogen α-chain, platelet factor 4, and α-fetoprotein) that predicted MM with a sensitivity and specificity of 93.55% and 92.19%, respectively, using nano-liquid chromatography, electrospray ionization, and tandem mass spectrometry (nanoLC-ESI-MS/MS). Our group has also developed a novel panel of protein biomarkers to predict responses to Bortezomib-containing induction regimens in multiple myeloma patients. Three novel biomarkers (clusterin/CLU, angiogenin/ANG, and complement 1Q/C1Q) that were predictive of Bortezomib response were identified [5]. Finally, proteomics has also led to the identification of proteins that are altered when comparing the serum proteome from multiple myeloma patients with varying degrees of bone disease [6].
Previous studies from our group have generated proteomics data by performing mass spectrometry on MM patients’ plasma cells and grouped those data based on ex vivo drug sensitivity and resistance testing (DSRT) [7]. This Individualized Systems Medicine approach, developed at the Finnish Institute for Molecular Medicine (FIMM), includes ex vivo chemosensitivity testing against 308 anti-cancer drugs, including standard-of-care and investigational drugs, with the intent to guide treatment decisions for individual cancer patients. MM patients were stratified into four distinct subgroups as follows: highly sensitive (Group 1), sensitive (Group 2), resistant (Group 3), or highly resistant (Group 4) to the panel of drugs tested [8,9,10]. Combined with the proteomic analysis of the four groups of CD138+ plasma cells, a highly significant differential proteomic signature between the four chemosensitivity profiles was identified, thus opening the way to a theranostic approach to patient treatment.
However, the study of malignant plasma cell samples from MM patients presents several challenges in terms of defining sensitive versus resistant cohorts. For example, the current International Myeloma Working Group (IMWG) criteria for assessing the response to treatment are broad and overlap between different groups. Without a clear, objective delineation between the sensitive and resistant groups, the comparative proteomic statistical analysis is weakened, making it difficult to clearly identify a resistant protein phenotype. Similarly, the use of triplet combinations as a standard of care makes it difficult to identify resistance to individual drugs [7].
Due to the significant developments in data science, along with open access to medical data, machine learning (ML) has enabled the research community to investigate new medical approaches in hematology, and particularly in MM, spanning diagnosis, prognosis, therapeutic intervention, and more [11,12,13,14,15,16]. For example, Povoa et al. [14] presented a multi-learning training approach that combines supervised, unsupervised, and self-supervised learning algorithms to examine the predictive value of heterogeneous treatment outcomes for MM using genes as biomarkers. Guerrero et al. [15] used ML to predict undetectable measurable residual disease as a surrogate of prolonged survival. Ren et al. [16] also worked on survival prognosis, associating genes with overall survival as well as relationships between gene signature expression and common drugs used for MM.
In this paper, we present a novel methodology that explores whether proteomics can be reliably used to build ML models to infer the drug sensitivity of a patient, as outlined in Figure 1. As a first step, ML models are employed for data exploration and then for patient stratification using the patients’ proteomic profiles to identify the most accurate sensitivity groupings. This stratification confirms that proteomic features can be used to train predictors related to drug sensitivity. We then build ML models that infer, from a patient’s proteomic signature, whether that patient is sensitive or resistant to each drug in a list of drugs. Due to the small size of the patient cohort and the imbalance of the sensitive vs. resistant strata, data resampling techniques were applied, and the ML models were verified using multi-fold cross-validation.
3. Discussion
MM exhibits significant variability in both biological and clinical settings, with a notable absence of a universally applicable stratification tool for personalized treatment [7,32]. This work investigated the proteomic makeup of MM patients as well as the potential to suggest personalized treatment based on those proteomic signatures. We show that proteomic data can aid in the grouping of MM patients into different chemosensitivity groups. We confirmed that using ML techniques to identify patient proteomic profile groups is more effective than using hand-picked biomarkers. The ML-based proteomic clustering resulted in meaningful groups that aligned well with the DSS, demonstrating that proteomic profiles can be used to infer the DSS groups. Among the vast number of proteins, strong cross-correlations (and co-linearities) were observed, allowing an effective dimensionality reduction by selecting a number of top-ranking proteins based on an ANOVA ranking.
One of the most significant contributions of this work is that the proposed framework has demonstrated the potential of building ML models for personalized treatment based on the proteomic profile of a patient. This is a novel finding that confirms Tierney et al.’s [7] suggestion to build predictive models that use proteomics as biomarkers. However, we note that the clinical significance of the output biomarkers should be thoroughly investigated before they reach the clinical setting, with studies examining whether these biomarkers are involved in the pathophysiology of the disease and whether drugs are available for the output targets. Moreover, it should be evaluated whether a novel biomarker provides enough useful information to justify measuring it in the context of clinical care. Nevertheless, the addition of a non-specific biomarker to a biomarker panel can add to its predictive value, as reported by Lourenco et al. [33]. Furthermore, although built on a different set of biomarkers (mostly genes) and trained on a different and larger dataset, the ML-based personalized treatment framework proposed by Povoa et al. [14] did not achieve accuracy, precision, and sensitivity scores as high as our proposed model (our accuracy was, on average, 81%, while Povoa et al. [14] achieved up to 65%).
Our study provides important insights into using ML approaches for MM biomarker development; nevertheless, there are certain limitations, including the small patient cohort size and the data imbalance with regard to the different DSSs. We explored different balancing techniques that helped build better models. The best-performing ML model was the SVM combined with SMOTE as a class-balancing technique. This is probably due to the way SMOTE creates new minority samples, training the model in a more balanced way and avoiding overfitting to the dominant class. The limitations discussed in this work emphasise the need for open datasets, especially in rare diseases where data collection is cumbersome. Open datasets will enable clinicians and data scientists to confirm the findings around patient profile groupings and further improve the performance of DSS inference models. In addition, while proteomics-based studies undoubtedly contribute to our knowledge of MM, the challenge moving forward is to not become lost in this multitude of information and to focus on actionable knowledge (Ho et al. [2]) to improve the outcome for our MM patients.
4. Materials and Methods
4.1. Data Collection
The proteomic dataset was previously published in [7]. The dataset is fully anonymised and, therefore, GDPR-compliant. It consists of 39 patients, with proteomic levels for 2573 proteins per patient. Moreover, for each patient, sensitivity scores are provided for 307 drugs. In this work, only Bortezomib, Lenalidomide, Navitoclax, Pomalidomide, Quisinostat, and Venetoclax were studied, as they are the drugs most frequently included in the recommended chemotherapeutic regimens.
4.2. Patient Groups and Stratification
For patient group identification, k-means [26,27] was used, as it groups data points based on the similarity of the proteins. k-means works by randomly defining k centroid points and categorising data points according to the centroid closest to them. These centroids are then updated iteratively to optimise the clusters. k-means was tested using the top 10, 37, 50, and 100 proteins as features. In addition, the number of clusters was explored: clusterings with 2, 3, 4, 5, and 6 clusters were tested using the different numbers of features for the new and original protein rankings. Dimensionality reduction methods, such as Principal Component Analysis (PCA) [34,35], were employed to further reduce the number of features (i.e., proteins).
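As an illustration of this step, the sketch below clusters a placeholder patient-by-protein matrix with k-means after an optional PCA reduction; the data, the component count, and the cluster count are stand-ins, not the study's actual values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder matrix: 39 patients x top-100 ranked proteins (random stand-in data)
X = rng.normal(size=(39, 100))

# Optional PCA step to further reduce the feature space before clustering
X_red = PCA(n_components=10, random_state=0).fit_transform(X)

# k-means with k=4 clusters, mirroring the four chemosensitivity groups
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_red)
```

The same loop can be repeated over different feature counts and cluster numbers, scoring each configuration with the cluster-validity metrics described next.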
4.2.1. Metrics for Evaluation of the Strata
The formed clusters were evaluated using the Silhouette (SS) [28], Davies–Bouldin (DB) [29], and Calinski–Harabasz (CH) [30] scores. The silhouette coefficient for a single point can be defined as

s = (b − a) / max(a, b),

where a is the mean distance between the sample and all other points in the same class and b is the mean distance between a sample and all other points in the next-nearest class [36]. The silhouette score is bound between −1 and 1, where the higher-scoring clusters are dense and well separated, and the lower-scoring clusters indicate incorrect clustering. A score around zero indicates an overlap in the clusters.
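The definition above can be checked numerically; the toy sketch below computes s for one point of a small, hand-made two-cluster dataset.

```python
import numpy as np

# Two tight, well-separated clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_point(i, X, labels):
    """s(i) = (b - a) / max(a, b), with a and b as defined above."""
    same = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = np.linalg.norm(X[same] - X[i], axis=1).mean()         # mean intra-cluster distance
    b = min(np.linalg.norm(X[labels == c] - X[i], axis=1).mean()
            for c in set(labels) - {labels[i]})               # nearest other cluster
    return (b - a) / max(a, b)

s0 = silhouette_point(0, X, labels)   # close to 1: dense, well-separated clusters
```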
The Davies–Bouldin score compares the average distance between clusters with the size of the clusters themselves and is defined as:

DB = (1/k) Σ_{i=1..k} max_{i≠j} (s_i + s_j) / d_ij,

where s_i is the average distance between each point x of cluster i and the centroid of that cluster, and d_ij is the distance between the centroids of clusters i and j [36]. The DB index has a lower bound of zero; values closer to zero indicate a better separation between clusters.
The CH score expresses the ratio of the sum of between-cluster dispersion to the sum of within-cluster dispersion for all clusters, where dispersion is the sum of squared distances, as below:

CH = [tr(B_k) / tr(W_k)] × [(n − k) / (k − 1)],

where B_k is the between-group dispersion matrix, W_k is the within-group dispersion matrix, n is the size of the dataset, and k is the number of clusters [36]. The dispersion matrices are given by

W_k = Σ_{q=1..k} Σ_{x∈C_q} (x − c_q)(x − c_q)^T,

where x is a point in cluster q and c_q is the centre of cluster q, and

B_k = Σ_{q=1..k} n_q (c_q − c)(c_q − c)^T,

where c is the centre of the dataset and n_q is the number of points in cluster q. The CH score is not bound. High CH values indicate dense and well-separated clusters, while lower values indicate incorrect clustering.
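All three scores are available off the shelf; the sketch below computes them on synthetic stand-in data (not the study's proteomic profiles) for a two-cluster k-means result.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters as stand-ins for patient profiles
X = np.vstack([rng.normal(0, 0.5, (20, 10)), rng.normal(5, 0.5, (19, 10))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

ss = silhouette_score(X, labels)          # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)      # >= 0, lower is better
ch = calinski_harabasz_score(X, labels)   # unbounded, higher is better
```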
4.3. Drug Sensitivity Prediction
The following types of classification models were explored for the drug sensitivity classifiers: LR, SVMs with various kernels, RFs with different split parameters, ensemble trees, Gaussian processes, and feed-forward neural networks. As explained earlier, LR, SVMs, and RFs exhibited the best performance. An introduction to these models is included in the following paragraphs.
For all models, the dataset was split into training, testing, and validation sets. The validation set contained three patients and was left untouched, while the training and testing sets contained 23 and 11 patients, respectively, in an approximately 70:30 split. The dataset contained the proteomic profile of each patient and the level of sensitivity to each drug. The sensitivity level was determined based on the patient’s DSS: a patient was classified as responsive, or sensitive, to a drug if their DSS exceeded the defined threshold for that chemotherapeutic (label of “1”). Any other DSS implied that the patient was resistant (label of “0”).
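A sketch of this split and labelling step is below; the data are random stand-ins, and the DSS cut-off shown is a hypothetical placeholder (the study's threshold is not restated here).

```python
import numpy as np
from sklearn.model_selection import train_test_split

DSS_THRESHOLD = 10.0                       # hypothetical placeholder cut-off

rng = np.random.default_rng(0)
X = rng.normal(size=(39, 2573))            # 39 patients x 2573 protein levels
dss = rng.uniform(0, 30, size=39)          # stand-in DSS values for one drug
y = (dss > DSS_THRESHOLD).astype(int)      # 1 = sensitive, 0 = resistant

# Hold out a 3-patient validation set, then split the remainder ~70:30
X_rem, X_val, y_rem, y_val = train_test_split(X, y, test_size=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_rem, y_rem, test_size=0.3, random_state=0)
```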
4.3.1. LR
Linear regression is a basic and commonly used type of predictive analysis. It is a statistical method used to predict the value of a dependent variable based on its relationship to one or more independent variables [37]. The goal of linear regression is to find the line of best fit that minimizes the distance between predicted and actual observation values. This line of best fit is characterized by a linear regression equation that includes estimated model parameters. These parameters are estimated using a least squares approach that minimizes the sum of squared residuals between predicted and actual values. The key assumptions of linear regression models include linearity, normality, homoscedasticity, and a lack of multicollinearity [38]. Its advantages include ease of interpretation, computational efficiency, and the ability to quantify the strength of relationships. Linear regression is used for predictive modeling and forecasting across many fields, including finance, the natural sciences, and the social sciences.
For LR, the C value, the number of selected features, and the polynomial degree of the features were fine-tuned. The range of evaluated C values was {0.001, 0.01, 0.1, 1, 10}. The range of features tested was 10 to 35, with a step size of five. The polynomial degrees explored ranged from one to four. From this exploration, the model with a C value of one, 25 features, and a polynomial degree of one was chosen.
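Assuming the LR classifier corresponds to a scikit-learn-style logistic regression (where C is the inverse regularization strength) preceded by feature selection and polynomial expansion, the tuning loop could be sketched as follows; the data are random placeholders and the grid is truncated for brevity:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(39, 100))             # placeholder proteomic features
y = rng.integers(0, 2, size=39)            # placeholder sensitivity labels

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),                # pick the top-k proteins
    ("poly", PolynomialFeatures(include_bias=False)),  # optional polynomial terms
    ("clf", LogisticRegression(max_iter=1000)),        # C = inverse regularization
])
grid = GridSearchCV(pipe, {
    "select__k": [10, 25],
    "poly__degree": [1, 2],
    "clf__C": [0.1, 1.0],
}, cv=3).fit(X, y)
```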
4.3.2. SVMs
SVMs are a type of supervised machine learning algorithm commonly used for classification and regression tasks [39]. The goal of an SVM is to find the optimal decision boundary or hyperplane that maximizes the margin between different classes in the training data. The data points that define this hyperplane are called support vectors [40]. SVMs utilize a kernel trick to project data into higher dimensions, making them more separable. Some commonly used kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid. As a discriminative classifier, SVMs draw boundaries between classes rather than modeling class probabilities generatively [41]. The key advantages of SVMs include good generalization ability, handling high-dimensional data well, and flexibility in modeling diverse data sources. Applications where SVMs excel include natural language processing, image recognition, and bioinformatics, due to their aptitude for working with small sample sizes and high-dimensional data.
Similarly to LR, for the SVM model, the C value, the number of features, and the kernel algorithm were calibrated. The C values evaluated were {0.001, 0.01, 0.1, 1, 5, 25, 50, 75, 100}. The range of features examined was 10 to 35, with a step size of five. The kernel functions explored were linear, radial basis function, polynomial, and sigmoid. The radial basis function kernel with a C value of 50 using 25 features performed best for this model.
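A comparable sketch for the best-reported SVM configuration (RBF kernel, C = 50, 25 selected features), again on placeholder data and assuming a scikit-learn-style implementation:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(39, 100))             # placeholder proteomic features
y = rng.integers(0, 2, size=39)            # placeholder sensitivity labels

# Best-performing configuration reported above: RBF kernel, C = 50, 25 features
model = Pipeline([
    ("select", SelectKBest(f_classif, k=25)),
    ("clf", SVC(kernel="rbf", C=50)),
]).fit(X, y)
pred = model.predict(X)
```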
4.3.3. RFs
RFs are an ensemble supervised machine learning technique used for both classification and regression tasks. They operate by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes, or the mean prediction, of the individual decision trees [42]. The training algorithm for RFs introduces randomness when building each decision tree so that the trees are de-correlated from each other. This is achieved by selecting a random subset of features to consider for splits at each node and by using bootstrapping (bagging) to sample the training data. The key advantages of RFs include robustness to overfitting, the ability to handle many input variables without variable deletion, and effectiveness in estimating missing data while maintaining accuracy [43]. RFs have widespread applications in the fields of bioinformatics, finance, healthcare, marketing, and more.
Lastly, for RFs, the maximum number of features considered when looking for the best split in each tree (the split parameter) and the number of input features used were investigated. The range of input features considered was 10 to 35, with a step size of five; the split parameter was explored from 1 to 30. A split parameter of five with ten input features produced the best results.
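Mapping the split parameter onto scikit-learn's max_features is an assumption about the implementation; under that assumption, the best-reported RF configuration could be sketched on placeholder data as:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(39, 100))             # placeholder proteomic features
y = rng.integers(0, 2, size=39)            # placeholder sensitivity labels

# Best-reported configuration: split parameter of 5 with 10 input features
model = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("clf", RandomForestClassifier(max_features=5, n_estimators=100,
                                   random_state=0)),
]).fit(X, y)
pred = model.predict(X)
```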
4.4. Metrics for Evaluation of Accuracy of Prediction
The classification models were compared across several metrics: area under the curve (AUC), accuracy, precision, recall (also called sensitivity in bioengineering), and F1 score [44]. The AUC is the area under the Receiver Operating Characteristic (ROC) curve, which plots the Recall/True Positive Rate against the False Positive Rate (FPR) at different probability thresholds. Recall/sensitivity is the probability that a positive example is predicted to be positive:

Recall = TP / (TP + FN).

Sensitivity is often seen alongside specificity, which is the proportion of true negatives correctly identified:

Specificity = TN / (TN + FP).

FPR is the proportion of negative values that are incorrectly labeled as positive:

FPR = FP / (FP + TN).

Accuracy is a measure of the probability that a prediction is correct and is given as:

Accuracy = (TP + TN) / (TP + TN + FP + FN).

Precision is the probability that a positive prediction belongs in the positive class:

Precision = TP / (TP + FP).

The F1 score is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall),

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
When accuracy, precision, recall (sensitivity), and F1 score were calculated, each metric’s macro-average was found. The macro-average is the average of a metric computed separately for each class. All reported values for these metrics refer to the macro-averages.
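The formulas above, with macro-averaging over the two classes, can be reproduced with standard library calls; the confusion counts in the toy example below are chosen by hand:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]    # TP=2, FN=1, TN=2, FP=1

acc  = accuracy_score(y_true, y_pred)                    # (TP+TN)/total = 4/6
prec = precision_score(y_true, y_pred, average="macro")  # mean over both classes
rec  = recall_score(y_true, y_pred, average="macro")
f1   = f1_score(y_true, y_pred, average="macro")
```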
4.5. Methodology for the Imbalanced Dataset Problem
As seen in Section 2.4, the imbalance in the dataset makes it difficult to create a model that can correctly identify patients in the minority sensitivity group for a drug. To counteract the data imbalance, several data-balancing techniques were examined, falling into the groups of penalisation, over-sampling, under-sampling, and boosting [31]. After applying and testing these different techniques, the most promising models were LR+OSS, SVM+SMOTE, and RF+CNN. Penalisation improved the results for all models, but not as much as the aforementioned combinations. Within the category of boosting algorithms, AdaBoost [45] was tested alongside the better-performing models; however, no improvement was observed.
4.5.1. Over-Sampling
Over-sampling is a balancing technique that increases the number of samples in the minority class. This can be achieved by replicating data in that class; however, replication was considered unsuitable for this project due to the heterogeneity of the disease [46]. Over-sampling can also be carried out by synthetically creating data. SMOTE [47] is a technique that is often applied to biomedical data, e.g., [48,49]. SMOTE creates data by selecting the k nearest neighbours of a data point and interpolating between the data point and its neighbours. SMOTE, therefore, generates a data point similar to the original but not identical to it [46,47]. SMOTE was applied to the training data for LR, SVM, and RF to over-sample the dataset. Implementing SMOTE increased accuracy, precision, recall, and F1 for the SVM, making this the best-performing model.
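A minimal sketch of the SMOTE idea (neighbour selection plus interpolation) is shown below; the study presumably used an off-the-shelf implementation, so this is illustrative only, with smote_sample a hypothetical helper name and the data random stand-ins:

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                    # random minority sample
        j = nn[i, rng.integers(k)]             # one of its k neighbours
        gap = rng.random()                     # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(8, 5))   # stand-in minority samples
S = smote_sample(X_min, n_new=10, rng=0)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stay inside the minority class's region of the feature space.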
4.5.2. Under-Sampling
Under-sampling is a technique that reduces the number of samples taken from the majority class. This can be achieved by randomly selecting a subset of the majority-class data [46]. However, random under-sampling like this can lead to information loss; therefore, more complex methods were used, such as Near Miss [50,51], Condensed Nearest Neighbour (CNN) [52], One-Sided Selection (OSS) [53], All KNN [54], and Instance Hardness Threshold (IHT) [53]. CNN and OSS proved better than the rest for our dataset and the employed ML models.
The CNN algorithm takes all minority samples in the training set and adds one sample from the majority class to this set C; all other samples are kept in set S. The algorithm then iterates through set S and classifies each sample using 1-NN against C. If a sample is incorrectly classified, it is added to C; otherwise, it is discarded. This process is reiterated until S is empty. The set C is then used as the training data [52,53]. We noted that using the CNN did not improve the performance of the SVM and LR. Nevertheless, the performance of the RF significantly improved, giving, on average, an accuracy of 0.81, a precision of 0.81, a recall of 0.78, and an F1 score of 0.79.
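The steps above can be sketched as a single-pass variant of the CNN rule (the full algorithm reiterates until the retained set stabilises); condensed_nn is a hypothetical helper name and the data are hand-made stand-ins:

```python
import numpy as np

def condensed_nn(X, y, minority=1, seed=0):
    """Single-pass CNN under-sampling: C starts with all minority samples plus
    one random majority sample; majority samples move from S to C only when a
    1-NN classifier on C mislabels them."""
    rng = np.random.default_rng(seed)
    keep = list(np.flatnonzero(y == minority))     # C: all minority samples
    majority = np.flatnonzero(y != minority)
    rng.shuffle(majority)
    keep.append(majority[0])                       # plus one random majority sample
    for i in majority[1:]:                         # iterate through S
        C = np.array(keep)
        nearest = C[np.argmin(np.linalg.norm(X[C] - X[i], axis=1))]
        if y[nearest] != y[i]:                     # 1-NN misclassifies -> add to C
            keep.append(i)
    return np.array(sorted(keep))

X = np.vstack([
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]],          # minority cluster near the origin
    [[10.0 + 0.1 * i, 10.0] for i in range(10)],   # tight majority cluster
])
y = np.array([1, 1, 1] + [0] * 10)
sel = condensed_nn(X, y)   # retained training indices
```

On this toy data the whole tight majority cluster is represented by its single seed sample, so the retained set shrinks to the three minority points plus one majority point.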
OSS works in a similar way to the CNN but removes noisy samples from the dataset instead of adding them to the retained set [53]. OSS improved the performance of all models when compared to the original performance and worked best for LR.
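The noise-removal step that distinguishes OSS can be illustrated with Tomek links (mutual nearest-neighbour pairs of opposite classes), from which the majority member is dropped; this is a simplified sketch of that step, not the exact OSS procedure used in the study:

```python
import numpy as np

def tomek_majority_removal(X, y, minority=1):
    """Drop majority samples that form Tomek links: pairs of opposite-class
    points that are each other's nearest neighbour."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argmin(d, axis=1)                      # nearest neighbour of each point
    drop = set()
    for i in range(len(y)):
        j = nn[i]
        if nn[j] == i and y[i] != y[j]:            # mutual NN, opposite classes
            drop.add(j if y[i] == minority else i) # drop the majority member
    return np.array([i for i in range(len(y)) if i not in drop])

# One minority point, one noisy majority point beside it, and a clean majority pair
X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 0.0], [5.1, 0.0]])
y = np.array([1, 0, 0, 0])
keep = tomek_majority_removal(X, y)   # the noisy majority point is removed
```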
4.6. Software Implementations
The majority of the algorithms and statistical methods were implemented in Python 3; a few diagrams were produced in MATLAB. The code and anonymised data will be made available upon request from the corresponding author.