1. Introduction
According to the World Cancer Research Fund International, bladder cancer is the 10th most common cancer in the world [1]. It is diagnosed mostly in people over 55 in highly developed countries of southern and western Europe, as well as in North America. Men are more than four times more likely to develop bladder cancer than women. The most commonly mentioned risk factors for urinary bladder cancer, other than being male, are smoking cigarettes, exposure to certain chemicals (such as aromatic amines, polycyclic aromatic hydrocarbons, chlorinated hydrocarbons, and alcohol), having a red meat-rich diet, and being genetically predisposed (reviewed in [1]). Urothelial carcinoma of the bladder is divided into two major groups on the basis of clinical staging, with different clinical outcomes and therapy options: non-muscle-invasive bladder cancer (NMIBC) and muscle-invasive bladder cancer (MIBC). MIBCs are aggressive tumors, characterized by a five-year survival rate of less than 50% [2]. Up to 15% of MIBCs are initially diagnosed as NMIBCs that later progress into MIBCs [3]. NMIBC is considered a tumor with a relatively good prognosis since the five-year overall survival rate is about 90% [4]. Unfortunately, NMIBCs are a very heterogeneous tumor group with a high rate of recurrence (up to 70%) and risk of progression to MIBC (up to 20%), despite significant improvement in the efficacy of adjuvant therapies (reviewed in [5,6,7]). Carcinoma in situ (CIS) belongs to this group and can be diagnosed as a primary or a recurrent tumor. CIS is associated with a poorer prognosis, a higher grade, as well as an elevated risk of recurrence and progression to MIBC [8]. The recurrence rate for CIS is 63–92%, and the rate of progression to MIBC is 50–75%, even when immunotherapy is applied [9,10]. The current treatment of CIS includes Bacille Calmette–Guérin (BCG) intravesical therapy, but up to 40% of NMIBC patients do not respond to this treatment. In these patients, one of the second-line treatments is cystectomy [11]. However, cystectomy causes side effects, especially in elderly patients. Recent studies identified some predictors of complications, with the frailty index score among them [12,13]. Concomitant CIS is also related to a higher recurrence risk and mortality rate [14]. Thus, there is a need to develop accurate methods for the prediction of recurrence and progression in NMIBC, including CIS. Recently, molecular markers predicting the progression of NMIBC have been identified [15]. However, their testing is based on the evaluation of methylation (GATA2 and TBX3) and mutation status (FGFR3); thus, their usefulness in routine practice is rather limited due to the cost and labor of the tests [16]. Moreover, there are no specific markers for the development of CIS in disease course (CIS-DC). Thus, more exact and accessible models should be developed, and new markers of CIS-DC should be identified.
The goal of the current study is to propose a small, clinically useful set of biomarkers that can be utilized for the stratification of bladder cancer patients into high- and low-risk classes, with respect to the development of CIS in disease course. The study is based on the dataset E-MTAB-4321, first described in [17] and deposited in the ArrayExpress database [18]. The dataset consists mostly of patients with Ta and T1 tumor stages. In the original analysis, the authors applied unsupervised learning to stratify patients into three groups using 119 genetic markers, showing that these three groups differ significantly in the risk of progressing to stage T2+. The original classification was extended in subsequent works by various authors [19,20,21].
The approach proposed in the current study is based on a robust protocol utilizing multiple supervised and unsupervised machine learning methods, including extensive use of cross validation and resampling.
2. Materials and Methods
Dataset
The E-MTAB-4321 dataset used in the study contains clinical and RNA-seq data from 476 patients with early-stage urothelial carcinoma, of whom 74 developed CIS at some point in the disease course, whereas 402 were free of CIS during the study period. There are 43,204 genetic markers in this dataset, of which 4800 have zero variance, leaving 38,404 markers that actually carry information. A summary of the dataset characteristics is presented in Table 1 and in Appendix A. For details on data collection, please refer to the original paper by Hedegaard et al. [17].
The analytical protocol is based on supervised feature selection (FS) and supervised classification. In our analysis, we focus on finding markers for predicting the appearance of CIS in disease course.
The following base feature-selection protocol is used. We first identify all informative variables and, therefore, reduce the dimensionality of the problem. Then, we further decrease the dimensionality by clustering similar variables. Finally, we use clusters’ representatives to build machine learning models for the prediction of CIS-DC. Each step is described in detail in the following paragraphs.
In the first step, the variables that carry information about future development of CIS in disease course are identified. To this end, we use the multidimensional feature selection (MDFS) filter, which is based on the information entropy and is available as a library in R [22,23]. The informative variables are identified by computing information entropy conditioned on the knowledge of the descriptive variables and comparing it with the null distribution of information entropy conditioned on the non-informative variables. This metric is called information gain (IG). In this case, we use single-dimensional analysis, which computes maximum IG over multiple (30) random discretizations of continuous variables. The relevance is determined by a p-value threshold of 0.05 after applying Holm’s correction [24].
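The two ingredients of this step, information gain and Holm's correction, can be illustrated with a minimal Python sketch. It is not the MDFS implementation: it computes IG for a single, already discretized variable, whereas MDFS takes the maximum over several random discretizations and derives p-values from a null distribution. The function names and the toy inputs are assumptions made for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(x_bins, y):
    """IG = H(Y) - H(Y | X) for one already discretized variable X."""
    n = len(y)
    groups = {}
    for xb, yi in zip(x_bins, y):
        groups.setdefault(xb, []).append(yi)
    h_conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(y) - h_conditional

def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        value = min(1.0, (m - rank) * pvalues[i])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, value)
        adjusted[i] = running_max
    return adjusted

# Hypothetical toy data: a perfectly informative and an uninformative variable.
y = [0, 0, 1, 1]
ig_informative = information_gain([0, 0, 1, 1], y)
ig_uninformative = information_gain([0, 1, 0, 1], y)

# Hypothetical raw p-values for three variables, tested at the 0.05 level.
adjusted = holm_adjust([0.01, 0.04, 0.03])
relevant = [i for i, p in enumerate(adjusted) if p < 0.05]
```

With these toy p-values only the first variable remains below 0.05 after correction, although all three raw values were below that threshold.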
Unlike minimal-optimal approaches to feature selection, all-relevant feature selection does not aim to produce the best set of features for model building. On the contrary, the goal is to preserve the information about all relevant variables so that they and their structure can be studied at will. However, this makes model building more complex and makes it harder for tools to discover such structures. To counter this, we use feature clustering to group similar features together.
Similarity is a broad concept in clustering and in data analysis in general. For our purpose, we use a correlation coefficient as the similarity measure; precisely, we use Pearson’s product–moment correlation coefficient. However, to apply clustering algorithms, we need a function that can be used as a proper metric, that is, one that monotonically describes the similar–dissimilar relation and outputs the penalty associated with dissimilarity. Thus, we apply the following transformation to obtain the function d:
which satisfies the properties of a proper metric and describes dissimilarity as a penalty due to lack of correlation. The function d is called the dissimilarity function.
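As an illustration of a correlation-based dissimilarity, the following Python sketch computes Pearson's coefficient r and converts it into a penalty. The exact transformation used in the study is not reproduced in this text, so the form d = 1 − |r| is an assumption chosen only because it is a common option; it may differ from the authors' definition, and its metric properties are not verified here.

```python
import math

def pearson(x, y):
    """Pearson's product-moment correlation coefficient of two sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a zero-variance variable")
    return sxy / math.sqrt(sxx * syy)

def dissimilarity(x, y):
    """ASSUMED form d = 1 - |r|: 0 for perfectly (anti)correlated variables,
    1 for uncorrelated ones. Not necessarily the transformation of the paper."""
    return 1.0 - abs(pearson(x, y))

# Hypothetical expression profiles of three features over four samples.
feature_a = [1.0, 2.0, 3.0, 4.0]
feature_b = [8.0, 6.0, 4.0, 2.0]    # perfectly anti-correlated with feature_a
feature_c = [1.0, -1.0, -1.0, 1.0]  # uncorrelated with feature_a

d_ab = dissimilarity(feature_a, feature_b)
d_ac = dissimilarity(feature_a, feature_c)
```

Under this assumed form, strongly anti-correlated features are treated as similar, which suits the purpose of grouping features that carry redundant information.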
We choose hierarchical clustering as our clustering algorithm due to its property of revealing the internal clustering structure. As a method of hierarchical clustering, we evaluated Ward’s minimum variance method as well as the complete linkage method. Note that we applied clustering only to features, not to objects or to both objects and features, unlike the way clustering and biclustering algorithms are usually used.
We evaluate two ways to choose the representatives to build the classification models. The first is the most commonly applied procedure of working directly with the ranking of features as they are available from the feature-selection method: choosing top-n features with the lowest p-values. Secondly, we evaluate the effect of hierarchical clustering to N clusters and then, analogously, use the ranking to choose one representative from each cluster, basically the top-1 representative from each group.
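The difference between the two ways of choosing representatives can be sketched in Python as follows. The sketch uses a plain complete-linkage agglomeration written from scratch rather than the R routines used in the study, and the dissimilarity matrix and p-values are hypothetical toy values.

```python
def complete_linkage_clusters(dist, n_clusters):
    """Agglomerative clustering of items 0..m-1 with complete linkage.

    dist is a symmetric m x m matrix (list of lists) of dissimilarities.
    Returns a list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance of two clusters is the largest
                # pairwise dissimilarity between their members.
                link = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

def pick_representatives(clusters, pvalues):
    """From each cluster keep the feature with the lowest p-value."""
    return sorted(min(c, key=lambda i: pvalues[i]) for c in clusters)

# Hypothetical toy input: features 0 and 1 are similar, as are 2 and 3.
dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
pvalues = [0.002, 0.001, 0.004, 0.030]

# Procedure 1: top-n features straight from the ranking.
top_n = sorted(range(len(pvalues)), key=lambda i: pvalues[i])[:2]

# Procedure 2: cut the hierarchy into N clusters, then take the top-1 of each.
clusters = complete_linkage_clusters(dist, n_clusters=2)
representatives = pick_representatives(clusters, pvalues)
```

In this toy case the top-2 ranking picks features 1 and 0, which belong to the same cluster, whereas the cluster-based procedure picks features 1 and 2 and so covers both groups.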
To evaluate the marker set, we used the Random Forest (RF) [25] implementation available in R’s randomForest package [26] as our target classifier. The default parameters were left unchanged. We used the area under the receiver operating characteristic (ROC) curve, also known as AUROC or simply AUC (area under the curve), to describe the performance of each classifier.
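For reference, the AUC can be computed directly from classifier scores as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The Python sketch below uses this pairwise definition; the function name and the toy scores are illustrative assumptions, and the study itself relies on R tooling.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical scores (e.g., predicted probability of CIS-DC) and true labels.
auc = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The quadratic loop is adequate for a few hundred patients; a rank-based formula would be preferable for much larger sets.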
While evaluating the stability and generality of the above base protocol, we developed an extended procedure that we present here. We propose the use of cross validation as part of the feature-selection protocol. The entire above-mentioned procedure was run in a stratified 5-fold cross validation with 30 repeats, the direct results of which are presented in Figure 1. Essentially, we obtain a new ranking from cross validation that allows us to apply the top-n procedure while using the number of repetitions in which a feature was selected as the quality metric. For an overview, see Figure 2.
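The ranking by selection count can be sketched in Python as below. The feature names and the per-iteration selections are hypothetical; in the protocol each inner list would hold the representatives selected in one of the cross-validation iterations.

```python
from collections import Counter

def rank_by_selection_count(selected_per_iteration):
    """Rank features by the number of iterations in which they were selected."""
    counts = Counter()
    for selected in selected_per_iteration:
        counts.update(set(selected))  # count each feature once per iteration
    # Most frequently selected first; ties broken by name for reproducibility.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical selections from four cross-validation iterations.
runs = [["g1", "g2"], ["g1", "g3"], ["g1", "g2"], ["g2", "g4"]]
ranking = rank_by_selection_count(runs)
top_2 = [name for name, _ in ranking[:2]]
```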
Furthermore, to estimate the mean and error of our evaluation metric (AUC), we have applied (independently) both external resampling and external cross validation. The resampling procedure consisted of 100 repeats of random sampling with replacement. The omitted objects, called out-of-bag (OOB) objects, were used for verification of the performance of the built models, i.e., for the calculation of the AUC. The cross-validation procedure, on the other hand, was conducted using a stratified 10-fold approach with 30 repeats (independent of the CV inside the procedure). The internal procedure was adjusted to use 10-fold CV as well, to gather enough objects for the MDFS statistic to work well.
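The resampling scheme can be sketched in Python as follows. Only the index bookkeeping is shown: model building and AUC computation on the OOB objects are omitted, and the function names and the random seed are assumptions made for illustration.

```python
import random

def bootstrap_split(n_objects, rng):
    """Draw n_objects indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_objects) for _ in range(n_objects)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_objects) if i not in chosen]
    return in_bag, out_of_bag

def mean_and_sd(values):
    """Sample mean and standard deviation, e.g., of per-repeat AUC values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, variance ** 0.5

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n_patients = 476        # cohort size reported for the dataset
splits = [bootstrap_split(n_patients, rng) for _ in range(100)]

# In the protocol, a model would be trained on in_bag and its AUC measured on
# out_of_bag; here only the OOB fraction is summarized as a placeholder.
oob_fractions = [len(oob) / n_patients for _, oob in splits]
mean_fraction, sd_fraction = mean_and_sd(oob_fractions)
```

On average roughly 37% of objects are left out of each bootstrap sample, so each model is verified on a sizeable set of patients not used to build it.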
Assuming validation with resampling, the full analytical protocol is, thus, as follows (with an overview in Figure 3):
1. repeat 30 times: split data randomly in 5 equal bins (i.e., run 30 repetitions of 5-fold CV) and for each (i-th) bin:
   (a) set aside the i-th bin as the test set and create a training set from the 4 remaining bins;
   (b) identify informative variables in the training set;
   (c) cluster informative variables using the hierarchical approach and select representatives of each cluster on each clustering level between 2 and 15, utilizing the usual procedure of choosing the most informative one;
2. find cluster representatives that appear most often in the above 150 iterations (30 times 5 iterations), at each level of clustering between 2 and 15;
3. use those representatives for building the final model on the entire dataset:
   (a) estimate the confidence intervals of the final models at each number of representatives (between 2 and 15) using the bootstrap approach; repeat 100 times:
       (i) draw with replacement N patients from the original data, build RF models using the 2 to 15 representative variables;
       (ii) measure the performance of each model using OOB objects;
   (b) compute the aggregate performance of each model;
   (c) use the results of the above procedure to propose the relevant markers.
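The splitting in step 1 of the protocol above can be sketched in Python as follows. The stratification keeps the proportion of CIS-DC cases similar in every bin; the function names and the seed are assumptions, and the feature-selection and clustering steps performed on each training set are omitted.

```python
import random

def stratified_folds(labels, k, rng):
    """Assign each object to one of k folds, keeping class proportions similar."""
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin into the folds.
        for position, i in enumerate(members):
            fold_of[i] = position % k
    return fold_of

def repeated_cv(labels, k=5, repeats=30, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        fold_of = stratified_folds(labels, k, rng)
        for fold in range(k):
            test = [i for i, f in enumerate(fold_of) if f == fold]
            train = [i for i, f in enumerate(fold_of) if f != fold]
            yield train, test

# Class sizes as reported for the dataset: 74 CIS-DC cases, 402 CIS-free patients.
labels = [1] * 74 + [0] * 402
splits = list(repeated_cv(labels))  # 30 repeats x 5 folds = 150 iterations
```

Because 74 and 402 are not divisible by 5, the bins are only approximately equal in this sketch: each test bin holds 14 or 15 CIS-DC cases.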
Apart from the above selection of methods, we have verified the final marker set using naive Bayes [27] and logistic regression classifiers, estimating the diagnostic metrics achievable with such simpler classifiers. The details of the naive Bayes classifier are presented in Appendix A and Appendix B, as it serves as an example of a simple classifier that is useful for diagnostic personnel.
4. Discussion
Bladder cancer is one of the most common cancers in the world [1]; thus, there is a need to develop sensitive methods for the early diagnosis of non-advanced lesions or for the identification of predictors of poor prognosis. Currently, there is a limited number of commercially available tests for bladder cancer diagnosis. The NMP22BC test allows for the diagnosis of non-muscle-invasive bladder cancer and low-grade bladder cancer in urine samples [28]. Recently published data show that HPLC (high-performance liquid chromatography) of urine could distinguish bladder cancer patients from non-malignant hematuria patients based on chromatographic absorption and fluorescence peaks [29]. Similarly, fluorescence urine analysis using concentration matrices of synchronous spectra could be useful in bladder cancer diagnosis, allowing cancer patients to be distinguished from hematuria patients [30]. A new diagnostic strategy could include a label-free optical sensing platform based on DNA strand displacement, although there are currently no data on bladder cancer detection using this method [31]. Metabolomic analysis is a very promising and useful tool for the identification of biomarkers; it allows for analyses of urine, blood, and tissue samples. The results enable distinguishing between MIBC and NMIBC patients [32]. The aforementioned techniques are aimed at the sensitive and early detection of urinary bladder cancer or at discrimination between MIBC and NMIBC. However, markers allowing for the identification of the risk of CIS development have still not been identified.
CIS of the urinary bladder represents tumors with a high risk of progression to MIBC and metastatic disease [8]. Some data indicate that primary CIS is diagnosed in about 1–3% of newly diagnosed bladder cancers, but some papers report primary CIS in about 20% of cases [33,34]. Secondary CIS (detected during follow-up) is diagnosed in about 20% of NMIBC cases [33,35]. Our method allowed for the identification of seven markers related to the risk of CIS-DC in urinary bladder cancers. Some of these markers are well-known molecules involved in cancer biology, but some of them are quite unique, with very limited information on their involvement in cancer development and their relationship with tumors.
We identified two markers for which only limited information is available: DPY19L3-DT (DPY19L3 Divergent Transcript, ENSG00000267213) and E9PMD0 (ENSG00000258472). DPY19L3-DT belongs to the lncRNA class, but there is no information on the function of this molecule in normal and pathological cells and tissues, while the function of E9PMD0 is linked to cell division and the regulation of the attachment of spindle microtubules to the kinetochore [36].
We also identified five other markers: ADAM28 (ENSG00000042980), Ras-related C3 botulinum toxin substrate 3 (Rac family small GTPase 3, RAC3, ENSG00000169750), targeting protein for Xenopus kinesin-like protein 2 (TPX2, ENSG00000088325), Ankrd13 family of ubiquitin-interacting motif (UIM)-containing proteins (Ankyrin repeat domain-containing protein 13B, ANKRD13B, ENSG00000198720), and TMEM232. Some of them were previously identified as potential cancer markers or targets for molecular anti-cancer therapies, bladder cancers among them.
ADAM28 belongs to the disintegrin and metalloprotease domain (ADAM) family. Its role in cancers is ambivalent: it promotes cancer cells’ proliferation, survival, migration, and metastasis by affecting neoangiogenesis, epithelial-to-mesenchymal transition, and extracellular matrix degradation, but in the tumor microenvironment it shows strong protective effects against deleterious metastasis dissemination [37]. In bladder cancers, ADAM28 may represent a possible biomarker, since it is overexpressed in bladder transitional cell carcinoma patients and detected in urine [38,39]. In our model, its higher expression was found in patients with low-risk cancers.
Another marker identified by our protocol, RAC3, is involved in neuronal development and in tumor progression, by modulating the organization of the cytoskeleton, cell migration, cell proliferation, and reactive oxygen species production. Its expression was found in different cancers, and it is considered a marker of poor prognosis and metastasis, as well as a target for molecular-targeted therapies in some human cancers, such as breast or lung (reviewed in [40]). In our model, the increased expression of RAC3 in high-risk cancers is in line with the existing knowledge and data published by Chen et al. [41]. It indicates that, in bladder cancer, this molecule can be a potential prognostic marker and a target for molecular medicine.
TPX2 is a microtubule-associated protein, involved in the assembly of mitotic spindles and in the cell cycle, cell proliferation, and apoptosis [42,43]. TPX2 was found in in silico studies to be related to the risk of distant metastasis of breast cancers [44]. In bladder cancer, TPX2 is involved in TPX2-mediated phosphorylation of the AURKA-PI3K-AKT axis [45]. In addition, heterogeneous nuclear ribonucleoprotein F, by regulating the TPX2 protein, promotes the cell cycle and proliferation of bladder cancer cells [46]. The proliferation of bladder cancer cells can also be regulated by the interplay between TPX2, p53, and GLIPR1 [47]. In our model, similar to Yan et al. [48], a higher expression of TPX2 was found in high-risk cancers. Thus, we conclude that TPX2 plays an important role in the progression of bladder cancers, including CIS in disease course, and represents a good potential marker for targeted therapy.
ANKRD13B is a ubiquitin-binding protein that specifically recognizes and binds Lys-63-linked ubiquitin and that is responsible for the internalization of ligand-activated EGFR [49]. In addition, it is involved in DNA methylation since ANKRD13B (like ANKRD13A and ANKRD13D) forms a complex with RNF11 (RING finger protein 11), which belongs to the Really Interesting New Gene (RING) E3 ligase family [49,50]. Based on our data, we suggest that ANKRD13B could act as a marker of high-risk bladder cancer, since its expression was significantly elevated in these cancers. It could also be a potential molecular target for anticancer therapies.
TMEM232 is a member of the transmembrane protein family (TMEMs), which comprises more than 300 proteins that are components of cellular membranes [51]. Proteins of this family are differentially expressed in cancers, but there is limited information on TMEM232. Published data have linked this protein with atopic dermatitis [52,53] or with multiple sclerosis [54]. In our model, the TMEM232 expression pattern was similar to that of ADAM28, with higher expression in low-risk cancers.
Using externally cross-validated results for the Random Forest classifier and a 75% threshold in our model (Table 3), the fraction of all patients assigned to the high-risk group was 23.6%, and the fraction of all CIS-DC cases assigned to the high-risk group was 49.9%, while the fraction of CIS-DC cases was 10.2% in the low-risk group and 32.9% in the high-risk group. The fractions obtained for the 75% threshold using cross validation with the naive Bayes, logistic regression, and Random Forest classifiers are similar, with very promising diagnostic results for Random Forest. The described method could aid clinicians in identifying high-risk bladder cancer (the risk of CIS in disease course). Thus, it offers a diagnostic tool that allows for the personalization of bladder cancer surveillance, more precise determination of treatment options, and the improvement of bladder cancer prognoses.
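The four fractions quoted above are linked by Bayes' rule, which can be checked with a few lines of Python. The sketch uses only numbers stated in the text (476 patients, 74 CIS-DC cases, 23.6% and 49.9%); the variable names are illustrative, and the last digit of each result is subject to rounding of the inputs.

```python
n_patients = 476          # cohort size
n_cis = 74                # patients who developed CIS in disease course
p_high = 0.236            # fraction of all patients assigned to the high-risk group
p_high_given_cis = 0.499  # fraction of CIS-DC cases assigned to the high-risk group

prevalence = n_cis / n_patients

# Bayes' rule: P(CIS | group) = P(group | CIS) * P(CIS) / P(group)
cis_rate_high = p_high_given_cis * prevalence / p_high
cis_rate_low = (1 - p_high_given_cis) * prevalence / (1 - p_high)

# How much more frequent CIS-DC is in the high-risk group than in the low-risk one.
relative_risk = cis_rate_high / cis_rate_low

print(f"high-risk: {cis_rate_high:.1%}, low-risk: {cis_rate_low:.1%}")
# prints "high-risk: 32.9%, low-risk: 10.2%"
```

The reported values are therefore mutually consistent, and CIS-DC is about 3.2 times more frequent in the high-risk group than in the low-risk group.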
To summarize, the identified genes can be used as markers of progression in urinary bladder cancers. Moreover, the increased expression of some identified proteins (RAC3, TPX2, ANKRD13B, and TMEM232) indicates their usefulness as potential targets in molecular-tailored therapies. Some of the markers require more detailed studies since their biological role, especially in cancer, is unknown, or the data are contradictory (ADAM28, TMEM232, DPY19L3-DT, and E9PMD0). We also conclude that, since only seven important genes were identified, their evaluation in routine diagnostic procedures is possible using immunohistochemistry or in situ hybridization. Such a panel would not burden laboratories with high costs and labor. Finally, a ready-to-use classifier based on the naive Bayes technique is presented in Appendix A and Appendix B, along with an example calculation to enable the research and diagnostics communities to readily analyze applicable data.