Article

Multivariate and Dimensionality-Reduction-Based Machine Learning Techniques for Tumor Classification of RNA-Seq Data

by Mahmood Al-khassaweneh 1,2,*, Mark Bronakowski 1 and Esraa Al-Sharoa 3

1 Engineering, Computing and Mathematical Sciences, Lewis University, Romeoville, IL 60446, USA
2 Computer Engineering Department, Yarmouk University, Irbid 21163, Jordan
3 Electrical Engineering Department, Jordan University of Science and Technology, Irbid 22110, Jordan
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(23), 12801; https://doi.org/10.3390/app132312801
Submission received: 30 August 2023 / Revised: 23 October 2023 / Accepted: 22 November 2023 / Published: 29 November 2023

Abstract
Cancer, a genetic disease, is considered one of the leading causes of death globally and affects people of all ages. Ribonucleic acid sequencing (RNA-Seq) is a technique used to quantify the expression of genes of interest and can be used to classify cancer tumor types. This paper describes a machine learning technique to classify cancer tissue samples by tumor type, such as breast cancer, lung cancer, colon cancer, and others. More than 60,000 RNA-Seq features were analyzed using six different machine learning classification algorithms, both individually and as an ensemble. Numerous dimensionality reduction techniques were applied to address the challenges of working with enormous amounts of genetic data. In particular, the number of features was reduced from over 60,000 to 660 using random forest feature selection, and further to 68 factor features using factor analysis, while achieving an accuracy of 99% in classifying tumor types.

1. Introduction

Classification is the process of categorizing or grouping data based on specific criteria. In machine learning and data science, classification is a type of supervised learning algorithm used to predict the class or category of a given observation based on its input features [1].
In classification, a training dataset is used to create a model that can be used to classify new data. The training dataset contains labeled examples of input features and their corresponding output labels or classes. A model is trained on this dataset to learn patterns and relationships between the input features and output classes.
Once the model is trained, it can be used to predict the class or category of new, unseen data by feeding in the input features and using the learned relationships to determine the corresponding output label or class.
Classification is commonly used in a variety of applications, such as spam detection, sentiment analysis, image recognition, and medical diagnosis. There are many different algorithms and techniques used for classification, including logistic regression, decision trees, support vector machines, and neural networks. The choice of algorithm depends on the type and complexity of the data, as well as the desired level of accuracy and interpretability [2].
Cancer classification using machine learning is a rapidly evolving field that involves the development and application of algorithms to automatically classify cancers based on their characteristics. Machine learning algorithms use patterns in data to learn and make predictions, and they have shown great promise in improving cancer diagnosis and treatment.
There are several approaches to cancer classification using machine learning. One common approach is to use supervised learning algorithms, which are trained on labeled data to predict the class of new, unlabeled data. In the context of cancer classification, labeled data may include images of cancer cells or tissue samples that have been annotated with their corresponding cancer type or stage. Supervised learning algorithms, such as decision trees, random forests, and support vector machines, can be used to automatically classify new cancer samples based on their features, such as gene expression profiles, histopathological images, or radiographic images.
Another approach to cancer classification using machine learning is to use unsupervised learning algorithms, which do not require labeled data. Instead, unsupervised learning algorithms identify patterns and groupings in the data on their own. Clustering algorithms, such as k-means or hierarchical clustering, can be used to group cancer samples based on their similarities, allowing researchers to identify subtypes of cancer or to group patients with similar characteristics for personalized treatment.
Deep learning algorithms, such as convolutional neural networks (CNNs) [3], have also been used for cancer classification. These algorithms are particularly effective at analyzing images, such as histopathological or radiographic images, and can automatically learn features that are important for distinguishing between different cancer types or stages.
Overall, cancer classification using machine learning has the potential to improve cancer diagnosis, treatment, and research by providing more accurate and personalized predictions. However, it is important to ensure that these algorithms are validated on diverse and representative datasets to avoid biases and ensure their reliability in clinical practice.
Cancer is one of the leading causes of death worldwide, accounting for millions of deaths every year [4,5]. Cancer is a genetic disease, caused by changes to genes that trigger cells to grow uncontrollably [6]. A gene is a sequence of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA) that exists in every cell of the body and carries an individual’s genetic code. RNA acts as a messenger, carrying genetic information from DNA for encoding, transmitting, and expressing genetic information into proteins [7].
DNA sequencing is the process of determining the nucleic acid sequence in DNA. Comparing healthy and mutated DNA sequences can diagnose different diseases, including various cancers. DNA sequencing used to take years, but now it can be performed in hours with the advent of precision medicine and next-generation sequencing [8]. RNA-Sequencing (RNA–Seq) is a sequencing technique used to quantify the expression of genes and characterize their sequences at the same time. RNA–Seq gene expressions can be used to classify cancer tumor types [9].
Previous studies have used machine learning to classify tumor types based on RNA-Seq data. For example, in [10], the authors proposed using a convolutional neural network (CNN) to classify breast cancer. They achieved an accuracy of 98.76% by selecting the hyperparameters that gave the best performance for the proposed CNN model. In [11], the authors converted RNA-Seq values into 2D images and used them to extract features to train a deep learning model. Another deep learning method was proposed in [12], in which the RNA-Seq values were also converted into 2D images; the authors used augmentation techniques to increase the dataset 5-fold and, by applying a CNN model, achieved an accuracy of 96.90%. In [13], the authors conducted an RNA-Seq analysis for tumor classification. The proposed analysis focused on only five target tumor types (prostate, lung, breast, kidney, and colon), but utilized a similar 60,483 RNA-Seq feature dataset. However, the tumor sample set obtained in [13] was unbalanced across the five tumor classes. Extremely unbalanced classes can skew the classification accuracy in favor of the majority class. The authors applied a popular over-sampling technique called the Synthetic Minority Oversampling Technique (SMOTE), which creates “synthetic” examples rather than over-sampling with replacement. With the application of SMOTE, the tumor sample set was balanced at 1500 total samples (300 for each tumor type). They also employed dimensionality reduction, using Principal Component Analysis (PCA). For modeling, the authors of [13] used a broad range of classification models for comparison.
The authors of [14] conducted a study focused on classifying five subtypes of breast cancer (BRCA) using RNA-Seq and machine learning. They compared a broad range of classification models, five of which overlap with the models used in our study. Their dataset consisted of 4731 samples × 19,737 RNA-Seq features. They did not utilize any dimensionality reduction techniques, instead electing to train and test on the full feature set.
In [15], the authors utilized a deep learning model to classify 33 tumor types. Their dataset consisted of 10,267 samples × 20,531 RNA-Seq features. They normalized the RNA-Seq features during preprocessing and applied dimensionality reduction using a process similar to a high correlation filter, building a reduced gene feature set from the top pairings of highly correlated RNA-Seq expressions to target tumor types. This process, which they called mutual information (MI), yielded a reduced set of 3600 features. They then converted the RNA-Seq feature set into an image format for modeling with a Dense Network Convolutional Neural Network (DenseNetCAM) model. The authors of [16], in their RNA-Seq study, utilized the same dataset as [15] except for two tumor types (esophageal (ESCA) and stomach (STAD) cancer). They did not, however, employ any dimensionality reduction, and they utilized a single classification model, k-Nearest Neighbors (k = 5).
The authors of [17], similar to [15], utilized two-dimensional image transformations of the RNA-Seq data to classify 33 tumor types, using the same 10,267 samples × 20,531 RNA-Seq feature dataset. They normalized the gene feature values and then applied the dimensionality reduction technique of a low variance filter to reduce the RNA-Seq gene features to 10,381. Next, they reformatted the RNA-Seq gene features as two-dimensional images and classified the tumors using a convolutional neural network model.
These previous studies, however, have limitations. For example, many do not utilize dimensionality reduction to reduce and optimize the number of gene features needed for modeling. In fact, to the best of our knowledge, no previous technique has utilized factor analysis to reduce the number of features. Although some of these studies produced a high classification accuracy, they did so at the expense of longer computational times and, ultimately, more complex models. Other studies did not consider the impact of different machine learning techniques on the classification accuracy.
The method presented in this paper addresses these limitations by not only finding the optimal machine learning model for tumor classification accuracy, but also by minimizing the number of gene RNA-Seq features required to achieve that accuracy. We used several reduction techniques, including factor analysis, for feature reduction. Moreover, the tumor classification accuracy achieved by the optimal method outlined in this paper exceeds that of all comparable previous studies.
In particular, this paper presents an effective method to classify cancer tissue samples by tumor type (breast cancer, lung cancer, colon cancer, etc.) using machine learning techniques. Over 60,000 gene RNA–Seq features were analyzed and six different machine learning classification algorithms were explored, individually and as an ensemble. The classification algorithms evaluated were logistic regression, k-Nearest Neighbor, decision tree, random forest, neural network, and support vector machine. To train, test, and validate these techniques, a dataset of 5400 tumor samples was used, representing 18 different cancer tumor types. RNA-Seq gene datasets involve enormous amounts of genetic data, which make analysis and predictive modeling both computationally intensive and time-consuming.
This paper is organized as follows. Section 2 provides the details of the composition of the dataset. Challenges with the very large feature set are tackled through the preprocessing stages detailed in Section 3. Classification using individual and ensemble models, along with results validation, is examined in Section 4. Model accuracy comparisons with related studies are explored in Section 5. The conclusions of the paper are offered in Section 6.

2. Dataset

The dataset used for this project was obtained from the National Cancer Institute’s (NCI’s) Pilot 1 Tumor Classifier project (TC1) [18] and downloaded from the NCI Model and Data Clearinghouse (MoDaC) [19]. This TC1 dataset consists of 5400 tumor samples from the Genomic Data Commons (GDC). The 5400 samples are composed of 300 samples each of 18 different tumor types (breast cancer, lung cancer, colon cancer, etc.). A detailed description of the TC1 project team’s efforts to compile and format the TPM (transcripts per kilobase million) gene expression data from the GDC can be found on the Pilot 1 TC1 GitHub page [20]. The dataset contains 60,483 RNA-Seq gene expression features formatted as TPM-scaled values. A representation of the dataset is shown in Table 1. Tumor codes are mapped to their cancer types in Table 2.

3. Preprocessing

The immense size of the RNA–Seq expressed gene feature set creates unique data analysis challenges. On the one hand, having large quantities of high-quality data means more data for the machine learning algorithms to learn from. On the other hand, analyzing and modeling very large datasets is computationally expensive, time-consuming, and possibly unmanageable for the computer system available for the task. All of these factors emphasize the need for robust preprocessing of the data. In addition to traditional data preprocessing for machine learning, it is important to focus on data dimensionality reduction, which involves reducing the number of features in a dataset. These two areas are discussed in detail below.

3.1. Data Transformation

Although the RNA–Seq gene expression feature values are already formatted on a homogeneous TPM scale, most machine learning algorithms perform better if the data are normalized. Consequently, all the feature values were normalized to a scale of 0 to 1. As an initial feature reduction effort, the gene expression features were filtered to remove genes with zero expression across all samples (i.e., columns containing all zeros). This resulted in the removal of 1939 gene column features, leaving a new baseline gene dataset of 58,544 genes × 5400 samples. The histogram in Figure 1 shows the distribution of the means (averages) of the individual gene feature columns after normalization. In preparation for additional analysis and modeling, the tumor code target feature column was one-hot encoded into 18 separate target tumor features.
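A minimal sketch of this transformation stage follows; the file and column names are hypothetical (the TC1 project’s own data preparation code lives in the repository cited in [20]):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical layout: rows = tumor samples, columns = TPM-scaled gene
# features, plus a "tumor_code" target column. The file name is illustrative.
df = pd.read_csv("tc1_tpm_expression.csv")

X = df.drop(columns=["tumor_code"])
y = df["tumor_code"]

# Drop genes with zero expression across all samples (all-zero columns);
# this removed 1939 columns in our data, leaving 58,544 genes.
X = X.loc[:, (X != 0).any(axis=0)]

# Normalize every remaining gene feature to the [0, 1] range.
X = pd.DataFrame(MinMaxScaler().fit_transform(X), index=X.index, columns=X.columns)

# One-hot encode the tumor code into 18 separate target columns.
y_onehot = pd.get_dummies(y, prefix="tumor", dtype=float)
```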

3.2. Dimensionality Reduction

Dimensionality reduction is a technique used to simplify complex high-dimensional data into a lower-dimensional representation while retaining as much information as possible and maintaining or improving the machine learning model’s performance. Moreover, it reduces the effort and cost of the computation. The authors of [21] provide an excellent summary of 12 popular dimensionality reduction techniques, as shown in Figure 2. There are two main types of dimensionality reduction: (1) feature selection techniques, which keep only the most relevant of the original variables, and (2) feature extraction techniques, which transform the original set of features into a new, compact set that retains the most significant information from the original. In particular, the new features represent combinations of the original features and capture the key relationships in the data while discarding irrelevant information. Four of the dimensionality reduction methods from Figure 2 were used to create multiple separate reduced datasets for classification modeling. Additionally, variations of the high correlation filter method were used to produce three other reduced datasets for evaluation.

3.2.1. Low Variance Filter

This reduction method filters out features with a low variance compared to other features. The logic is that these low-variance features have values that are virtually constant across all the samples and do not influence the target variable. For the gene dataset, these low-variance features can be thought of as housekeeping genes that have similar expression levels across all samples and have limited utility for developing predictive models. The histogram in Figure 3 depicts the distribution of the variance for the gene dataset and shows that a large proportion of the features have low variance. When the low variance filter was applied to the baseline gene dataset, all gene feature columns with a variance of less than 0.02 were dropped. This resulted in the removal of 23,211 gene column features, leaving a low-variance-filtered dataset of 35,333 gene features for classification modeling.
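A minimal sketch of this step with sklearn, assuming X is the normalized baseline gene DataFrame from Section 3.1:

```python
from sklearn.feature_selection import VarianceThreshold

# Drop gene columns whose variance falls below 0.02 (our cutoff).
selector = VarianceThreshold(threshold=0.02)
selector.fit(X)
X_lowvar = X.loc[:, selector.get_support()]   # ~35,333 gene features retained
```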

3.2.2. High Correlation Filter

A high correlation score between two features indicates that they carry related information. This high correlation between two independent variables is termed multicollinearity. Multicollinearity can drastically damage the performance and accuracy of some regression models, such as linear and logistic regression models. A high correlation is defined as an r-value greater than 0.6; if the correlation between a pair of feature columns is greater than 0.6, one of the two features should be dropped. The exception to this rule is that feature columns with a high correlation to a target feature should be retained. When the high correlation filter was applied to the baseline gene dataset, 31,893 feature column pairs were flagged as having multicollinearity. Dropping the features with the lowest correlation to the tumor target features from each pair and accounting for duplicate gene features yielded a high-correlation-filtered dataset of 46,688 gene features for classification modeling.
Three variations of the high correlation filter were also used to produce additional reduced feature datasets. For the first variation, the low variance filter was applied to the results of the high correlation filter, yielding a reduced dataset of 26,963 gene features. The second variation involved retaining only gene feature columns with moderate to high correlations (|r| ≥ 0.5) to the 18 tumor target features. This resulted in a reduced dataset of 3928 gene features. The third variation entailed compiling the top 250 correlated gene column features for each of the 18 tumor target features. This yielded a dataset of 4082 gene features.
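An illustrative sketch of the pairwise filter and the |r| ≥ 0.5 variation follows. On the full 58,544-gene matrix the pairwise correlation matrix would be prohibitively large, so assume X here is a manageable subset or that the computation is chunked; X and y_onehot are from the sketch in Section 3.1:

```python
import numpy as np

# Per-gene strength of association with the tumor targets: the best absolute
# correlation of each gene column with any of the 18 one-hot target columns.
target_corr = X.apply(lambda gene: y_onehot.corrwith(gene).abs().max())

# Pairwise filter: for every gene pair with |r| > 0.6, drop the member that
# is less correlated with the tumor targets.
corr = X.corr().abs().to_numpy()
rows, cols = np.where(np.triu(corr, k=1) > 0.6)   # upper triangle, i < j
to_drop = {
    X.columns[i] if target_corr.iloc[i] < target_corr.iloc[j] else X.columns[j]
    for i, j in zip(rows, cols)
}
X_hicorr = X.drop(columns=list(to_drop))

# Second variation: keep only genes with moderate-to-high target
# correlation (|r| >= 0.5).
X_target = X.loc[:, target_corr >= 0.5]
```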

3.3. Random Forest Top Feature ID

This feature reduction technique employs a random forest regression model to generate a large set of decision trees against the target variable; the model’s built-in importance analysis then identifies the most significant features, which are retained to reduce the dimensionality of the data. This method was employed on the gene dataset separately for each of the 18 tumor target features, and the top gene features for each tumor type (see Figure 4) were retained in a new random forest top features dataset of 660 gene features for classification modeling. A summary of all the dimensionality reduction datasets compiled for classification modeling and evaluation is given in Table 3.
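A sketch of the per-tumor selection, assuming the one-hot targets from Section 3.1. The number of genes kept per tumor varied in our run (Figure 4), so the fixed top-50 below is only an illustrative stand-in:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

top_genes = set()
for tumor in y_onehot.columns:
    # One regression forest per tumor target; built-in importances rank the genes.
    rf = RandomForestRegressor(n_estimators=100, random_state=0, n_jobs=-1)
    rf.fit(X, y_onehot[tumor])
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    # Retain the top genes for this tumor (the count per tumor is an assumption).
    top_genes.update(importances.nlargest(50).index)

X_rf_top = X[sorted(top_genes)]   # 660 genes in our run
```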

3.4. Factor Analysis

Exploratory factor analysis (EFA) is a technique used to summarize a large set of features as linear combinations of a smaller number of unobserved latent factors. EFA seeks to describe the unseen underlying relationships between features. The EFA model produces a matrix of factor loadings/weights (similar to the coefficients in a regression model). Each factor loading can be interpreted as the proportion of a feature’s variation that is explained by the factor [22]. As a rule of thumb, only factor loadings with an absolute value of at least 0.4 should be used in the transformation of the baseline gene feature dataset into the factor dataset. Additionally, each feature should ideally load cleanly onto only one factor [23].
An EFA was only performed on two preprocessed datasets: the top-250 correlations per tumor dataset (4082 features) and the random forest top features dataset (660 features). Computational limits prevented utilizing an EFA for larger feature datasets.
To confirm that a factor analysis was indeed feasible for the given datasets, the Bartlett Sphericity Test and the Kaiser–Meyer–Olkin (KMO) Test were used.
  • Bartlett Sphericity Test: The Bartlett Sphericity Test [24] checks whether or not the features (observed variables) are intercorrelated by comparing the observed correlation matrix and the identity matrix. If the two are not the same, the test will be significant. For the test of our feature set, the p-value was 0, signifying that a factor analysis is feasible.
  • Kaiser–Meyer–Olkin (KMO) Test: The KMO test estimates the proportion of variance among all the observed variables [24]. KMO values range between 0 and 1, with a value of 0.6 or more indicating a factor analysis is feasible. For the test of our feature set, the KMO value was 1.0, again indicating that a factor analysis is feasible.
The last step before factor analysis modeling is to determine the number of factors to use in the model. One approach to determining the number of factors is the Kaiser criterion [24]. The Kaiser criterion is an analytical approach which is based on the selection of factors that explain a more significant proportion of variance. The eigenvalue is used as an index for the variance as a portion of the total variance and it reflects the quality of a component as a summary of the data. An eigenvalue of 1.0 indicates that the corresponding factor captures as much variation in the data as a single feature. Generally, an eigenvalue greater than 1 is considered a good selection criterion for a factor. However, with larger feature sets, the eigenvalue cutoff number is more subjective. An alternate approach to determining the number of factors graphically is using a scree plot, which is a plot of the eigenvalues and factor numbers (a graphical representation of the Kaiser criterion). The inflection point or “elbow” in the curve on the scree plot, just before the line flattens out, corresponds to the optimum number of factors to select.
The random forest top features dataset (660 features) was used as an illustration. The Kaiser criterion numerical analysis yielded 68 eigenvalues greater than 1.0, indicating that 68 factors should be considered for the EFA. The scree plot for the same dataset is displayed in Figure 5. The “elbow” in the scree plot curve also indicates 68 factors as the optimum choice.
Python’s factor_analyzer package with a varimax orthogonal rotation was used to compute the factor loadings for these 68 factors. The factor loadings indicate how well a factor explains a feature: a low loading indicates the feature does not belong to the factor, while a high loading indicates that it does. A factor loading of 0.40 was used as the cutoff for pairing features to factors; not all of the features loaded highly enough to be included in a factor. New factor values were computed by summing the products of the feature values and their corresponding factor loadings for each instance (sample) in the dataset. Each new factor was then normalized, producing a new factor dataset for classification modeling.
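These steps can be sketched with the factor_analyzer package. Everything beyond the varimax rotation and the 0.40 loading cutoff (solver defaults, variable names) is an assumption:

```python
import pandas as pd
from factor_analyzer import (FactorAnalyzer,
                             calculate_bartlett_sphericity, calculate_kmo)

# X_rf_top: the 660-gene random forest top features dataset from above.
chi2, p_value = calculate_bartlett_sphericity(X_rf_top)   # p ~ 0 -> intercorrelated
kmo_per_item, kmo_total = calculate_kmo(X_rf_top)         # >= 0.6 -> EFA feasible

# Kaiser criterion: the number of eigenvalues above 1.0 sets the factor count.
# A scree plot of `eigenvalues` (Figure 5) gives the same answer graphically.
eig = FactorAnalyzer(rotation=None).fit(X_rf_top)
eigenvalues, _ = eig.get_eigenvalues()
n_factors = int((eigenvalues > 1.0).sum())                # 68 in our run

# Final EFA with a varimax orthogonal rotation; zero out loadings below |0.40|.
fa = FactorAnalyzer(n_factors=n_factors, rotation="varimax").fit(X_rf_top)
loadings = pd.DataFrame(fa.loadings_, index=X_rf_top.columns)
loadings = loadings.where(loadings.abs() >= 0.40, 0.0)

# Factor values per sample: sum of feature value x loading, then 0-1 normalization.
factors = pd.DataFrame(X_rf_top.to_numpy() @ loadings.to_numpy(), index=X_rf_top.index)
factors = (factors - factors.min()) / (factors.max() - factors.min())
```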

4. Modeling and Validation

Table 3 contains a list of the primary gene datasets that were used in modeling. In preparation for modeling and validation, the datasets were split uniformly into 80% for training and 20% for testing, ensuring an even representation of each tumor type (i.e., balanced, uniformly sampled).
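A sketch of this split with sklearn, where X_reduced stands for any of the Table 3 datasets and y holds the tumor codes:

```python
from sklearn.model_selection import train_test_split

# Stratifying on the tumor code keeps each of the 18 classes evenly
# represented in both the 80% training and 20% test partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.20, stratify=y, random_state=42)
```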

4.1. Individual Modeling

The following six individual classification models were selected from Python’s sklearn machine learning library for use in the classification of the RNA–Seq gene expression datasets.

4.1.1. Logistic Regression (LogReg)

Logistic regression is a classification method that models the probability that a given data point belongs to a specific category using a sigmoid function (unlike linear regression which uses a linear function) [25]. Multinomial logistic regression is a classification method that generalizes logistic regression to a multiclass problem, and it is the method used here. Optimum classification results for the sklearn logistic regression model were achieved by configuring the optimization solver to Limited-memory BFGS (lbfgs), a quasi-Newton method.

4.1.2. k-Nearest Neighbor (kNN)

kNN classification is an instance-based learning algorithm. The kNN algorithm calculates the distance between the test data point and all the training data points. The class for the test data point is then predicted by a simple majority vote among the k training points with the smallest distances to it [25]. Optimum classification results for the sklearn KNeighborsClassifier model were achieved by setting the number of nearest neighbors to k = 5.

4.1.3. Decision Tree (DTree)

A decision tree is a classification method that generates a tree-like structure consisting of a root node, branches, internal nodes, and leaf nodes. The algorithm starts at the root node and recursively splits the data into smaller subsets based on feature tests until it reaches a leaf node. More precisely, each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) denotes a class prediction [25]. For the sklearn DecisionTreeClassifier model, the Gini index was found to be the optimal test criterion for branching with the gene dataset.

4.1.4. Random Forest (RF)

Random forest is an ensemble model that iteratively creates decision trees with changing parameters to improve predictive accuracy and control over-fitting [25]. For the sklearn RandomForestClassifier model, entropy was found to be the optimal test criterion for branching with the gene dataset.

4.1.5. Neural Network Multilayer Perceptron (NN)

A multilayer perceptron is a fully connected class of feedforward artificial neural networks. Neural networks consist of many artificial neurons arranged in layers, with a weight associated with each neuron, and a feedforward network can contain any number of hidden layers. Multilayer perceptron neural networks have the advantage of being able to learn non-linear models [25]. Optimum classification results for the sklearn MLPClassifier model were achieved with a hidden layer size of 100 and the solver for weight optimization set to Adam, an extension of stochastic gradient descent.

4.1.6. Support Vector Machine

The support vector machine algorithm finds a hyperplane that maximizes the separation of the data points into their potential classes in an n-dimensional space. The dimension of the hyperplane depends on the number of features: with three input features the hyperplane is a 2D plane, and when the number of features exceeds three, as in the case of the gene dataset, the hyperplane becomes difficult to visualize. For the sklearn support vector classification (SVC) model, care must be taken when selecting the kernel function to avoid over-fitting when the number of features is much greater than the number of samples (as is the case with the gene dataset) [25]. Optimum classification results for the sklearn SVC model were achieved using the radial basis function (RBF) kernel.
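For reference, a sketch gathering the six sklearn models with the settings reported above. Unlisted parameters are library defaults, max_iter is raised as a practical convergence assumption, and the MLP line assumes the standard sklearn reading of a single hidden layer of 100 neurons:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

models = {
    "LogReg": LogisticRegression(solver="lbfgs", max_iter=1000),
    "kNN":    KNeighborsClassifier(n_neighbors=5),
    "DTree":  DecisionTreeClassifier(criterion="gini"),
    "RF":     RandomForestClassifier(criterion="entropy"),
    "NN":     MLPClassifier(hidden_layer_sizes=(100,), solver="adam"),
    "SVC":    SVC(kernel="rbf"),
}

# Fit each model on the stratified training split and report test accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name} test accuracy: {model.score(X_test, y_test):.3f}")
```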

4.2. Ensemble Modeling

The top three most accurate models (logistic regression, neural network MLP, and support vector machine) were combined into an ensemble model. The ensemble model was run 10 times, with a new random 80:20 training/test dataset split each time. This methodology of training different learning algorithms on the same training set to generate the base classifiers is known as a heterogeneous ensemble. The accuracy for each run was computed through majority voting, selecting the class label predicted most frequently among the base models. The run accuracies were then averaged for an overall combined accuracy. Figure 6 shows the ensemble model flow scheme.
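A sketch of this scheme, with base-model settings as in Section 4.1 and X_reduced and y as before:

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

accuracies = []
for run in range(10):
    # Fresh stratified 80:20 split for every run.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_reduced, y, test_size=0.20, stratify=y, random_state=run)

    # Train the three base classifiers and stack their test predictions.
    preds = np.vstack([
        m.fit(X_tr, y_tr).predict(X_te)
        for m in (LogisticRegression(solver="lbfgs", max_iter=1000),
                  MLPClassifier(hidden_layer_sizes=(100,), solver="adam"),
                  SVC(kernel="rbf"))
    ])

    # Hard majority vote across the three rows of predictions.
    majority = stats.mode(preds, axis=0, keepdims=False).mode
    accuracies.append(float(np.mean(majority == np.asarray(y_te))))

print(f"Mean ensemble accuracy over 10 runs: {np.mean(accuracies):.3f}")
```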

4.3. Validation Results

All models were evaluated and compared based on classification accuracy. Accuracy is defined as the number of correct predictions divided by the total number of predictions. The classification error rate is the proportion of observations that have been misclassified: Error rate = 1 − Accuracy. Model accuracy can be computed from the model’s resulting confusion matrix. The confusion matrix is a table used to evaluate the performance of a classification model. It provides a summary of the number of correct and incorrect predictions made by the model compared to the actual outcomes (i.e., ground truth). A confusion matrix typically consists of four parts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):
  • True positives (TPs): These are the cases where the model correctly predicts a positive outcome when the actual outcome is positive.
  • True negatives (TNs): These are the cases where the model correctly predicts a negative outcome when the actual outcome is negative.
  • False positives (FPs): These are the cases where the model incorrectly predicts a positive outcome when the actual outcome is negative.
  • False negatives (FNs): These are the cases where the model incorrectly predicts a negative outcome when the actual outcome is positive.
Using the above definitions from the confusion matrix, the accuracy is calculated by:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
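For a multiclass confusion matrix, the correct predictions lie on the diagonal, so the same calculation is a one-liner; clf below stands for any fitted classifier from Section 4.1:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Diagonal = correct predictions; everything off-diagonal is a misclassification.
cm = confusion_matrix(y_test, clf.predict(X_test))
accuracy = np.trace(cm) / cm.sum()   # e.g., 1065 / 1080 ≈ 0.986 in Figure 7
```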
Figure 7 shows the confusion matrix for the Support Vector Classifier model’s results on the factor-analyzed random forest top feature reduced dataset (68 factor features). The confusion matrix reveals that there were 1065 correct predictions (15 wrong) out of 1080 classification predictions. This equates to an accuracy of approximately 99%.
The chart in Figure 8 graphically displays the accuracies of all the models and the various gene datasets. The nine dataset categories are along the x-axis. The color lines depict the accuracies of the different models (see legend). The vertical bars show the gene/factor counts for the datasets. The left y-axis shows the model accuracy scale and the right y-axis shows the gene/factor count scale for the datasets.
The baseline gene dataset, with its 58,544 features, yielded an accuracy of 98% with the logistic regression model. This set the accuracy baseline against which the other datasets were measured. The logistic regression model was a consistently strong performer across all reduced gene feature datasets, with accuracies steady at 98%. The k-Nearest Neighbor and decision tree models had the worst accuracy and the largest accuracy fluctuations across the different reduced gene feature datasets. The other four models, including the ensemble model, performed in the middle of the pack, with accuracies between 96 and 97% for the larger gene feature datasets. However, as the feature size dropped, every model’s accuracy increased. The Support Vector Classification model yielded the overall best accuracy of 99% with the 68 factors dataset from the random forest top features.
The consistent accuracy performance of the top models demonstrates the effectiveness of the employed dimension reduction techniques. This improved performance can partially be attributed to the reduced feature datasets preventing model over-fitting. Large feature datasets can result in a model that is too complex and learns irrelevant information or “noise” within the dataset. When the model memorizes the noise and fits too closely to the training set, it becomes “overfitted” and cannot generalize well to new data. The smaller datasets with highly correlated features performed better than the larger datasets with more features. Moreover, the power of the exploratory factor analysis and its ability to represent latent relationships between features is proven by the exceptional 99% accuracy.
Model performance was further evaluated by assessing execution times on the changing feature set sizes. The execution times of the neural network, logistic regression, and SVC models were the most affected by the feature set size. The neural network execution time for the baseline gene dataset with 58,544 features (no dimensionality reduction) was 14 min 58 s. The logistic regression and SVC models had the next worst execution times on this feature set, at 2 min 35 s and 2 min 2 s, respectively. With reduced feature set sizes, all three of these slower models showed significant improvements in speed. The decision tree, random forest, and kNN model execution times were the least affected by feature size. The decision tree and random forest models had execution times consistently below 1 min regardless of the feature size, and the kNN model consistently executed in less than 1 s on all feature set sizes; however, the kNN model had poorer accuracy than the slower models. Once the feature set size was reduced below 4000, all models exhibited execution times below 15 s. These results add further credence to the benefits of dimensionality reduction. The computational time cost of the dimensionality reduction process itself, identifying the optimal dataset features for classification, is a one-time cost that does not affect the prediction model’s execution time on new data.

5. Related Work Comparisons

Accuracy and performance comparisons from related studies are shown in Table 4. Our study and all comparative studies obtained their cancer RNA-Seq data from the same source, The Cancer Genome Atlas (TCGA) project; the studies differ in which subset of the available data they elected to use. The Tumor Classifier Project (TC1) [26] and Bonat et al.’s [13] study drew on the same dataset as ours.
The TC1 project did not filter out genes with zero expression across all samples, nor did it normalize gene expression beyond the TPM scale during data preprocessing. Additionally, dimensionality reduction was not utilized. TC1 used a one-dimensional convolutional neural network model to classify the entire 5400 samples × 60,483 RNA-Seq feature dataset, with an overall accuracy of 98%. Their result was used as the initial accuracy benchmark for our study.
Bonat et al. [13] achieved their best accuracy (99%) with a Light Gradient Boosting Machine model. The Support Vector Classification model in [13] (similar to our top model) yielded an accuracy of only 96%, compared to our 99%.
Table 4 also shows that, in [14], the authors achieved their best accuracy (89%) with a logistic regression model. The highest accuracy for [15] was 97%. Both [16] and [17] achieved an overall classification accuracy of 92%.
Table 5 contains comparisons of individual tumor classification accuracies from related studies. Across all the studies, bladder (BLCA), cervical (CESC), both types of kidney (KIRC, KIRP), and lung squamous cell (LUSC) cancers were the most challenging to classify accurately, with accuracies below 98%. Overall, our Support Vector Classification model applied to the 68 factors dataset from the random forest top features outperformed the other studies across the majority of individual tumors.

6. Conclusions

RNA–Seq gene expressions can be used effectively to classify cancer tumor types. One computational challenge is tackling the enormous gene feature set size. Dimensionality reduction techniques can be leveraged to reduce the number of features while retaining as much information as possible and maintaining or improving the machine learning model’s performance. Exploratory factor analysis is one notable, but often overlooked, dimensionality reduction technique that can greatly aid in reducing the feature dimensions. Through such techniques, a classification accuracy of 99% was achieved utilizing a Support Vector Classification model on a reduced dataset of 68 factors (combining 660 gene features).

Author Contributions

Conceptualization, M.B., M.A.-k. and E.A.-S.; Methodology, M.B., M.A.-k. and E.A.-S.; Software, M.B. and M.A.-k.; Validation, M.B., M.A.-k. and E.A.-S.; Formal analysis, M.B. and M.A.-k.; Investigation, M.B. and M.A.-k.; Resources, M.B. and M.A.-k.; Data curation, M.B. and M.A.-k.; Writing—original draft, M.B. and M.A.-k.; Writing—review & editing, M.B., M.A.-k. and E.A.-S.; Visualization, M.B., M.A.-k. and E.A.-S.; Supervision, M.A.-k. and E.A.-S.; Project administration, M.A.-k. and E.A.-S.; Funding acquisition, M.A.-k. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Dr. Scholl Foundation, Lewis University (grant number 229028), Yarmouk University, and Jordan University of Science and Technology.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used for this project was obtained from the National Cancer Institute’s (NCI’s) Pilot 1 Tumor Classifier project (TC1) [18] and downloaded from the NCI Model and Data Clearinghouse (MoDaC) [19].

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BFGS: Broyden–Fletcher–Goldfarb–Shanno algorithm
DNA: Deoxyribonucleic acid
DTree: Decision tree
EFA: Exploratory factor analysis
GDC: Genomic Data Commons
KMO: Kaiser–Meyer–Olkin
kNN: k-Nearest Neighbor
LogReg: Logistic regression
NCI: National Cancer Institute
NN: Neural network multilayer perceptron
RBF: Radial basis function
RF: Random forest
RNA: Ribonucleic acid
RNA-Seq: Ribonucleic acid sequencing
SMOTE: Synthetic Minority Oversampling Technique
SVC: Support Vector Classification
TC1: Pilot 1 Tumor Classifier project
TPM: Transcripts per kilobase million

References

  1. Bronakowski, M.; Al-khassaweneh, M.; Al Bataineh, A. Automatic Detection of Clickbait Headlines Using Semantic Analysis and Machine Learning Techniques. Appl. Sci. 2023, 13, 2456. [Google Scholar] [CrossRef]
  2. Huette, J.; Al-Khassaweneh, M.; Oakley, J. Using Machine Learning Techniques for Clickbait Classification. In Proceedings of the 2022 IEEE International Conference on Electro Information Technology (eIT), Romeoville, IL, USA, 19–21 May 2022; pp. 091–095. [Google Scholar]
  3. Al Bataineh, A.; Kaur, D.; Al-khassaweneh, M.; Al-sharoa, E. Automated CNN Architectural Design: A Simple and Efficient Methodology for Computer Vision Tasks. Mathematics 2023, 11, 1141. [Google Scholar] [CrossRef]
  4. Siegel, R.L.; Miller, K.D.; Wagle, N.S.; Jemal, A. Cancer statistics. CA Cancer J. Clin. 2023, 73, 17–48. [Google Scholar] [CrossRef]
  5. O’keefe, W.; Ide, B.; Al-Khassaweneh, M.; Abuomar, O.; Szczurek, P. A cnn approach for skin cancer classification. In Proceedings of the 2021 International Conference on Information Technology (ICIT), Amman, Jordan, 14–15 July 2021; pp. 472–475. [Google Scholar]
  6. Available online: https://www.cancer.gov/about-cancer/understanding/what-is-cancer (accessed on 4 December 2022).
  7. Available online: https://www.genome.gov/genetics-glossary/RNA-Ribonucleic-Acid (accessed on 4 December 2022).
  8. Behjati, S.; Tarpey, P.S. What is next generation sequencing? Arch. Dis. Child.-Educ. Pract. 2013, 98, 236–238. [Google Scholar] [CrossRef] [PubMed]
  9. Mardis, E.R. DNA sequencing technologies: 2006–2016. Nat. Protoc. 2017, 12, 213–218. [Google Scholar] [CrossRef]
  10. Elbashir, M.K.; Ezz, M.; Mohammed, M.; Saloum, S.S. Lightweight convolutional neural network for breast cancer classification using RNA-seq gene expression data. IEEE Access 2019, 7, 185338–185348. [Google Scholar] [CrossRef]
  11. Rukhsar, L.; Bangyal, W.H.; Ali Khan, M.S.; Ag Ibrahim, A.A.; Nisar, K.; Rawat, D.B. Analyzing RNA-seq gene expression data using deep learning approaches for cancer classification. Appl. Sci. 2022, 12, 1850. [Google Scholar] [CrossRef]
  12. Khalifa, N.E.M.; Taha, M.H.N.; Ali, D.E.; Slowik, A.; Hassanien, A.E. Artificial intelligence technique for gene expression by tumor RNA-Seq data: A novel optimized deep learning approach. IEEE Access 2020, 8, 22874–22883. [Google Scholar] [CrossRef]
  13. Bonat, E. Available online: https://medium.com/@ernest-bonat/rna-seq-gene-expression-classification-using-machine-learning-algorithms-de862e60bfd0 (accessed on 4 December 2022).
  14. Cascianelli, S.; Molineris, I.; Isella, C.; Masseroli, M.; Medico, E. Machine learning for RNA sequencing-based intrinsic subtyping of breast cancer. Sci. Rep. 2020, 10, 14071. [Google Scholar] [CrossRef] [PubMed]
  15. Wang, J.; Dai, X.; Luo, H.; Yan, C.; Zhang, G.; Luo, J. MI_DenseNetCAM: A Novel Pan-Cancer Classification and Prediction Method Based on Mutual Information and Deep Learning Model. Front. Genet. 2021, 12, 670232. [Google Scholar] [CrossRef] [PubMed]
  16. Li, Y.; Kang, K.; Krahn, J.M.; Croutwater, N.; Lee, K.; Umbach, D.M.; Li, L. A comprehensive genomic pan-cancer classification using The Cancer Genome Atlas gene expression data. BMC Genom. 2017, 18, 508. [Google Scholar] [CrossRef] [PubMed]
  17. Lyu, B.; Haque, A. Deep learning based tumor type classification using gene expression data. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Washington, DC, USA, 29 August–1 September 2018; pp. 89–96. [Google Scholar]
  18. Available online: https://datascience.cancer.gov/collaborations/joint-design-advanced-computing/cellular-pilot (accessed on 4 December 2022).
  19. Available online: https://modac.cancer.gov/searchTab?dme_data_id=NCI-DME-MS01-6996872 (accessed on 4 December 2022).
  20. Available online: https://github.com/CBIIT/NCI-DOE-Collab-Pilot1-Tumor_Classifier-hardening/blob/master/TC1-dataprep.ipynb (accessed on 4 December 2022).
  21. Zebari, R.; Abdulazeez, A.; Zeebaree, D.; Zebari, D.; Saeed, J. A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appl. Sci. Technol. Trends 2020, 1, 56–70. [Google Scholar] [CrossRef]
  22. Available online: https://pypi.org/project/factor-analyzer/ (accessed on 4 December 2022).
  23. Rahn, M. Factor Analysis: A Short Introduction, Part 5: Dropping Unimportant Variables from your Analysis. Anal. Factor 2014. Available online: https://www.theanalysisfactor.com/factor-analysis-5/ (accessed on 4 December 2022).
  24. Toth, G. Available online: https://www.datasklr.com/principal-component-analysis-and-factor-analysis/factor-analysis (accessed on 4 December 2022).
  25. Available online: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning (accessed on 4 December 2022).
  26. Available online: https://github.com/ECP-CANDLE/Benchmarks/tree/master/Pilot1/TC1 (accessed on 4 December 2022).
Figure 1. Histogram—gene feature column distribution of means.
Figure 2. Dimensionality reduction techniques.
Figure 3. Histogram—gene variance distribution.
Figure 4. Random forest—number of top gene features per tumor type.
Figure 5. Scree plot for a factor analysis of the RF top features dataset.
Figure 6. Ensemble model majority voting flow scheme.
Figure 7. Confusion matrix for the SVC classification model of factors—RF top features dataset.
Figure 8. Model accuracy chart.
Table 1. Dataset of RNA-Seq gene expression features.

Sample # | Tumor Code | Gene 1 | Gene 2 | Gene 3 | … | Gene 60,482 | Gene 60,483
0 | 29 | 13.4 | 0.0 | 17.4 | … | 13.8 | 0.0
1 | 5 | 9.4 | 0.0 | 15.5 | … | 12.8 | 0.0
2 | 29 | 0.0 | 0.0 | 17.5 | … | 15.4 | 0.0
… | … | … | … | … | … | … | …
5397 | 14 | 9.5 | 11.6 | 15.3 | … | 13.0 | 0.0
5398 | 11 | 0.0 | 0.0 | 16.0 | … | 13.5 | 0.0
5399 | 11 | 0.0 | 0.0 | 16.0 | … | 13.2 | 0.0
Table 2. Eighteen cancer tumor types.

Tumor Code | Tumor Cancer Type | Tumor Code | Tumor Cancer Type
1 | Leukemia-LAML | 16 | Liver-LIHC
3 | Bladder-BLCA | 17 | Lung-LUAD
4 | Brain-LGG | 18 | Lung-LUSC
5 | Breast-BRCA | 22 | Ovarian-OV
6 | Cervical-CESC | 25 | Prostate-PRAD
8 | Colon-COAD | 29 | Skin-SKCM
11 | Head/Neck-HNSC | 30 | Stomach-STAD
14 | Kidney-KIRC | 33 | Thyroid-THCA
15 | Kidney-KIRP | 35 | Uterine-UCEC
Table 3. Summary of gene datasets modeled.

# | Dataset | # Features
1 | Baseline dataset | 58,544
2 | High-Correlation-Filtered dataset | 46,688
3 | Low-Variance-Filtered dataset | 35,333
4 | Combined High Correlation/Low Variance dataset | 26,963
5 | |r| ≥ 0.5 Correlated dataset | 3928
6 | Top 250 Correlations per Tumor dataset | 4082
7 | RF Top Features dataset | 660
8 | Factors—Top 250 Correlations per Tumor dataset | 356
9 | Factors—Top RF Features dataset | 68
Table 4. Overall accuracy comparisons.

Related Work | Best Model and Dataset Type | Accuracy
Tumor Classifier Project (TC1) 2019 [26] | 1D Convolutional Neural Network Model; 5400 samples × 60,483 RNA-Seq features; 18 target tumor types | 98%
Bonat 2022 [13] | Light Gradient Boosting Machine Model; 1500 samples × 60,483 RNA-Seq features; 5 target tumor types | 99% (1)
Cascianelli et al. 2020 [14] | Logistic Regression Model; 4731 samples × 19,737 RNA-Seq features; 5 target breast tumor subtypes | 89% (2)
Wang et al. 2021 [15] | MI_DenseNetCAM Model (Deep Learning) (4); 10,267 samples × 20,531 RNA-Seq features, reduced to 10,267 × 3600; 33 target tumor types | 97%
Li et al. 2017 [16] | k-Nearest Neighbors Model; 9096 samples × 20,000 RNA-Seq features; 31 target tumor types | 92% (3)
Lyu et al. 2018 [17] | 2D Convolutional Neural Network Model; 10,267 samples × 20,531 RNA-Seq features; 33 target tumor types | 92% (3)
Our Top Model 2023 | Support Vector Machine; 5400 samples × 60,483 RNA-Seq features, reduced to 5400 × 68 factor features; 18 target tumor types | 99% (2)

(1) Synthetic Minority Oversampling Technique (SMOTE) applied to correct imbalanced classes in the dataset (original sample size = 801). (2) The breast cancer classification accuracy in our top model was 100%. (3) Comparable accuracy on the 18 tumors from our study = 97%. (4) Transformed RNA-Seq gene expressions into a 2D image for modeling.
Table 5. Individual tumor accuracy comparisons.

Code | Tumor | Wang [15] | Li [16] | Lyu [17] | Proposed Model
1 | Leukemia-LAML | 100% | 100% | 100% | 100%
3 | Bladder-BLCA | 98% | 91% | 97% | 97%
4 | Brain-LGG | 100% | 100% | 98% | 100%
5 | Breast-BRCA | 99% | 99% | 99% | 100%
6 | Cervical-CESC | 95% | 94% | 93% | 93%
8 | Colon-COAD | 95% | 99% | 95% | 100%
11 | Head/Neck-HNSC | 99% | 99% | 98% | 98%
14 | Kidney-KIRC | 94% | 96% | 95% | 95%
15 | Kidney-KIRP | 94% | 92% | 93% | 97%
16 | Liver-LIHC | 97% | 98% | 97% | 100%
17 | Lung-LUAD | 95% | 96% | 95% | 98%
18 | Lung-LUSC | 93% | 88% | 91% | 97%
22 | Ovarian-OV | 100% | 100% | 99% | 100%
25 | Prostate-PRAD | 99% | 100% | 100% | 100%
29 | Skin-SKCM | 98% | 97% | 98% | 100%
30 | Stomach-STAD | 96% | – | 96% | 100%
33 | Thyroid-THCA | 100% | 100% | 100% | 100%
35 | Uterine-UCEC | 95% | 96% | 96% | 100%

Li [16] excluded stomach cancer (STAD) from their dataset, hence the missing value.
