1. Introduction
Cancer is a common disease caused by abnormal changes to the genes that control cell division and growth. These changes include mutations of the DNA that makes up the genes. Cancer cells generally exhibit significantly more genetic changes than normal cells, although tumours show different specific combinations of genetic alterations in different people. Some of these changes may be a result of the cancer rather than its cause, and additional changes occur as the cancer grows [1]; that work also covers the identification of clinical problems, the supporting scientific data, and the application of the emerging subspecialty of neuro-oncology. Early detection of cancer improves the treatment possibilities and increases the survival rate of patients. Because every cancer type requires specific treatment, early diagnosis is essential for effective treatment, and developing methodologies that can effectively distinguish among tumour subtypes is vital.
According to the World Cancer Organization, approximately 4610 cases of brain and other central nervous system (CNS) tumours were expected to be diagnosed in 2018 in children under the age of 20 in the United States. After leukaemia, brain and other CNS tumours are the second most common type of cancer among children; the rate of such tumours has never exceeded 26% among children under the age of one year [1,2]. In 2019, 1,762,450 new cancer cases of all types were projected in the United States, and the number of associated deaths was estimated to be 606,880. Thus, it is important to develop a methodology for detecting cancer in its early stages, before the tumour worsens, thereby reducing the risk of death [3].
Conventional methods of diagnosing most diseases depend on human experience to recognize cases that match confirmed data patterns. However, this traditional approach is prone to human error and imprecise diagnosis and is both time-consuming and labour-intensive, causing undue stress throughout the process. As an alternative, computer-aided diagnosis (CAD) systems based on machine learning have been steadily improving and are employed to support specialists in making diagnostic decisions [3,4,5].
Most current CAD systems for medical diagnosis depend on diverse information, such as medical laboratory tests (e.g., blood tests and magnetic resonance imaging (MRI)), medical indicators (finger tremors and lung signs or symptoms), and various types of digital images (such as X-rays and ultrasound images). However, physical medical examinations pose a risk of transmitting infection through tools and other channels, such as scratching of the skin while taking a blood sample [6,7,8]. X-rays expose body cells to radiation, which can be harmful. The quality of ultrasound data depends on the accuracy and integrity of the image, which are affected by factors such as image blur and the presence of air between the skin surface and the probe. A system that depends on gene expression data collected using DNA microarrays can effectively address these problems. Such a method can be used to diagnose cancer in the early stages, unlike methods that rely on image processing techniques. The challenges that arise in microarray classification are mainly centred on dimensionality and classification accuracy [6,7].
Methodologies based on gene expression profiles have been able to detect cancer since their inception, and considerable effort has gone into improving their results. Researchers have achieved excellent results in the classification of cancer based on gene expression profiles using various gene selection approaches and classifiers [9].
Gene selection approaches fall into four groups: filter, wrapper, embedded and hybrid approaches, each with its own advantages and disadvantages. The filter approach is very fast and computationally simple, but it scores each feature separately and therefore does not consider the dependencies among features. The wrapper approach enables an exhaustive search to generate optimal solutions, but it has a higher risk of overfitting than filter techniques do. The embedded approach has the same benefits as the wrapper approach at a lower computational cost; however, it is still prone to overfitting. Hybrid approaches can combine the advantages of the other approaches, but the time complexity may increase [10,11].
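To make the filter idea concrete, the following minimal sketch scores each gene on its own using information gain and then ranks the genes. It is an illustration only: the function names, the toy data, and the choice of splitting each gene at its mean expression value are assumptions, not details taken from the studies cited here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Information gain of the labels after splitting one gene at `threshold`."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    if not low or not high:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    conditional = (len(low) / n) * entropy(low) + (len(high) / n) * entropy(high)
    return entropy(labels) - conditional


def rank_genes(samples, labels):
    """Score every gene independently (filter approach) and sort by score."""
    scores = []
    for g in range(len(samples[0])):
        column = [row[g] for row in samples]
        threshold = sum(column) / len(column)  # assumption: split at the mean
        scores.append((information_gain(column, labels, threshold), g))
    return sorted(scores, reverse=True)


# Hypothetical toy data: 6 samples (rows) x 3 genes (columns).
samples = [
    [0.1, 5.0, 1.0],
    [0.2, 1.0, 1.3],
    [0.3, 4.0, 0.7],
    [0.9, 2.0, 1.2],
    [0.8, 5.5, 0.6],
    [1.0, 1.5, 1.4],
]
labels = [0, 0, 0, 1, 1, 1]

ranking = rank_genes(samples, labels)
print(ranking[0])  # (score, gene index) of the top-ranked gene
```

Because each gene is scored in isolation, the loop is cheap, but two genes that are only informative in combination would both receive low scores, which is the limitation of filter methods noted above.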
This paper addresses the problem of medical diagnosis and presents an intelligent decision support system (IDSS) for CNS cancer classification based on gene expression profiles from DNA microarray datasets. DNA microarray technology has been applied efficiently to analyse gene expression in many experimental studies. Typically, the number of features (M) in a microarray dataset is very large (in the thousands), while the number of samples (N) is small (not exceeding hundreds) [12]. The proposed system combines information gain (IG), the grey wolf optimization (GWO) algorithm and the support vector machine (SVM) algorithm: IG selects important genes (features) from the input matrix, GWO further reduces the selected features, and an SVM classifier performs the cancer diagnosis.
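As a rough illustration of how GWO can reduce a feature set, the sketch below implements a small binary variant of the standard grey wolf update, in which each wolf moves towards the average of positions suggested by the three best wolves (alpha, beta and delta). Several details are assumptions made for the example and are not taken from this paper: positions are kept in [0, 1] and binarized with a 0.5 threshold, the parameter values are arbitrary, and a toy fitness function stands in for a classifier-based fitness such as SVM accuracy.

```python
import math
import random


def binary_gwo(fitness, n_features, n_wolves=8, n_iters=30, seed=0):
    """Minimal binary grey wolf optimizer that maximizes fitness(mask)."""
    rng = random.Random(seed)
    # Continuous positions in [0, 1]; a feature is kept when its coordinate > 0.5.
    wolves = [[rng.random() for _ in range(n_features)] for _ in range(n_wolves)]

    def to_mask(position):
        return [1 if x > 0.5 else 0 for x in position]

    best_mask, best_score = None, -math.inf
    for t in range(n_iters):
        ranked = sorted(wolves, key=lambda w: fitness(to_mask(w)), reverse=True)
        alpha, beta, delta = ranked[0], ranked[1], ranked[2]
        score = fitness(to_mask(alpha))
        if score > best_score:
            best_mask, best_score = to_mask(alpha), score
        a = 2.0 - 2.0 * t / n_iters  # decreases linearly from 2 towards 0
        new_wolves = []
        for w in wolves:
            position = []
            for d in range(n_features):
                estimate = 0.0
                for leader in (alpha, beta, delta):
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - w[d])
                    estimate += leader[d] - A * D
                # Average the three leader-guided estimates and clamp to [0, 1].
                position.append(min(1.0, max(0.0, estimate / 3.0)))
            new_wolves.append(position)
        wolves = new_wolves
    return best_mask, best_score


# Hypothetical stand-in fitness: reward two "useful" genes, penalize subset size.
informative = {0, 3}


def toy_fitness(mask):
    hits = sum(mask[i] for i in informative)
    return hits - 0.1 * sum(mask)


best_mask, best_score = binary_gwo(toy_fitness, n_features=6)
print(best_mask, best_score)
```

In a pipeline like the one described above, `toy_fitness` would be replaced by a function that trains and evaluates a classifier on the genes selected by the mask, so each fitness call becomes far more expensive than in this toy setting.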
The remainder of this paper is organized as follows. Section 2 reviews some important previous works. Section 3 describes the proposed methodology, including an overview of the IG filter approach for feature selection, the GWO algorithm for feature reduction and the SVM algorithm for classification. Section 4 describes the datasets and reports and analyses the results, and Section 5 presents the conclusions and possibilities for future work.
2. Related Works
In all the previous studies listed below, gene expression profiles were used for the classification of cancer based on various methodologies. These methodologies were applied to datasets with a small number of samples, a large number of features, and the additional characteristics listed in Table 1 below.
Salem et al. [10] reported research on human cancer classification using gene expression profiles. Their feature selection methodology used the IG for gene selection from the input microarray data and a genetic algorithm (GA) to reduce the number of features selected by the IG. The final task of cancer classification (or diagnosis) was accomplished by means of genetic programming (GP). The framework was verified on seven cancer gene expression datasets (Lung Cancer-Ontario, Leukaemia72, DLBCL Harvard, Prostate, Lung Cancer-Michigan, Colon, and Central Nervous System). The authors achieved classification accuracies of 85.48% (Colon), 86.67% (Central Nervous System), 97.06% (Leukaemia72), 74.4% (Lung Cancer-Ontario), 100% (Lung Cancer-Michigan), 94.8% (DLBCL Harvard) and 100% (Prostate).
Bennet et al. [12] proposed a hybrid gene selection technique: an ensemble that combines the support vector machine-recursive feature elimination (SVM-RFE) approach with the Based Bayes Error Filter (BBF) for attribute selection. These researchers employed SVM-RFE to rank the attributes and the BBF to remove redundant attributes from the ranked list. The SVM algorithm was then used for classification. The best classification accuracy on the Leukaemia72 dataset reached 97.2%.
The authors of [15] analysed the behaviour of a GA with k-nearest-neighbours (KNN) and SVM classifiers on ten datasets. The GA was used to reduce the number of features selected by three filters, and the KNN and SVM algorithms were then used for classification. With five-fold cross-validation, the SVM and KNN classifiers achieved the same classification accuracy on every dataset except Leukaemia72. The reported accuracies were as follows: Lung Cancer-Michigan: 100%, Ovarian: 100%, Central Nervous System: 81.25%, DLBCL Harvard: 100%, DLBCL Outcome: 77.27%, Prostate Outcome: 85.71%, Leukaemia72: 100% using SVM and 95.45% using KNN, Colon: 95%, Lung Harvard2: 100%, and Prostate: 92%.
In [14], an ensemble of five filters (IG, correlation-based feature selection (CFS), consistency-based, interaction, and ReliefF) and three classifiers (naïve Bayes, C4.5, and IB1) was proposed. The researchers used a simple voting scheme for classification. They applied their methodology to 10 microarray datasets with ten-fold cross-validation, and the best classification accuracies they obtained were 100% (Lung), 89.05% (Colon), 100% (Ovarian), 70% (Central Nervous System), 71.89% (Breast), 98.75% (Leukaemia72), 90.6% (Prostate), 68.42% (GCM), and 95.67% (Lymphoma).
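A simple voting scheme of the kind mentioned above can be sketched as follows. This is a generic majority vote, not the exact procedure of [14]: the function name, the tie-breaking rule and the example predictions are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine one list of labels per classifier into one label per sample."""
    combined = []
    for sample_votes in zip(*predictions):
        counts = Counter(sample_votes)
        top = max(counts.values())
        # Assumption: ties go to the earliest classifier whose vote is among the winners.
        winner = next(v for v in sample_votes if counts[v] == top)
        combined.append(winner)
    return combined


# Hypothetical predictions of three classifiers on three samples.
votes = [
    ["tumour", "normal", "normal"],
    ["tumour", "tumour", "normal"],
    ["normal", "tumour", "normal"],
]
result = majority_vote(votes)
print(result)
```

With an odd number of classifiers and two classes, ties cannot occur, so the tie-breaking rule only matters for multi-class data or an even number of voters.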
In [13], the researcher applied a GA for gene selection in combination with four classifiers for cancer classification using gene expression datasets. The classifiers used were naïve Bayes, SVM, oneR, and decision tree classifiers. The author analysed the results obtained by applying the methodology to six datasets, namely, Lymphoma, Lung, CNS, Colon, Leukaemia38, and Leukaemia72, on which the best classification accuracies were 97%, 99.4%, 82.3%, 88.8%, 100%, and 98.6%, respectively.
Salem et al. [11] presented research on the early classification of breast cancer based on gene expression profiles. Their system first extracts important genes from the input microarray data using the IG methodology and then exploits a GA to reduce the features selected in this way. The best results in this study were achieved with an IG threshold value of 0.7 for the breast cancer dataset; with this threshold, the features were initially reduced from 24,481 attributes to 45 attributes by the IG methodology and were further reduced to only 22 attributes by applying a GA with a population size of 100 and 20 rounds of evaluation. The classification accuracy reached 100%.
Bouazza et al. [16] presented research on cancer classification using SVM and KNN classifiers. In this research, the effects achieved on three gene expression profile datasets (Prostate, Colon, and Leukaemia) were studied using multiple techniques for attribute selection (such as Fisher, ReliefF, SNR, and T-Statistics) with both KNN and SVM classifiers. The best results were obtained by combining the SNR attribute selection technique with the SVM classifier. The best classification accuracies achieved in this study with the SNR feature selector and the KNN classifier were 95% for the Colon and Prostate datasets and 100% for the Leukaemia dataset.
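For reference, the signal-to-noise ratio (SNR) criterion is commonly defined for a two-class problem as the difference between the class means of a gene divided by the sum of the class standard deviations. The sketch below computes that score for a single gene; whether [16] uses exactly this form is an assumption, as are the function name, the use of the sample standard deviation and the toy values.

```python
from statistics import mean, stdev


def snr_score(values, labels, positive=1):
    """Signal-to-noise ratio of one gene: (mean_pos - mean_neg) / (sd_pos + sd_neg)."""
    pos = [x for x, y in zip(values, labels) if y == positive]
    neg = [x for x, y in zip(values, labels) if y != positive]
    # Assumption: sample standard deviation, so each class needs at least two samples.
    spread = stdev(pos) + stdev(neg)
    if spread == 0:
        return 0.0  # constant expression in both classes; no usable score
    return (mean(pos) - mean(neg)) / spread


# Hypothetical expression values of one gene across six samples.
expression = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
classes = [0, 0, 0, 1, 1, 1]
score = snr_score(expression, classes)
print(score)
```

Genes are then ranked by the absolute value of this score, so that strongly up-regulated and strongly down-regulated genes are both retained.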
5. Conclusions and Future Works
In this research, an enhanced IDSS is proposed based on IG feature selection, the GWO algorithm and SVM classification. The proposed system employs the IG method for initial feature selection, while GWO reduces the number of selected features to enable more accurate sample classification by the SVM. Two microarray datasets are used as benchmarks to evaluate the proposed methodology. The experimental results indicate that the proposed methodology enhances the stability of both the classification accuracy and the feature selection. The best results are obtained when the IG approach is combined with both the GWO and SVM algorithms; the classification accuracy reaches 94.87% for the breast cancer data and 95.935% for the colon cancer data. In future work, additional classifiers, such as decision trees, neural networks, and k-nearest neighbours, should be added to the system. In addition, the system could be tested on other benchmarks, especially binary-class datasets, and the reliability of diagnosis after repeated sampling of tissue from the same patient could be assessed.