*Article* **Leukemia Image Segmentation Using a Hybrid Histogram-Based Soft Covering Rough K-Means Clustering Algorithm**

### **Hannah Inbarani H. 1, Ahmad Taher Azar 2,3,\* and Jothi G <sup>4</sup>**


Received: 1 December 2019; Accepted: 1 January 2020; Published: 19 January 2020

**Abstract:** Segmenting an image of a nucleus is one of the most essential tasks in a leukemia diagnostic system. Accurate and rapid segmentation methods help the physicians identify the diseases and provide better treatment at the appropriate time. Recently, hybrid clustering algorithms have started being widely used for image segmentation in medical image processing. In this article, a novel hybrid histogram-based soft covering rough k-means clustering (HSCRKM) algorithm for leukemia nucleus image segmentation is discussed. This algorithm combines the strengths of a soft covering rough set and rough k-means clustering. The histogram method was utilized to identify the number of clusters to avoid random initialization. Different types of features such as gray level co-occurrence matrix (GLCM), color, and shape-based features were extracted from the segmented image of the nucleus. Machine learning prediction algorithms were applied to classify the cancerous and non-cancerous cells. The proposed strategy is compared with an existing clustering algorithm, and the efficiency is evaluated based on the prediction metrics. The experimental results show that the HSCRKM method efficiently segments the nucleus, and it is also inferred that logistic regression and neural network perform better than other prediction algorithms.

**Keywords:** leukemia nucleus image; segmentation; soft covering rough set; clustering; machine learning algorithm; soft computing

### **1. Introduction**

Due to the growth of advanced medical imaging modalities, it is very difficult to analyze medical images manually. For this reason, an advanced and efficient computer-aided system is needed to diagnose diseases. Such a system helps the hematologist begin treatment at the right time and increases the patient's survival rate. Leukemia is a cancer of blood-forming tissues that affects the bone marrow. It is caused by the proliferation of abnormal white blood cells in the body. Leukemia mostly affects people living in developed countries and children aged 14 or under. As per the National Cancer Institute (NCI) statistics, in the United States, it is expected that there will be 62,130 new cases requiring cancer treatment and 245,000 cases that are fatal or very serious [1]. In India, leukemia ranks ninth among diseases (tumors) in children [2,3]. Leukemia is divided into two broad categories: acute and chronic. Acute forms of leukemia occur when the number of immature blood cells increases, and they are the most common type of leukemia in children. Segmenting an image of a nucleus is one of the most challenging tasks in leukemia diagnosis. Recently, soft computing has played an important role in many research areas such as medical image processing, pattern recognition, big data analytics, Internet of Things (IoT) analysis, bioinformatics, and so on.

Rough set theory [4] was proposed by Pawlak in 1982. This concept is an extension of set theory for the study of intelligent systems characterized by insufficient and incomplete information. Classical rough set theory is based on equivalence relations, but it can also be extended to covering-based rough sets [5–7]. In 1999, Molodtsov [8] proposed the concept of a soft set, which can be seen as a new mathematical approach to vagueness. The absence of any restrictions on the approximate description in soft set theory makes this theory very versatile and easily applicable in practice. Maji et al. [9] improved Molodtsov's idea by introducing several operations in soft set theory. In [10], the authors investigated a soft covering-based rough set as a new kind of soft rough set. This method is a combination of a covering soft set and a rough set. In [11], a covering-based rough k-means clustering approach is applied to segment the leukemia nucleus. The advantage of covering-based subsets is that they generate upper and lower approximations by using the covering feature, which brings about more roughness. Since different numbers of clusters give rise to different results, determining the number of clusters is a difficult task in clustering-based segmentation. To overcome this limitation, the hybrid histogram-based soft covering rough k-means clustering algorithm (HSCRKM) is introduced to segment the image of the leukemia nucleus. In this algorithm, the peak values of the histogram of an image are identified and used to initialize the number of clusters. This avoids the random initialization of the number of clusters. Here, the soft covering approximation space is also included. The covering soft set approach is more accurate than the soft rough set approach. The algorithm also combines the strengths of covering soft set theory and the rough k-means clustering algorithm to effectively segment the image of the nucleus.
Soft covering rough approximation is utilized to find the lower and upper approximation values. The performance of the HSCRKM algorithm is evaluated against existing algorithms such as k-means clustering, fuzzy c-means clustering, and particle swarm optimization (PSO)-based clustering. Different types of features such as GLCM-0, GLCM-45, GLCM-90, GLCM-135, and shape and color-based features are extracted from the segmented leukemia nucleus image. Many machine learning algorithms are now applied to predict the degree of sickness. State-of-the-art machine learning prediction algorithms such as neural networks (NN) [12], logistic regression (LR) [13], support vector machine (SVM) [14], naive Bayes (NB) [15], k-nearest neighborhood (KNN) [13], decision tree (DT) [13], and random forest (RF) [16] are applied to classify the cancerous and non-cancerous leukemia cells. The empirical results show that logistic regression and neural network predict the blast and non-blast cells more accurately than the other prediction algorithms.

The main objective of this research work is to develop a diagnostic approach for the identification of acute lymphoblastic leukemia blast cells using image processing and computational intelligence techniques. In experimental analysis, relevant image processing and computational intelligence techniques are applied in order to select the most suitable approach for the delineation of acute lymphoblastic leukemia cells. The following objectives have been formulated in order to predict leukemia: to apply computational intelligence-based algorithms for the segmentation of acute lymphoblastic leukemia blast cells in images and to apply machine learning algorithms to evaluate the performance of the proposed method.

The contributions of this study are summarized as follows. The number of clusters is found using the peak values of the image histogram, and the lower and upper approximation values are computed based on the soft covering approximation space. Three clustering methods (k-means, FCM, and PSO-based clustering) are used for segmentation comparison. For each method, different kinds of features are extracted, and the efficiency of the proposed algorithm is assessed using machine learning prediction algorithms. The HSCRKM achieves an overall accuracy above 80%, which is higher than that of the existing clustering algorithms. Therefore, it can be concluded that the HSCRKM clustering algorithm works effectively compared with the other clustering algorithms.

In the clustering algorithm, defining the number of clusters is a very difficult task. To overcome this limitation, the proposed algorithm identifies the peak values of the histogram of an image and initializes the number of clusters. This is one of the advantages of our proposed method, which avoids the random initialization of a number of clusters. The next advantage of the HSCRKM algorithm is that it combines the strengths of covering soft set theory and the rough k-means clustering algorithm to effectively segment the image of the nucleus. Based on a literature review, the term 'covering soft set' is more accurate than 'soft rough set', since it gives a better result than the soft rough set for several applications. In covering rough sets, the lower and upper approximation values are computed based on the soft covering approximation space.

Morphologically, a lymphoblast consists of a massive nucleus of irregular shape and size. In blood sample images, it is difficult to identify the cytoplasm, because it appears rarely and even if it does, it looks intensely colored. The nucleus and cytoplasm of lymphoblast cells reflect the morphological and functional changes. Feature extraction plays a main role in the assessment of leukemia in blood samples. After segmenting the nucleus using the proposed HSCRKM algorithm, salient features are extracted. It reduces the amount of data space and the working time of an image. In this research, different kinds of features are extracted such as gray level co-occurrence matrix (GLCM), color, and shape-based features. These were measured from every channel of the segmented nucleus image. The efficiency of the proposed algorithm is assessed using machine learning prediction algorithms. The performance of the segmentation algorithms was analyzed in the light of different machine learning (ML) prediction methods. With respect to HSCRKM clustering algorithms, most of the ML methods (except naive Bayes) achieved greater than 80% prediction accuracy compared with the existing clustering algorithms, viz., k-means clustering, fuzzy c-means clustering, and rough k-means clustering. It is inferred that the proposed clustering algorithms are more effective in segmenting the nucleus image. Due to the effective segmentation process, the extracted features have increased the prediction accuracy. To evaluate the experimental results, we have empirically set the best accuracy to be greater than 80%. The outline of the proposed system is shown in Figure 1.

**Figure 1.** Outline of the proposed image segmentation process.

The rest of the research report is organized as follows. Section 2 reviews the related literature on clustering-based segmentation algorithms. Section 3 describes the methods of the proposed algorithm and its results. The empirical results are discussed in Section 4. Section 5 states the conclusion and indicates the future direction of this research.

#### **2. Related Literature**

In recent years, a lot of clustering algorithms have been developed for segmenting medical images. Petal [17] applied k-means clustering for segmentation and the Zack algorithm for clustered white blood cells (WBCs). The features—namely, the mean, standard deviation, area, elongation, perimeter, color etc.—are extracted, and support vector machine (SVM) was used to classify the cells. The proposed algorithm effectively segmented the WBCs, which produced 93.57% accuracy. For this experiment, 27 images from the Acute Lymphoblastic Leukemia Image Database (ALL-IDB) were utilized.

Two bare-bones particle swarm optimization (BBPSO) algorithms with and without subswarms were introduced by Srisukkham et al. in 2017 [18] to diagnose leukemia cells. A stimulating discriminant measure (SDM)-based clustering algorithm combined with the genetic algorithm (GA) was employed to segment the nucleus, cytoplasm, and background regions. The relevant features were extracted; then, various feature selection methods such as particle swarm optimization (PSO), cuckoo search (CS), and dragonfly algorithm (DA) were applied to select the optimal features and reduce the dimensions. An average geometric mean was computed with different sizes of training and test samples to evaluate the performance of the proposed methods. The BBPSO and binary BBPSO algorithms produced geometric mean values of 91% to 96%.

Su [19] developed two stages of segmentation process using k-means clustering and HMRF (hidden Markov random field), which are used to group the six different types of AML cells from the bone marrow images. The segmentation algorithm achieved an accuracy of 96% to 98% (average) when compared with other existing segmentation methods.

In [20], k-means and fuzzy c-means clustering algorithms were applied to segment the brain tumor images. Various feature reduction algorithms, namely probabilistic principal component analysis (PPCA), expectation maximization-based principal component analysis (EM-PCA), the generalized Hebbian algorithm (GHA), and adaptive principal component extraction (APEX) were employed to reduce the dimensions of the feature set. The produced coefficient of variance (CV) values for k-means and Fuzzy C-mean (FCM) are 0.4582 and 0.1224, respectively.

In [21], potential field segmentation was employed to segment the MRI brain tumor images. This method achieved the standard deviation of 0.283, the average value of 0.517, and the median values of 0.644. From the experimental results, it was observed that ensemble methods generated better segmentations.

Küçükkülahlı [22] and Namburu [23] identified the number of clusters in the clustering algorithm using the peak values of the histogram of an image. In [22], an automatic segmentation method using the histogram-based k-means clustering algorithm was developed. In [23], the soft fuzzy rough c-means clustering algorithm (SFRCM) was used to segment the MRI brain tumor images. The proposed SFRCM algorithm achieved a better Jaccard coefficient value (0.97 without noise and 0.79 with 9% Gaussian noise) than the existing clustering algorithms, namely k-means, rough k-means (RKM), rough fuzzy c-means (RFCM), and generalized rough c-means (GFCM).

Ali [24] introduced a new clustering algorithm based on neutrosophic orthogonal matrices (CANOM) to segment the dental X-ray images. The experimental results show that the CANOM simplified silhouette width criterion (SSWC) index is 0.941 and the FCM is 0.02. CANOM is also better than Otsu and eSFCM with the values being 0.657 and 0.647, respectively. The value of CANOM is 47 times larger than that of FCM and 1.43 times larger than those of Otsu and eSFCM.

In [25], the unsupervised fuzzy c-means (FCM) clustering technique was employed for prostate cancer MRI images. The derived average Dice similarity, Jaccard index, sensitivity, specificity, mean absolute difference, and Hausdorff distance are 88.68, 81.26, 90.71, 88.09, 88.09, 3.5, and 4.1, respectively.

In [26], the proposed multi-Otsu thresholding-based segmentation method successfully segmented the CT image stacks. In addition, it shows the distribution characteristics of different components in three dimensions.

In [27], the enhanced adaptive fuzzy k-means (AFKM) algorithm was used to detect the three regions such as white matter (WM), gray matter (GM), and cerebrospinal fluid spaces (CSF) in the brain images. AFKM performed better than FCM, which produced a minimum mean square error (MSE) value of 2.2441.

In [28], the intuitionistic fuzzy c-means (IFCM) clustering method was applied for medical image segmentation. It is observed from the experimental results that the proposed method outperformed other algorithms, achieving an average quantitative index of 0.95. The chronic wound region was detected using fuzzy spectral clustering in [29]. The proposed method produced 91.5% segmentation accuracy, an 86.7% Dice index, a Jaccard score of 79.0%, 87.3% sensitivity, and 95.7% specificity.

In [30], the convolutional neural networks (CNN) approach is applied to identify the subtypes of leukemia. It is observed from the experimental results that the CNN model achieves 88.25% and 81.74% accuracy for leukemia and healthy cells, respectively. From the literature review, it is inferred that the clustering-based algorithms were applied to segment the tumor region. A brief review of the literature on various clustering methods in image segmentation and their performances appears in Table 1.


**Table 1.** Overview of the literature on clustering algorithms.

#### **3. Methods**

#### *3.1. Basics of Soft Covering Based Rough Set*

This section describes the basic properties of soft covering-based rough approximation [11].

**Definition 1.** *Let* $C_G = (F, A)$ *be a covering soft set over* $U$ *if* $F(a) \neq \emptyset$, $\forall a \in A$*. The pair* $S = (U, C_G)$ *is known as a soft covering approximation space. For a set* $X \subseteq U$*, the soft covering lower and upper approximations are, respectively, defined as*

$$\underline{S}_{*}(X) = \bigcup_{a \in A} \{ F(a) : F(a) \subseteq X \} \tag{1}$$

$$\overline{S}^{*}(X) = \bigcup \{ Md_{S}(x) : x \in X \}. \tag{2}$$

*In addition,*

$$S_{pos}(X) = \underline{S}_{*}(X) \tag{3}$$

$$S_{neg}(X) = U - \overline{S}^{*}(X) \tag{4}$$

$$S_{bnd}(X) = \overline{S}^{*}(X) - \underline{S}_{*}(X) \tag{5}$$

*are called the soft covering positive, negative, and boundary regions of X, respectively [11].*

**Definition 2.** *Let* $S = (U, C_G)$ *be a soft covering approximation space. If* $\overline{S}^{*}(X) = \underline{S}_{*}(X)$*, then the subset* $X \subseteq U$ *is called soft covering definable.* $X$ *is said to be a soft covering based rough set if* $\overline{S}^{*}(X) \neq \underline{S}_{*}(X)$*.*

The soft covering based rough set can be applied to image segmentation with the following considerations.


$$\begin{array}{l} F(Cl_{G1}) = \{ x_{2}, x_{3}, x_{4} \} \\ F(Cl_{G2}) = \{ x_{1}, x_{4} \} \\ F(Cl_{G3}) = \{ x_{1}, x_{3} \} \end{array}$$

Let (*F*, *A*) be represented as (*F*, *A*) = {*F*(*ClG*) | *ClG* ∈ *A*}. The soft covering based rough set representation of the above example is given by

$$(F, A) = \left\{ \begin{array}{l} Cl_{G1} = \langle x_{2}, x_{3}, x_{4} \rangle \\ Cl_{G2} = \langle x_{1}, x_{4} \rangle \\ Cl_{G3} = \langle x_{1}, x_{3} \rangle \end{array} \right\}.$$

A tabular presentation of the soft set appears in Table 2. If *xi* ∈ *F*(*ClGj*), then the entry for *xi* and *ClGj* is one; otherwise, it is zero.


**Table 2.** Soft covering-based rough set representation of an image.
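
The example above can be worked through in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the binary table follows the rule stated for Table 2, and Equations (1) to (5) are applied to a hypothetical target set X = {x1, x4}. The source does not define the minimal description *Md*; the function below assumes the usual covering rough set meaning (the blocks containing x that have no smaller block containing x), and the universe U is assumed to be the union of the three blocks.

```python
# Universe of pixels and the covering soft set (F, A) from the example.
# U is assumed to be the union of the three blocks.
U = {"x1", "x2", "x3", "x4"}
F = {
    "Cl_G1": {"x2", "x3", "x4"},
    "Cl_G2": {"x1", "x4"},
    "Cl_G3": {"x1", "x3"},
}


def membership_table(F, U):
    """Binary table: 1 if pixel x belongs to F(a), else 0."""
    return {a: {x: int(x in block) for x in sorted(U)} for a, block in F.items()}


def lower_approximation(F, X):
    """Equation (1): union of all blocks F(a) fully contained in X."""
    result = set()
    for block in F.values():
        if block <= X:
            result |= block
    return result


def minimal_description(F, x):
    """Assumed definition: blocks containing x with no smaller block containing x."""
    containing = [block for block in F.values() if x in block]
    return [k for k in containing if not any(s < k for s in containing)]


def upper_approximation(F, X):
    """Equation (2): union of the minimal descriptions of every x in X."""
    result = set()
    for x in X:
        for block in minimal_description(F, x):
            result |= block
    return result


X = {"x1", "x4"}              # hypothetical target set
lower = lower_approximation(F, X)
upper = upper_approximation(F, X)
positive = lower              # Equation (3)
negative = U - upper          # Equation (4)
boundary = upper - lower      # Equation (5)

for name, row in membership_table(F, U).items():
    print(name, row)
print("lower:", sorted(lower))
print("upper:", sorted(upper))
print("boundary:", sorted(boundary))
```

For this X, only the block of Cl_G2 lies inside X, so the lower approximation equals X, while the minimal descriptions of x1 and x4 together cover every pixel, leaving x2 and x3 in the boundary region.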

#### *3.2. The Proposed Histogram-Based Soft Covering Rough K-Means Clustering*

The proposed histogram-based soft covering rough k-means clustering is summarized in Algorithm 1. The combination of the covering soft set and the rough set gives rise to a new kind of soft rough set. Based on covering soft sets, soft covering rough approximation was proposed by Yüksel et al. in 2014 [11,31] and is more accurate than the soft rough set. Here, we establish a rough k-means clustering using soft covering-based rough approximation to segment the image of the leukemia nucleus. Let $\underline{S}_{*}(X)$ and $\overline{S}^{*}(X)$ denote the soft covering lower and upper approximations, with $\underline{S}_{*}(X) \subseteq \overline{S}^{*}(X)$; i.e., in soft covering-based rough k-means clustering, the lower approximation is a subset of the upper approximation. The pixel data *Xn* = (*x*1, *x*2, ..., *xn*) in the lower approximation surely belong to the cluster; thus, they cannot belong to any other cluster. The pixel data *Xn* = (*x*1, *x*2, ..., *xn*) in an upper approximation may belong to the cluster. Since their membership is uncertain, they must also be members of the upper approximation of at least one other cluster. The distance between the pixel data *Xn* and the mean *smk* is defined as [32]

$$d(X_n, sm_k) = \|X_n - sm_k\|. \tag{6}$$

The cluster center *smk* i.e., the mean, is computed using the following equation:

$$sm_k = \begin{cases} w_{low} \sum\limits_{X_n \in \underline{S}_{*k}} \dfrac{X_n}{\left| \underline{S}_{*k} \right|} + w_{upp} \sum\limits_{X_n \in \overline{S}^{*}_{k}} \dfrac{X_n}{\left| \overline{S}^{*}_{k} \right|} & \text{for } \underline{S}_{*k} \neq \emptyset \\ \sum\limits_{X_n \in \overline{S}^{*}_{k}} \dfrac{X_n}{\left| \overline{S}^{*}_{k} \right|} & \text{otherwise} \end{cases} \tag{7}$$

where $|\underline{S}_{*k}|$ indicates the number of pixels in the lower approximation of the cluster *k* and $|\overline{S}^{*}_{k}|$ is the number of pixels in the upper approximation of the cluster *k*. The weight parameters *wlow* and *wupp* stress the significance of the lower and upper approximation of the cluster.

Explanation: The algorithm first identifies the peak values of the image histogram and uses them to define the number of clusters *k*. Initially, each pixel *Xn* = (*x*1, *x*2, ..., *xn*) is assigned to exactly one lower approximation. Here, soft covering-based rough approximation is applied instead of rough approximation. The new means *smk* are determined using Equation (7). The distance between each pixel *Xn* and each centroid *smk*, i.e., *d*(*Xn*, *smk*), is computed using Equation (6), and each pixel is assigned to the upper approximation of its closest cluster *h*. For each pixel, the relative distance (RD) to every other cluster *k* is then computed. If RD is less than or equal to the threshold δ, the pixel is also put into the upper approximation of the cluster *k*; otherwise, it is put into the lower approximation of the cluster *h*. These steps are repeated until the assignment of pixels to the clusters remains unchanged. Finally, the clustered image is labeled by the cluster index, and the segmented image of the nucleus is extracted.

**Algorithm 1:** *Histogram-Based Soft Covering Rough K-Means Clustering Algorithm*

*Input:* *Img*(*Xn*), *k*, *wlow*, *wupp*, δ

*Output:* Segmented nucleus image *Segneu*

*Initialization:* *Xn* = (*x*1, *x*2, ..., *xn*), where *n* is the number of pixels in the image; *K* = hist(*Img*(*Xn*)), the number of clusters found using the peak values of the image histogram; *wlow* = lower approximation weight; *wupp* = upper approximation weight; δ = threshold value. Randomly assign each pixel to exactly one lower approximation.

*Procedure:*

**Step 1:** Randomly assign each pixel's data to the soft covering approximations.

**Step 2:** Compute the cluster centers *smk* using Equation (7).

**Step 3:** Assign the pixels to the approximations. For the pixel data *Xn*, determine its closest mean *smh*: $d^{min}_{n,h} = d(X_n, sm_h) = \min_{k=1,2,\ldots,K} d(X_n, sm_k)$. Assign *Xn* to the upper approximation of the cluster *h*: $X_n \in \overline{S}^{*}_{h}$.

**Step 4:** The relative distance is defined as *RD* = *d*(*Xn*, *smk*) − *d*(*Xn*, *smh*). Let $T = \{ t : d(X_n, sm_t) - d(X_n, sm_h) \leq \delta \wedge t \neq h \}$. If $T \neq \emptyset$, then $X_n \in \overline{S}^{*}_{t}$, $\forall t \in T$. Else, $X_n \in \underline{S}_{*h}$.

**Step 5:** Check the convergence of the algorithm; if it has not converged, continue with Step 2.

**Step 6:** Label the image by cluster index and extract the leukemia nucleus (*Segneu*).
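
The iteration in Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under several assumptions, not the authors' code: pixels are treated as 1-D intensity values, the approximations are the usual rough k-means lower and upper sets (the soft covering construction is not reproduced), the defaults for `w_low`, `w_upp`, and `delta` reuse the values 0.7, 0.3, and 0.5 reported in Section 3.3, and the `labels` argument and the re-seeding of an empty cluster are hypothetical conveniences.

```python
import random


def rough_kmeans(pixels, k, w_low=0.7, w_upp=0.3, delta=0.5,
                 labels=None, max_iter=100, seed=0):
    """Rough k-means on 1-D pixel values.

    Returns (means, lower, upper); lower[c] and upper[c] hold pixel indices.
    `labels` is an optional initial cluster index per pixel (hypothetical
    convenience for reproducible runs); otherwise Step 1 is random.
    """
    rng = random.Random(seed)
    if labels is None:
        labels = [rng.randrange(k) for _ in pixels]
    # Step 1: each pixel starts in exactly one lower approximation.
    lower = [set() for _ in range(k)]
    for n, c in enumerate(labels):
        lower[c].add(n)
    upper = [set(s) for s in lower]
    means = [0.0] * k

    def mean_of(indices):
        return sum(pixels[n] for n in indices) / len(indices)

    for _ in range(max_iter):
        # Step 2: cluster centers, Equation (7).
        for c in range(k):
            if not upper[c]:
                means[c] = rng.choice(pixels)  # assumed fix-up for an empty cluster
            elif lower[c]:
                means[c] = w_low * mean_of(lower[c]) + w_upp * mean_of(upper[c])
            else:
                means[c] = mean_of(upper[c])
        new_lower = [set() for _ in range(k)]
        new_upper = [set() for _ in range(k)]
        for n, x in enumerate(pixels):
            dists = [abs(x - m) for m in means]        # Equation (6)
            h = min(range(k), key=dists.__getitem__)   # Step 3: closest mean
            new_upper[h].add(n)
            # Step 4: clusters almost as close as h share the pixel.
            T = [t for t in range(k) if t != h and dists[t] - dists[h] <= delta]
            if T:
                for t in T:
                    new_upper[t].add(n)
            else:
                new_lower[h].add(n)
        # Step 5: stop once the approximations no longer change.
        if new_lower == lower and new_upper == upper:
            break
        lower, upper = new_lower, new_upper
    return means, lower, upper


pixels = [0, 1, 2, 100, 101, 102]
means, lower, upper = rough_kmeans(pixels, k=2, labels=[0, 1, 0, 1, 0, 1])
print(means, lower, upper)
```

With two well-separated intensity groups, every pixel ends up in the lower approximation of one cluster; a pixel whose distances to two means differ by at most δ would instead appear in both upper approximations and in neither lower approximation.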

#### *3.3. Performance Assessment for Segmentation Algorithms*

After preprocessing, the novel HSCRKM algorithm is applied for leukemia nucleus image segmentation. The peak values of the histogram are identified, and the number of peaks is automatically assigned as the number of clusters (K). In each iteration, the k value will change. The weights of the lower and boundary regions in rough k-means algorithms lie in the range 0.0 ≤ *wlow*, *wbon* ≤ 1.0. The relative threshold in the HSCRKM algorithm is defined as δ ≤ 1.0. The parameter values are assigned as *wlow* = 0.7, *wbon* = 0.3, and δ = 0.5. These values tend to give stable results in rough k-means [30]. Figure 2 illustrates the segmentation results produced by the proposed HSCRKM algorithm.
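
One possible way to turn histogram peaks into a cluster count is sketched below. The source does not state how peaks are detected, so the moving-average smoothing, the window size, and the minimum peak height (`min_fraction`) are all assumptions made for illustration; pixel values are assumed to be integers in the range 0 to 255.

```python
def histogram(pixels, bins=256):
    """Count how many pixels fall on each integer intensity level."""
    counts = [0] * bins
    for p in pixels:
        counts[p] += 1
    return counts


def smooth(counts, window=5):
    """Moving average (assumed smoothing step) to suppress spurious peaks."""
    half = window // 2
    out = []
    for i in range(len(counts)):
        lo, hi = max(0, i - half), min(len(counts), i + half + 1)
        out.append(sum(counts[lo:hi]) / (hi - lo))
    return out


def count_peaks(counts, min_fraction=0.05):
    """Count local maxima at least min_fraction as tall as the highest bin.

    A flat run of equal values counts as a single peak if both neighbours
    of the run are lower.
    """
    top = max(counts)
    if top == 0:
        return 0
    floor = min_fraction * top
    peaks, i, n = 0, 0, len(counts)
    while i < n:
        j = i
        while j + 1 < n and counts[j + 1] == counts[i]:
            j += 1                                   # extend over a flat run
        left = counts[i - 1] if i > 0 else float("-inf")
        right = counts[j + 1] if j + 1 < n else float("-inf")
        if counts[i] >= floor and counts[i] > left and counts[i] > right:
            peaks += 1
        i = j + 1
    return peaks


def estimate_k(pixels, window=5):
    """Number of clusters K taken as the number of histogram peaks."""
    return count_peaks(smooth(histogram(pixels), window))


# Synthetic image with two intensity modes, around 50 and around 200.
pixels = [49] * 60 + [50] * 100 + [51] * 60 + [199] * 40 + [200] * 80 + [201] * 40
print("K =", estimate_k(pixels))
```

Because real histograms are noisy, the choice of smoothing window and height threshold strongly affects the resulting K, so these would need tuning for actual blood smear images.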

**Figure 2.** Segmentation results produced by the proposed histogram-based soft covering rough k-means clustering (HSCRKM) algorithm.

In Figure 2, the first column displays the original image, the second column shows the histogram of an image that helps find the number of clusters (K), the third column displays the clustered image, and the last column displays the extracted nucleus. It is observed that if the k value is at its minimum, we get a better segmentation result. This helps reduce the processing time. The parameters utilized in the clustering algorithms are presented in Figure 3.

**Figure 3.** Parameters utilized in clustering algorithms.

Figure 4 shows the sample output of leukemia image segmentation using existing clustering algorithms such as k-means clustering, FCM clustering, and PSO-based clustering algorithms. Here, the number of clusters k is assigned as three using the elbow method.

**Figure 4.** Segmentation results by k-means, FCM, and particle swarm optimization (PSO) algorithms.

#### **4. Results and Discussion**

#### *4.1. Dataset*

The Acute Lymphoblastic Leukemia Image Database (ALL-IDB) datasets were used for this experiment. These data were downloaded from the website www.dti.unimi.it/fscotti/all/ [33–36]. A total of 368 images (175 benign and 193 malignant) were taken for this experimental analysis. Images from digital microscopes are usually produced in the RGB color space, which is not well suited to segmentation. Therefore, in the preprocessing step, all the RGB input images are converted into the LAB color space.
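
The color conversion in the preprocessing step can be illustrated per pixel as follows. The source does not say which conversion routine was used; this sketch assumes 8-bit sRGB input and the standard CIELAB definition with a D65 reference white, so the constants below are assumptions rather than the authors' settings.

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELAB (D65 white point assumed)."""

    def linearize(c):
        # Undo the sRGB gamma curve.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)

    # Linear sRGB -> CIE XYZ.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white.
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        eps = 216 / 24389      # (6/29)^3
        kappa = 24389 / 27     # (29/3)^3
        return t ** (1 / 3) if t > eps else (kappa * t + 16) / 116

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    lightness = 116 * fy - 16
    a = 500 * (fx - fy)
    b_star = 200 * (fy - fz)
    return lightness, a, b_star


def image_to_lab(image):
    """Apply the conversion to an image given as rows of (r, g, b) tuples."""
    return [[srgb_to_lab(*px) for px in row] for row in image]


print(srgb_to_lab(255, 255, 255))
print(srgb_to_lab(255, 0, 0))
```

In LAB, lightness is separated from the two chromatic channels, which is one common reason for preferring it over RGB when clustering stained cell images.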

#### *4.2. Feature Extraction*

The segmented image data were too large, and it was very difficult to process them. Feature extraction is a technique to extract the relevant informative data of a segmented image. This reduces the processing time and the dimensionality of an image. In this research, 21 shape and color-based features (namely, the area, perimeter, roundness, elongation, form\_factor, length\_to\_diameter\_ratio, compactness, discrete\_fourier\_transform, mean\_of\_harra\_coefficient, h\_coefficient, v\_coefficient, variance\_of\_harra\_coefficient, h\_coefficient, v\_coefficient, mean\_colour\_intensity for red, green, and blue, hue, saturation, value component, and class attribute) were extracted [37]. Twenty-three texture-based features (namely, angular\_second\_moment, entropy, dissimilarity, contrast, inverse\_difference, correlation, homogeneity, autocorrelation, cluster\_shade, cluster\_prominence, maximum\_probability, sum\_of\_squares, sum\_average, sum\_variance, sum\_entropy, difference\_variance, difference\_entropy, information\_measures\_correlation1, information\_measures\_correlation2, maximal\_correlation\_coefficient, inverse\_difference\_normalized, inverse\_difference\_moment\_normalized, and class attribute) were extracted. These features are derived from the gray level co-occurrence matrix (GLCM) in directions 0°, 45°, 90°, and 135° [38,39]. From the literature review, we found that these features are widely used for leukemia image analysis.
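
To make the GLCM-based features concrete, the sketch below builds a co-occurrence matrix for one direction and derives a handful of the listed texture features. It is illustrative only: the pixel distance of 1, the symmetric counting, the direction-to-offset mapping, and the exact feature formulas (which vary between references) are assumptions, and the image is assumed to be already quantized to `levels` gray levels and to be at least 2 × 2 pixels.

```python
import math

# (row, column) offsets for the four directions at distance 1 (assumed convention).
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, angle, levels):
    """Normalized, symmetric gray level co-occurrence matrix for one direction."""
    dr, dc = OFFSETS[angle]
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1   # count each pair in both orders
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """A few texture features computed from a normalized GLCM p."""
    feats = {
        "angular_second_moment": 0.0,
        "contrast": 0.0,
        "dissimilarity": 0.0,
        "homogeneity": 0.0,
        "entropy": 0.0,
    }
    for i, row in enumerate(p):
        for j, v in enumerate(row):
            feats["angular_second_moment"] += v * v
            feats["contrast"] += (i - j) ** 2 * v
            feats["dissimilarity"] += abs(i - j) * v
            feats["homogeneity"] += v / (1 + (i - j) ** 2)
            if v > 0:
                feats["entropy"] -= v * math.log(v)
    return feats


# Small 4-level test image.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
for angle in (0, 45, 90, 135):
    print(angle, texture_features(glcm(image, angle, levels=4)))
```

Running the feature function once per direction yields one feature vector for each of the four GLCM datasets (0°, 45°, 90°, and 135°) referred to in the text.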

#### *4.3. Performance Assessments of Segmentation Algorithms*

The empirical results are interpreted in two ways. First, we analyze the efficiency of various clustering-based segmentation algorithms through state-of-the-art machine learning algorithms. Secondly, we compare the machine learning methods using evaluation measures such as receiver operating characteristic (ROC) curve analysis and kappa statistics. The extracted feature set was fed into the machine learning (ML) prediction algorithms to classify the segments indicating the tumor and non-tumor leukemia in the image. In this experiment, seven ML algorithms (namely, logistic regression (LR), naive Bayes (NB), support vector machine (SVM), k-nearest neighborhood (KNN), neural network (NN), random forest (RF), and decision tree (DT)) were used to evaluate the performance of the clustering algorithms.

The performance of the machine learning prediction algorithm was analyzed using various evaluation metrics such as accuracy (A), precision (P), recall (R), F1 measure, area under the ROC Curve (AUC), mean absolute error (MAE), and coefficient of determination (R2) [40,41]. It is noted that the prediction value of R2 lies between 0 and 1 for no-fit and perfect fit, respectively.
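
For reference, most of these metrics can be computed directly from the true and predicted labels, as in the sketch below. It assumes binary labels coded as 1 (blast) and 0 (non-blast) and uses the standard textbook definitions; the AUC is omitted because it requires predicted scores rather than hard labels. The label vectors are made-up values for illustration, not results from the experiments.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, MAE, and R^2 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Mean absolute error between true and predicted labels.
    mae = sum(abs(t - p) for t, p in pairs) / n

    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in pairs)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot if ss_tot else 0.0

    return {"A": accuracy, "P": precision, "R": recall,
            "F1": f1, "MAE": mae, "R2": r2}


# Hypothetical labels: 4 blast cells and 4 non-blast cells.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(classification_metrics(y_true, y_pred))
```

For hard 0/1 predictions, the MAE equals the misclassification rate, which is why the algorithms with the highest accuracy also show the lowest MAE.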

The classification results of k-means clustering, FCM clustering, PSO-based clustering, and the proposed HSCRKM clustering algorithms are presented in Tables 3–6, respectively. The performance of the segmentation algorithms was analyzed through different machine learning prediction methods. The experimental results show that the proposed HSCRKM clustering algorithm performs better than the existing algorithms. On a closer look at the overall performance of the proposed method, it is observed that logistic regression and neural network perform well when compared to other prediction algorithms and also produce the highest classification accuracy, i.e., 93%. It is also observed that the naive Bayes method produces the lowest classification accuracy rate, i.e., 58%.

Table 3 presents the performance analysis of k-means clustering. The LR, NN, and RF algorithms produce the highest classification accuracy of 79%. The NB algorithm gives the minimum accuracy of 65%. KNN and DT produce 72% accuracy and SVM produces 74% accuracy. The overall performance of k-means clustering was 69%, which is computed by the average accuracy of all the datasets with all the ML algorithms.

Table 4 presents the performance analysis of FCM clustering. The LR, DT, and RF algorithms achieve the maximum accuracy value of 88%. Accordingly, they also give the lowest mean absolute error (MAE) value. Similar to k-means clustering, the NB algorithm gives the lowest accuracy value of 81% when compared to other algorithms. The SVM and NN give the accuracy of 83% and 84%, respectively. The overall accuracy of FCM clustering is 77%.


**Table 3.** Performance analysis of k-means clustering. A: accuracy, AUC: area under the receiver operating characteristic curve, DT: decision tree, KNN: k-nearest neighborhood, LR: logistic regression, MAE: mean absolute error, NB: naive Bayes, NN: neural network, P: precision, R: recall, RF: random forest.

**Average Overall Accuracy 69%**

**Table 4.** Performance analysis of FCM clustering.



Table 5 shows the efficiency of the algorithm for PSO-based clustering. In this table, it is noted that the NN method attains 90% accuracy. The LR, SVM, KNN, and RF methods give above 80% of the classification accuracy. The NB algorithm again provides the minimum accuracy of 67%. The overall classification accuracy of PSO-based clustering is 78%.


**Table 5.** Performance analysis of PSO-based clustering.


The performance analysis of the HSCRKM algorithm is shown in Table 6. The LR, NN, and DT algorithms achieve 93% classification accuracy. NB, KNN, and RF give accuracy values of 84%, 85%, and 86%, respectively. It is also interesting to note that the SVM gives the minimum accuracy, i.e., 84%. The overall accuracy of the HSCRKM algorithm is 82%. The proposed method thus improves the overall accuracy by 13 percentage points over k-means clustering, 5 over FCM, and 4 over PSO-based clustering. This indicates that more accurate segmentation produces better prediction performance. The experimental results show that the HSCRKM algorithm accurately segments the nucleus. In the literature, various authors report above 90% accuracy; however, they use a very small number of images for their experiments. In this research, around 350 images are used to evaluate the performance of the proposed HSCRKM algorithm.


**Table 6.** Performance analysis of the HSCRKM algorithm.


Figure 5 shows the overall prediction accuracy for various machine learning algorithms. With respect to k-means clustering, all the machine learning algorithms produce the lowest prediction accuracy, i.e., below 80%. It is noted that with respect to PSO and FCM, some of the ML methods (i.e., logistic regression, random forest, and decision tree) attain above 80% prediction accuracy. With respect to the HSCRKM clustering algorithm, most of the ML methods (except naive Bayes) achieve above 80% prediction accuracy. It can also be inferred that the proposed HSCRKM clustering algorithm efficiently segments the nucleus, and the extracted features (based on the segments) probably increase the prediction accuracy. To interpret the experimental results, we empirically set the best accuracy range as above 80%.

**Figure 5.** Overall prediction accuracy.

#### *4.4. Performance Assessments of Machine Learning Algorithms*

#### 4.4.1. Kappa Statistics

Figure 6 compares the performance of the various prediction algorithms under the proposed HSCRKM algorithm in terms of Cohen's kappa value [42], a statistical measure of inter-rater reliability that is used here to evaluate the classifiers. The reliability rate is interpreted on a 0 to 1 scale, where "1" means perfect agreement and lower values mean progressively weaker agreement. For the shape and color-based feature dataset, the proposed algorithm yields kappa values in the substantial agreement range [43] (i.e., 0.61 to 0.80) across all the prediction algorithms studied. Compared with other machine learning algorithms, neural networks have the capability to learn and model nonlinear and complex relationships. They can also capture interactions between predictor variables, and multiple training algorithms are available for them. From the figure, it is noted that the neural network algorithm produces the highest kappa values (i.e., 0.67 to 0.85), which indicates substantial to almost perfect agreement for prediction. It also produces the highest classification accuracy when compared with the other machine learning algorithms.

**Figure 6.** Kappa value for HSCRKM clustering.
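As an illustration of how a kappa value of this kind is obtained, the following minimal sketch computes Cohen's kappa from a confusion matrix as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function names and the confusion-matrix counts are hypothetical and are not taken from the experiments; the interpretation bands are the commonly used Landis and Koch ranges, which include the 0.61 to 0.80 "substantial" range cited above.

```python
def cohens_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no samples")

    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total

    # Chance agreement: sum over classes of (row marginal * column marginal).
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Map a kappa value to the commonly used Landis and Koch agreement bands."""
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical counts for a two-class problem (cancerous vs. non-cancerous).
example = [[45, 5],
           [10, 40]]
kappa = cohens_kappa(example)
print(round(kappa, 2), agreement_band(kappa))
```

Because kappa discounts the agreement expected by chance, it is a stricter measure than raw accuracy: in the hypothetical example, 85% of the samples are classified correctly, yet the kappa value is lower because half of that agreement could arise by chance.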

#### 4.4.2. ROC Curve Analysis

Receiver operating characteristic (ROC) curve analysis is a widely used validation method to evaluate the diagnostic ability of the various prediction algorithms [44]. The curve is generated by plotting the true positive rate against the false positive rate as the decision threshold is varied. The closer the ROC curve of a prediction algorithm lies to the top left corner, the more accurately the algorithm predicts disease; the closer it lies to the diagonal line, the less accurate the algorithm is. Figure 7 depicts the ROC curve analysis for the proposed HSCRKM algorithm. The ROC curve is generated for all the extracted datasets, namely GLCM\_0, GLCM\_45, GLCM\_90, GLCM\_135, and Shape\_Colour. From Figure 6, we inferred that the shape and color-based feature dataset produces the highest accuracy values when compared with the other datasets. It is noted that decision tree, random forest, and SVM attain similar prediction accuracy, so their curves follow a similar path. It is also noted that the neural network (NN) and logistic regression (LR) algorithms perform better than the other machine learning algorithms: their curves lie almost in the top left corner of the graph. The naive Bayes curve lies near the diagonal line, so this method probably attains the lowest accuracy of the ML algorithms.


**Figure 7.** ROC curve analysis for HSCRKM clustering.
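To make the construction of such a curve concrete, the following minimal sketch sweeps a decision threshold over classifier scores, records the (false positive rate, true positive rate) pair at each step, and estimates the area under the curve with the trapezoidal rule. The function names, labels, and scores are hypothetical and only illustrate the general technique; they do not reproduce the experiments reported here.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping a threshold from high to low scores.

    labels: 1 for the positive class, 0 for the negative class.
    scores: higher values mean the classifier is more confident in the positive class.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal-rule area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels and scores for six samples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(area_under_curve(curve))
```

A curve that hugs the top left corner gives an area close to 1, while a curve along the diagonal gives an area close to 0.5, which is why the area is often reported alongside the plot as a single summary number.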

### **5. Conclusions and Future Work**

Clustering is an unsupervised classification method that is widely employed for image segmentation. In the present research, a hybrid histogram-based soft covering rough k-means clustering algorithm is proposed to segment the image of the leukemia nucleus. In this method, the histogram is used to initialize the number of clusters. The main advantage of this method is that it applies the soft covering rough approximation instead of the rough approximation. The soft covering rough set is a new kind of soft rough set that deals efficiently with uncertainties. The results are interpreted in the following two ways. (1) The efficiency of the proposed technique is compared with popular and frequently used clustering algorithms, namely k-means clustering, FCM, and PSO-based clustering. (2) State-of-the-art machine learning (ML) prediction techniques are compared using evaluation metrics.

From the experimental results, it is inferred that with the HSCRKM clustering algorithm, all of the ML methods (except for naive Bayes) achieve above 80% prediction accuracy. It is also noted that logistic regression and neural network provide on average above 90% accuracy and thus perform better than the other prediction methods. The limitation of this method is that for multicolor images such as satellite images, agricultural images, and photographs, the number of peaks in the histogram increases, and consequently the processing time also increases. The method is more suitable for the segmentation of medical images and the extraction of specific portions with high clarity (for deep study). In the future, bio-inspired algorithms could be used to optimize the number of clusters.
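The link between histogram peaks and the number of clusters can be sketched as follows. This is only an illustrative sketch: the peak-detection rule is not restated in this section, so the moving-average smoothing, the window size, the minimum-height threshold, and the function name are all assumptions rather than the rule used by HSCRKM.

```python
def count_histogram_peaks(pixels, levels=256, window=5, min_fraction=0.01):
    """Return gray levels that are local maxima of a smoothed histogram.

    Assumed rule (hypothetical, for illustration): smooth the histogram with a
    moving average, then keep interior local maxima whose smoothed height is at
    least min_fraction of the pixel count. The number of peaks found can serve
    as a candidate number of clusters.
    """
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    # Moving-average smoothing to suppress small spurious peaks.
    half = window // 2
    smooth = []
    for i in range(levels):
        lo, hi = max(0, i - half), min(levels, i + half + 1)
        smooth.append(sum(hist[lo:hi]) / (hi - lo))

    floor = min_fraction * len(pixels)
    peaks = []
    for i in range(1, levels - 1):
        # Strict rise on the left and no rise on the right counts a flat top once.
        if smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1] and smooth[i] >= floor:
            peaks.append(i)
    return peaks


# Synthetic gray-level data with three intensity modes (hypothetical values).
pixels = []
for mode in (40, 128, 220):
    for offset in range(-3, 4):
        pixels += [mode + offset] * ((4 - abs(offset)) * 10)

peaks = count_histogram_peaks(pixels)
print(len(peaks), peaks)
```

Under this assumed rule, an image with many distinct color or intensity modes yields many peaks and therefore many initial clusters, which is consistent with the longer processing time noted above for multicolor images.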

**Author Contributions:** Conceptualization, J.G., A.T.A., and H.I.H.; methodology, J.G., A.T.A.; software, J.G.; validation, J.G., A.T.A., and H.I.H.; formal analysis, A.T.A. and H.I.H.; investigation, H.I.H.; resources, H.I.H.; data curation, J.G.; writing—original draft preparation, J.G., A.T.A., and H.I.H.; writing—review and editing, A.T.A. and H.I.H.; visualization, J.G.; funding acquisition, A.T.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Prince Sultan University, Riyadh, Saudi Arabia.

**Acknowledgments:** The authors would like to thank Prince Sultan University, Riyadh, Saudi Arabia for supporting and funding this work. Special acknowledgment to Robotics and Internet-of-Things Lab (RIOTU) at Prince Sultan University, Riyadh, SA. In addition, the authors wish to acknowledge the editor and anonymous reviewers for their insightful comments, which have improved the quality of this publication.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
