Article

Classification of Hepatitis Viruses from Sequencing Chromatograms Using Multiscale Permutation Entropy and Support Vector Machines

Department of Statistics, Yildiz Technical University, 34220 Istanbul, Turkey
* Author to whom correspondence should be addressed.
Entropy 2019, 21(12), 1149; https://doi.org/10.3390/e21121149
Submission received: 21 October 2019 / Revised: 21 November 2019 / Accepted: 23 November 2019 / Published: 25 November 2019
(This article belongs to the Section Entropy and Biology)

Abstract

Classifying nucleic acid trace files is an important issue in molecular biology research. To obtain better classification performance, the choice of features and of the classifier used to best represent the properties of nucleic acid trace files plays a vital role. In this study, different feature extraction methods based on statistical and entropy theory are utilized to discriminate deoxyribonucleic acid (DNA) chromatograms whose signals are almost impossible to distinguish visually. The extracted features are used as the input feature set for Support Vector Machine (SVM) classifiers with different kernel functions. The proposed framework is applied to a total of 200 hepatitis nucleic acid trace files consisting of Hepatitis B Virus (HBV) and Hepatitis C Virus (HCV) samples. While the statistical-based feature extraction methods represent the properties of hepatitis nucleic acid trace files with descriptive measures such as mean, median and standard deviation, the entropy-based feature extraction methods, including permutation entropy and multiscale permutation entropy, quantify the complexity of these files. The results indicate that using statistical and entropy-based features produces exceptionally high performance, with accuracies reaching nearly 99%, in classifying HBV and HCV.

1. Introduction

Investigating the sequencing of nucleotides from deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) is an important research area in the field of molecular genetics. Although next-generation sequencing platforms have recently become more widely applicable than capillary electrophoresis, capillary electrophoresis studies are still required for the verification of next-generation sequencing results. Since assessing a huge number of subjects with capillary electrophoresis is time-consuming and cost-intensive, it is mostly used in small-sized projects. In order to determine the sequence of the nucleic acid (DNA/RNA) regions of interest, millions of copies are amplified with the process named polymerase chain reaction (PCR). In PCR, an RNA region of interest is also converted to DNA copies. After that, the PCR product is prepared for capillary electrophoresis. As a result, base-calling signals (trace files) are obtained for the bases of DNA, namely Adenine (A), Cytosine (C), Guanine (G), and Thymine (T), which are labeled with four different fluorescent dyes. Different analyses (e.g., mutation analysis, identification of the subtypes of a virus known as genotyping, and determination of species) can be accomplished from the results of a chromatogram that includes the related sequences for the specific purpose.
Sequential data modeling for the purpose of discriminating and classifying DNA chromatograms has become very popular with the rapid development of sequencing techniques in molecular genetics and bioinformatics [1,2,3,4]. While some types of chromatograms can be recognized manually by an expert, many of them are hard to classify without special software. Hepatitis B Virus (HBV) and Hepatitis C Virus (HCV) base-calling signals are two types of hepatitis DNA chromatograms, and distinguishing these signals visually is impossible. Therefore, the classification of hepatitis DNA trace files is an important issue for utilizing resources efficiently. Illustrations of HBV and HCV trace file samples are given in Figure 1 and Figure 2, respectively. These figures show the peaks of bases A, C, G, and T in different colors.
This study deals with the classification of HBV and HCV trace files with support vector machines (SVM) using statistical and entropy-based feature extraction methods. Trace files can be treated as time series and exhibit complex characteristics. In order to measure such complexity, approximate entropy (ApEn) was suggested by Pincus with an application to electroencephalogram (EEG) series [5]. ApEn depends on the length of the series and takes a lower value than expected when the series is short. Since the sample entropy (SampEn) proposed by [6] is not affected by the series length, it is more consistent than ApEn [7]. In addition, SampEn is easier to calculate than ApEn. Permutation entropy (PE) [8] estimates the complexity of non-stationary, noisy and non-linear series by comparing neighboring values. These traditional entropy measures have been utilized for different purposes, especially in fault diagnosis and in vibroarthrographic (VAG) and electroencephalogram (EEG) signal-processing studies. However, none of these entropy measures are applicable to systems which show structures on multiple spatial and temporal scales. In order to estimate multiscale complexity, multiscale entropy (MSE) was first suggested by Costa, Goldberger and Peng for physiologic time series data [9]. The superiority of MSE was then demonstrated on different time series data such as cardiac inter-beat (RR) intervals [10,11] and human gait [12]. MSE uses single-scale SampEn in order to quantify the complexity of coarse-grained series, and different studies have shown that it has some limitations depending on the characteristics (e.g., existence of outliers, stationarity) and length of the series [13,14,15]. A modification of MSE, namely multiscale permutation entropy (MPE), uses PE instead of SampEn, and the procedure is more robust to artifacts and observational noise in the time series data [16]. Apart from techniques based on statistical theory, various researchers have suggested using single and multiscale entropy measures as feature extraction techniques for the classification of sequential data. While some studies investigate the performance of different sophisticated classifiers with features extracted using ApEn, SampEn and/or PE [17,18,19,20,21,22], others handle multiscale-based techniques such as MPE [23,24]. These entropy measures have been used on biological time series data both for quantifying complexity and for extracting classification features. However, there is no work available that uses entropy-based feature extraction methods for DNA trace files, especially for hepatitis DNA trace files. On the other hand, sophisticated classifiers within the concept of machine learning have been investigated in terms of their classification ability in studies of DNA sequencing [25,26,27,28]. Among them, SVM [29,30] has been reported as a powerful classification tool compared with other supervised algorithms in recent years [31], and to the best of our knowledge, none of the hepatitis DNA studies have examined SVM as a classifier.
In this study, a new framework for the classification of HBV and HCV trace files based on features extracted from the four bases (i.e., A, C, G, T) of hepatitis DNA chromatograms is presented. Statistical-based and entropy-based features are extracted from the hepatitis DNA trace files. The statistical-based feature extraction method captures the statistical properties of the four bases belonging to HBV and HCV by computing the mean, median and standard deviation. The entropy-based feature extraction method, based on PE and MPE, quantifies the complexity of these bases. In total, 24 computationally efficient features are extracted, and their different combinations are then fed to SVM with different kernel functions: linear, polynomial (Poly.) and radial basis function (RBF).
The rest of this study is organized as follows. Section 2 includes materials and methods of the study. The proposed framework is also given in this section. Model comparison results are presented in Section 3. A discussion and some concluding remarks are provided in Section 4 and Section 5, respectively.

2. Material and Methods

2.1. Dataset

Hepatitis DNA trace files are obtained with "Phred" [32], a base-calling software widely used in academic and commercial laboratories, embedded in an ABI-3730 capillary sequencer device (Applied Biosystems, Foster City, CA, USA) for DNA sequence traces. The data consist of 200 trace files, of which 96 are HBV and 104 are HCV. The type of hepatitis is taken as the binary dependent variable of the constructed SVM models: a trace file is labeled +1 if it represents HBV and −1 otherwise. Each trace file consists of four base-calling signal time series shaped like Gaussian peaks (bases A, C, G, and T). A typical segment of a DNA trace file is illustrated in Figure 3. Each base-calling signal in the trace file is converted to an array using the "scfread" function of MATLAB 2017a software [33].
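The conversion above is performed with MATLAB's "scfread"; purely as an illustrative alternative (not the authors' code), the following Python sketch extracts the four intensity arrays from a trace stored in ABI (.ab1) format with Biopython, whose abi parser exposes the processed channels under the DATA9–DATA12 keys. The file name is hypothetical, and SCF-format traces would first need conversion, since Biopython does not parse SCF chromatograms.

```python
# Illustrative sketch only: load the four base-calling signals of a
# chromatogram into arrays, analogous to MATLAB's scfread.
# Assumes the trace is available in ABI (.ab1) format; "sample.ab1"
# is a hypothetical file name.
import numpy as np
from Bio import SeqIO

record = SeqIO.read("sample.ab1", "abi")
raw = record.annotations["abif_raw"]

# DATA9..DATA12 hold the four processed intensity channels; FWO_1
# gives the base order of those channels (e.g., b"GATC").
order = raw["FWO_1"].decode()
signals = {base: np.array(raw[f"DATA{9 + k}"], dtype=float)
           for k, base in enumerate(order)}  # e.g., signals["A"]
```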

2.2. Feature Extraction

Correctly identifying the features extracted from the raw data plays a vital role in achieving better classification. Since the intensities of the four base-calling signals differ from each other, the raw trace files cannot be used directly as input for the classification process. For this reason, the raw data should be converted into a mathematical representation that yields a fixed set of values per file. Different methods can be used to represent the raw data. Two types of extraction methods for the arrays obtained from hepatitis DNA trace files are introduced in this study: (1) statistical-based feature extraction and (2) entropy-based feature extraction. The following subsections provide the formulations of how features are extracted from a given base-calling signal based on statistical and entropy theory. All calculations are carried out using MATLAB 2017a software [33].

2.2.1. Statistical-Based Feature Extraction Method

Three statistical features based on descriptive statistical theory, including two central tendency measures (mean and median) and a dispersion measure (standard deviation), are used. These are frequently used statistics that reflect the properties of DNA trace files [26,27].
Let $N$ denote the length of each base-calling signal. The data points (located on the X-axis in Figure 3) correspond to signal intensities (located on the Y-axis in Figure 3), so the base-calling signals A, C, G, and T can be expressed as $y_A(i)$, $y_C(i)$, $y_G(i)$, and $y_T(i)$ for $i = 1, 2, \ldots, N$, respectively. The mean and standard deviation for each base-calling signal $j$ ($j = A, C, G, T$) are given as follows:

$$\mu_j = \frac{1}{N}\sum_{i=1}^{N} y_j(i)$$ (1)

$$\sigma_j = \left(\frac{1}{N}\sum_{i=1}^{N}\left(y_j(i) - \mu_j\right)^2\right)^{1/2}$$ (2)

The intensities of base-calling signal $j$ are ordered and the middle value, $\mathrm{median}_j$, is found by Equation (3), where $j = A, C, G, T$:

$$\mathrm{Prob}\left(y_j(i) \le \mathrm{median}_j\right) = \mathrm{Prob}\left(y_j(i) \ge \mathrm{median}_j\right) = \frac{1}{2}$$ (3)
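As a minimal sketch (not the authors' MATLAB code), the twelve statistical features of one trace file can be computed as follows, assuming the four base-calling signals are already available as arrays, e.g., in the `signals` dictionary built above:

```python
import numpy as np

def statistical_features(signals):
    """Mean, median and standard deviation of each base-calling signal.

    signals: dict mapping base label ('A', 'C', 'G', 'T') to a 1-D
    intensity array. Returns a 12-element feature vector.
    """
    features = []
    for base in "ACGT":
        y = np.asarray(signals[base], dtype=float)
        # np.std uses the 1/N (population) form, matching Equation (2).
        features += [y.mean(), np.median(y), y.std()]
    return np.array(features)
```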

2.2.2. Entropy-Based Feature Extraction Method

Two entropy-based feature extraction methods including PE and MPE are given in this section. The procedures of obtaining PE and MPE for a given base calling signal are presented below.
● Permutation Entropy
The procedure for measuring the PE of a given time series consists of calculating the Shannon entropy (ShEn) after mapping the original series to ordinal patterns. Using ordinal patterns has numerous advantages from different aspects [34]. For a given base-calling signal $j$ ($j = A, C, G, T$), the intensities form a time series $Y_j = \{y_j(i)\}_{i=1,2,\ldots,N}$ of length $N$, whose $m$-dimensional embedding vectors can be expressed as:

$$\mathbf{y}_j(i) = \left\{ y_j(i),\, y_j(i+\tau),\, \ldots,\, y_j\big(i+(m-1)\tau\big) \right\}$$ (4)

where the embedding dimension is denoted by $m$ ($\ge 2$) and the time lag by $\tau \in \mathbb{N}$. Here, $\mathbf{y}_j(i)$ denotes overlapping segments of length $m$. For a given $m$, the number of possible permutations is $m!$, with permutation patterns $\pi_p$, $p = 1, 2, \ldots, m!$. Each $\mathbf{y}_j(i)$ in Equation (4) can be arranged in ascending order such that:

$$y_j\big(i+(r_1-1)\tau\big) \le y_j\big(i+(r_2-1)\tau\big) \le \cdots \le y_j\big(i+(r_m-1)\tau\big)$$ (5)

where $1 \le r_i \le m$. Let the probability distribution over the permutation patterns $\pi$ be denoted by $P(\pi_1), P(\pi_2), \ldots, P(\pi_k)$, where $k \le m!$ and $\sum_{l=1}^{k} P(\pi_l) = 1$. Based on ShEn, the PE of order $m$ is then obtained as:

$$H_{PE}^{j}(m) = -\sum_{\{\pi\}} P(\pi_l)\,\ln\big(P(\pi_l)\big)$$ (6)

When the relative frequencies of all permutation patterns are equal, the probabilities take the value $1/m!$ and the maximum value of $H_{PE}^{j}(m)$, namely $\ln(m!)$, is obtained [35,36]. To make $H_{PE}^{j}(m)$ scale-independent and comparable among different $m$, the normalized PE ($H_{NPE}^{j} \in [0,1]$) is calculated by the following equation:

$$H_{NPE}^{j} = \frac{H_{PE}^{j}(m)}{\ln(m!)}$$ (7)
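For illustration, a minimal Python sketch of normalized PE under the definitions in Equations (4)–(7) is given below (a sketch, not the authors' implementation); it counts ordinal patterns by ranking each embedded vector:

```python
import math
from collections import Counter
import numpy as np

def normalized_permutation_entropy(y, m=3, tau=1):
    """Normalized permutation entropy H_NPE in [0, 1] (Equations (4)-(7))."""
    y = np.asarray(y, dtype=float)
    n_vectors = len(y) - (m - 1) * tau
    patterns = Counter()
    for i in range(n_vectors):
        segment = y[i : i + m * tau : tau]           # embedded vector of length m
        patterns[tuple(np.argsort(segment))] += 1    # its ordinal pattern
    probs = np.array(list(patterns.values())) / n_vectors
    h_pe = -np.sum(probs * np.log(probs))            # Shannon entropy of patterns
    return h_pe / math.log(math.factorial(m))        # normalize by ln(m!)
```

A strongly ordered series (e.g., a monotone ramp) yields a value near 0, while white noise approaches 1.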
● Multiscale Permutation Entropy
The procedure for measuring the MPE of a given intensity series $Y_j = \{y_j(i)\}_{i=1,2,\ldots,N}$ of base-calling signal $j$ ($j = A, C, G, T$) with length $N$ starts with creating a coarse-grained structure. The coarse-graining method introduced by Costa, Goldberger and Peng divides the original time series into non-overlapping windows of increasing length $s$, also called the scale parameter [9]. The $z$-th element of the coarse-grained time series at scale $s$ is obtained by:

$$c_j^{(s)}(z) = \frac{1}{s}\sum_{i=(z-1)s+1}^{zs} y_j(i)$$ (8)

where $1 \le z \le N/s$. Here, $N/s$ is the length of the constructed coarse-grained time series. In the original MSE procedure, SampEn is then calculated for the coarse-grained series. Instead of SampEn, Aziz and Arif suggested using PE (given in Equations (5) and (6)) to calculate the complexity of each coarse-grained series $C_j = \{c_j(z)\}_{z=1,2,\ldots,N/s}$, whose $m$-dimensional embedded vectors can be expressed as follows [16]:

$$\mathbf{c}_j(z) = \left\{ c_j(z),\, c_j(z+\tau),\, \ldots,\, c_j\big(z+(m-1)\tau\big) \right\}$$ (9)

It should be noted that $MPE_j$ reduces to $H_{NPE}^{j}$ when the scale parameter equals 1. $H_{NPE}^{j}$ and $MPE_j$ are the entropy values of the base-calling signals' intensities and are calculated for all bases $j = A, C, G, T$. These entropy measures are used as the features that are included in the SVM classification models.
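Reusing the PE sketch above, MPE under this coarse-graining procedure can be illustrated as follows (again a sketch, not the authors' code):

```python
def multiscale_permutation_entropy(y, s, m=3, tau=1):
    """MPE at scale s: coarse-grain by non-overlapping means (Equation (8)),
    then apply normalized PE. Reduces to H_NPE when s == 1."""
    y = np.asarray(y, dtype=float)
    n_windows = len(y) // s
    coarse = y[: n_windows * s].reshape(n_windows, s).mean(axis=1)
    return normalized_permutation_entropy(coarse, m=m, tau=tau)
```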

2.3. Support Vector Machines

Binary class SVM aims to find the most appropriate hyperplane that separates two classes. The training set X with n samples has the form:
$$X = \left\{ (x_1, y_1), \ldots, (x_n, y_n) \right\}, \quad x_i \in \mathbb{R}^d,\; y_i \in \{-1, +1\}$$ (10)

where $x_i$ denotes the input vectors and $y_i$ the corresponding binary labels [37]. The purpose is to estimate the parameters $w$ and $b$ which define the optimal hyperplane obtained from the decision function $\mathrm{sign}(f(x))$. Here, $f(x)$ is the discriminant function used as the separating hyperplane and can be defined as follows:

$$f(x) = w^{T}x + b, \qquad w \in \mathbb{R}^d \ \text{and} \ b \in \mathbb{R}$$ (11)

where the following constraint should be satisfied for this hyperplane:

$$y_i\left(w^{T}x_i + b\right) \ge +1, \qquad i = 1, \ldots, n$$ (12)

A quadratic optimization problem with objective function $\min \frac{1}{2}\|w\|^2$ and the linear constraints given in Equation (12) is defined in order to obtain a maximum-margin band. Using Lagrange multipliers and the Karush-Kuhn-Tucker conditions, the following dual problem can be obtained:

$$L_d = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^{T}x_j \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0 \ \text{and} \ \alpha_i \ge 0$$ (13)

where the inputs $x_i$ with nonzero $\alpha_i$ are called support vectors, and the values of the $\alpha_i$ are found by applying a quadratic optimization method to Equation (13). After that, the unknown parameters $w$ and $b$ are determined (for more details, see [38]). Slack variables ($\xi_i$) are added to the problem in the case of linearly non-separable data; their sum provides an upper bound on the number of misclassifications.

When the data are linearly separable, the linear SVM described above is applied; otherwise, a non-linear SVM should be preferred. The non-linear SVM outperforms the linear SVM when a complex-structured time series has many features. In non-linear SVM, the inputs are mapped by a specific kernel function into a space in which linear separation becomes possible. The aim is to find the hyperplane with the largest margin in the new space, where the transformation is achieved implicitly by the kernels [39]. In this problem, the penalty parameter of the error term is denoted by $C$, and the term $C\sum_{i=1}^{n}\xi_i$ is added to the objective function [40]. After the transformation of the inputs, a linear SVM problem can be formulated in the new space [41]. Depending on the kernel, Equation (13) is revised as the new dual optimization problem for the non-linear SVM given below:

$$L_d = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j)$$ (14)

where $K(x_i, x_j) = \phi(x_i)^{T}\phi(x_j)$ is the kernel function. Linear, RBF and Poly. kernels are frequently used in SVM, and the preference for one kernel over the others is based on expert knowledge and the data structure. Table 1 shows the formulations of the kernels used in this study, where $\gamma$ and $d$ are the kernel parameters. While only the $C$ parameter can be tuned in linear SVM, $\gamma$ and $d$ can be tuned in addition to $C$ in the RBF and Poly. kernel SVM, respectively.
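As a small illustrative sketch, the three kernel functions of Table 1 can be written directly as:

```python
import numpy as np

# Kernel functions from Table 1 (illustrative sketch; gamma and d are
# the tunable kernel parameters).
def linear_kernel(x, z):
    return x @ z

def rbf_kernel(x, z, gamma):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def poly_kernel(x, z, d):
    return (x @ z + 1.0) ** d
```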

2.4. Performance Evaluation

Different measures are used to evaluate the performance of the SVM models with different kernel functions. Most of these can be derived from a confusion matrix, a 2 × 2 table that holds information about the predicted versus actual class of each observation. A typical confusion matrix is given in Table 2.
In the confusion matrix, TP and TN denote the numbers of correctly classified HBV and HCV trace files, respectively. Sensitivity (Se), sometimes called the TP rate, indicates the proportion of correctly classified HBV trace files. Analogously, specificity (Sp), also called the TN rate, shows the proportion of correctly classified HCV trace files. Accuracy (Acc) gives the proportion of all trace files that are classified correctly. The kappa (κ) statistic is an important agreement measure for assessing the discriminative power of the relevant SVM model. The κ statistic lies in the range [−1, +1], and perfect classification between HBV and HCV trace files is achieved when κ equals 1.
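A minimal sketch of these measures computed from the confusion-matrix counts (not the authors' code; κ is obtained from the observed and chance agreement):

```python
def performance_measures(tp, fn, fp, tn):
    """Acc, Se, Sp and the kappa statistic from confusion-matrix counts."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    se = tp / (tp + fn)   # TP rate: correctly classified HBV
    sp = tn / (tn + fp)   # TN rate: correctly classified HCV
    # Chance agreement for the kappa statistic:
    p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (acc - p_e) / (1 - p_e)
    return acc, se, sp, kappa
```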

2.5. Proposed Framework

The experimental setup of the proposed framework is described in the following steps:
Step 1: Preparing Dataset and Extracting Features
Two hundred trace files belonging to hepatitis DNA are obtained with the Phred software. The hepatitis types (96 traces for HBV and 104 traces for HCV) are labeled as +1 and −1 if the related trace represents HBV or HCV, respectively. In order to extract features for the classification process, all trace files, each containing four base-calling signals, are converted to arrays.
In total, 24 features are extracted by the two feature extraction methods, statistical-based and entropy-based, for the SVM classification. Twelve features are obtained with the statistical-based method and are listed in Table 3, where $\mu_j$, $\sigma_j$ and $\mathrm{median}_j$ denote the mean, standard deviation and median of base-calling signal $j$ ($j = A, C, G, T$), respectively. The remaining 12 features are extracted with the entropy-based method: four of them with single-scale PE and eight with multiscale PE. $MPE_j(2)$ and $MPE_j(3)$ denote the multiscale PE of base-calling signal $j$ with scale parameters $s$ = 2 and $s$ = 3, respectively. The results for higher values of $s$ correlated well with those for $s$ = 2 and $s$ = 3; for this reason, other values of $s$ are not considered. In addition, since choosing the embedding dimension $m$ and the time lag $\tau$ is an important issue which depends on the structure of the time series, Bandt and Pompe suggested using the values $m = 3, 4, \ldots, 7$ and $\tau = 1$ when computing PE and MPE [8]. Nalband, Prince and Agrawal followed this suggestion, using the same values [19]. Thus, these parameters are chosen as $m = 3$ and $\tau = 1$. All calculations are carried out using MATLAB 2017a software [33].
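Putting the pieces together, the full 24-feature vector of one trace file can be sketched with the hypothetical helper functions defined in Section 2.2 (again an illustration, not the authors' MATLAB code):

```python
import numpy as np

def extract_features(signals, m=3, tau=1):
    """24 features per trace file: 12 statistical + 12 entropy-based."""
    stats = statistical_features(signals)  # mean, median, std per base
    entropies = []
    for base in "ACGT":
        y = signals[base]
        entropies.append(normalized_permutation_entropy(y, m, tau))     # PE (s = 1)
        entropies.append(multiscale_permutation_entropy(y, 2, m, tau))  # MPE, s = 2
        entropies.append(multiscale_permutation_entropy(y, 3, m, tau))  # MPE, s = 3
    return np.concatenate([stats, entropies])
```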
Step 2: Creating Training and Testing Dataset
The data are split into training and test sets by random selection, with training ratios of 10%, 20%, 30%, 40%, and 50% for each built SVM model. A grid search over the kernel parameters using 10-fold cross-validation is utilized in order to find a combination of hyper-parameter values that produces high generalization performance. The optimal regularization parameter (i.e., $C$) and kernel function parameters (i.e., $\gamma$ and $d$) are searched over the following values: $C$ = (0, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5), $\gamma$ = (0, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1), and $d$ = (1, 2, 3, 4, 5).
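The study itself performs this step with the caret and kernlab R packages (see below); purely as an illustration of the same grid search with 10-fold cross-validation, an equivalent setup in Python's scikit-learn might look as follows (X and y are hypothetical names for the 200 × 24 feature matrix and the ±1 labels; C = 0 and γ = 0 are omitted because scikit-learn requires strictly positive values):

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# e.g., the 30% training setting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.30, stratify=y)

param_grid = {
    "C": [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5],
    "gamma": [0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```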
Step 3: Performing Classification Process and Evaluating Results
The classification of the hepatitis DNA trace files is performed using SVM with three different kernel functions. In this step, the features extracted with the statistical-based method are first used separately, and then together. Likewise, classification is performed using the PE (i.e., MPE at $s$ = 1), MPE at $s$ = 2 and MPE at $s$ = 3 features separately, and then together. SVM models using the mentioned features are built for each splitting proportion. Then the performance evaluation measures Acc, Se, Sp, and κ are obtained for the training and testing pairs. In addition, the number of support vectors (nSV) generated in the training phase of the relevant SVM model is recorded. This process is run a total of 10 times in order to avoid the effect of the random selection process. Thus, the performance evaluation measures and nSVs are calculated 10 times for each model. $\overline{Acc}$, $\overline{Se}$, $\overline{Sp}$, $\overline{\kappa}$, and $\overline{nSV}$ denote the mean values of Acc, Se, Sp, κ, and nSV, respectively. With the training and testing errors defined as $\varepsilon_{training} = 1 - \overline{Acc}_{training}$ and $\varepsilon_{testing} = 1 - \overline{Acc}_{testing}$, the error of the relevant SVM model is calculated by $\varepsilon_{diff} = |\varepsilon_{training} - \varepsilon_{testing}|$.
The "caret" and "kernlab" packages of R [42,43], run in RStudio (version 1.2.1335; RStudio, Inc., Boston, MA, USA), are used in steps 2 and 3.
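A compact sketch of the step-3 evaluation loop (hypothetical helper names; it averages the accuracy over the 10 random splits and computes $\varepsilon_{diff}$):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def evaluate_model(make_model, X, y, train_size, n_runs=10):
    """Mean training/testing accuracy over repeated random splits and eps_diff."""
    train_acc, test_acc = [], []
    for _ in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_size, stratify=y)
        model = make_model().fit(X_tr, y_tr)
        train_acc.append(model.score(X_tr, y_tr))
        test_acc.append(model.score(X_te, y_te))
    eps_diff = abs((1 - np.mean(train_acc)) - (1 - np.mean(test_acc)))
    return np.mean(train_acc), np.mean(test_acc), eps_diff
```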

3. Results

3.1. Classification with Statistical-Based Features

Table 4 reports the classification performance of SVM models using statistical-based features for 10%, 20%, 30%, 40%, and 50% training sets and their corresponding testing sets.
When the statistical-based features are considered for the 10%, 20%, 30%, 40%, and 50% training samples, the SVM-RBF kernel classifier with the mean and all-statistics features produces the better classification performance in terms of both $\overline{Acc}$ and $\overline{\kappa}$. Additionally, the SVM models built with the mean and all-statistics features indicate high classification accuracies, ranging from nearly 93% to 99%, for all proportions of training samples. All the SVM models (with linear, Poly. and RBF kernels) using the median have the lowest classification performance among the statistical-based features for every training sample. When the difference between the training and testing error values approaches zero, it can be concluded that the model does not suffer from over-fitting. The last column of Table 4 provides $\varepsilon_{diff}$, and these values are close to zero in general. Moreover, Han and Jiang pointed out that the over-fitting problem in classification can be detected by using the expected values of sensitivity and specificity [44]: when these values are complementary, the model is likely to be over-fitted. Table 4 shows that $\overline{Se}$ and $\overline{Sp}$ take non-complementary values.

3.2. Classification with Entropy-Based Features

The classification performance of the SVM models with entropy-based features for 10%, 20%, 30%, 40%, and 50% training sets and their corresponding testing sets is given in Table 5.
For the 10% training samples, the SVM-RBF kernel classifier with the MPE at $s$ = 2 features has the highest performance in terms of $\overline{Acc}$ (95.6%) and $\overline{\kappa}$ (0.911). For the same training proportion, the SVM-RBF kernel classifiers with MPE at $s$ = 3 and with all entropies share the values $\overline{Acc}$ = 95.5% and $\overline{\kappa}$ = 0.909. In the cases of 20% and 30% training, the highest values of $\overline{Acc}$ are obtained with the SVM-Poly. kernel classifier that uses all entropies, 96.6% and 98.3%, respectively. This classifier also produces the highest value of $\overline{\kappa}$ for the 20% and 30% training samples. The results for the 40% training samples show that the SVM-RBF kernel classifier using all entropy-based features achieves the better classification performance in terms of $\overline{Acc}$ and $\overline{\kappa}$ (98.9% and 0.978, respectively). Besides, the SVM-Poly. kernel classifier with all entropy-based features takes the highest values of $\overline{Acc}$ and $\overline{\kappa}$ (98.1% and 0.962, respectively) for the 50% training samples. Additionally, the SVM models using entropy-based features achieve substantial classification performances in all training proportions, with accuracies ranging from nearly 93% to 99%. According to the $\varepsilon_{diff}$, $\overline{Se}$ and $\overline{Sp}$ values, it can be concluded that the over-fitting problem does not appear in the SVM models for the 10%, 20%, 30%, 40%, and 50% training samples. The SVM models using entropy-based features show very low $\varepsilon_{diff}$, ranging from 0.000 to 0.050.

4. Discussion

Sequential data exhibit a complex structure. Due to the difficulty of distinguishing this type of data visually, the classification of sequential data has attracted notable attention from researchers in different areas. Most recent studies deal with the complexity of the system and therefore use various types of entropy to extract features from the raw data. Features which truthfully reflect the behaviour of the data not only reduce the dimensionality of the feature space, but also improve the classification quality.
Recent studies on biological systems have offered novel approaches for extracting features based on single and multiscale entropy measures in order to achieve high classification accuracy. In particular, entropy-based features extracted from EEG signals help researchers in the early diagnosis of epilepsy, different types of sleep disorders, and brain-related disorders such as Alzheimer's disease [45]. Acharya et al. [46] extracted features from EEG signals using ApEn, SampEn, and phase entropies (S1 and S2) for the purpose of detecting epilepsy. After applying different machine learning classification algorithms, it was shown that a fuzzy classifier produced the best classification performance (98%) in terms of the performance measures used in the study. EEG signals collected from the brain were also discriminated with various classifiers after an extraction process including entropy-based methods (i.e., ApEn and SampEn) in another important study [47]. Average ShEn, Renyi's entropy (RE), ApEn, SampEn, and the S1 and S2 entropies were utilized to extract features from focal and non-focal epilepsy EEG signals in the study of Sharma, Pachori and Acharya [48]. It was reported that least squares SVM with a Morlet wavelet kernel function reached an 87% accuracy rate in classifying the signals. For the classification of focal and non-focal EEG, Arunkumar et al. [49] proposed a methodology based on ApEn, SampEn and RE. The extracted features were fed into different classifiers such as Naïve Bayes (NBC), SVM, k-nearest neighbors (KNN), and non-nested generalized exemplars (NNGe). The results demonstrated that NNGe has the best classification performance, with 98% accuracy. A review of entropy-based feature extraction methods for the diagnosis of epilepsy was presented in [50]. To detect epileptic seizures, MSE was utilized as the feature extraction method and an SVM classifier was applied in [51]; the classification accuracies in discriminating seizure, seizure-free and normal EEG signals were found to be higher than 98%. In sleep scoring classification, features were extracted from EEG signals using MSE and SVM-based classifiers were applied in [52]; the overall accuracy rate was found to be 91.4%. To classify sleep stages accurately, Rodríguez-Sotelo et al. [53] proposed a method based on the J-means algorithm with EEG features extracted by fractal dimension, detrended fluctuation analysis, ShEn, ApEn, SampEn, and MSE; the extracted features were optimized with the Q−α method and then fed to the J-means classifier, which achieved an average accuracy rate of 80%. Another important study on sleep disorders used 22 different EEG features, including ApEn, SampEn and PE [54], which were then fed to wavelet transform and SVM classifiers. Recent studies have also shown that features extracted with entropy measures produce high classification performance in classifying human sleep EEG signals with different supervised and unsupervised machine learning methods [55,56,57]. To detect Alzheimer's disease, various EEG features including entropy measures (e.g., ApEn, SampEn, PE) and statistical measures (e.g., mean, variance, standard deviation) were extracted in [58] and then fed to six classifiers including SVM, artificial neural networks, KNN, NBC, and random forests. The proposed method indicated high classification accuracy, ranging from nearly 89% to 97%.
Some of the important studies presented above can be seen as pioneers in classifying sequential data obtained from biological systems, and they demonstrate the usefulness of entropy-based feature extraction methods. On the other hand, an increasing number of studies in recent years have investigated the classification abilities of machine learning-based methods for genomic data [59,60]. Genomics is one of the most important domains in bioinformatics [59], where computational methods must be carefully utilized in order to discover useful but hidden information in biological systems. Extracting a set of features from the bases of DNA and then feeding them to a supervised classifier in order to label DNA trace files (e.g., high/low quality, genotyping of viruses, species identification) is an important step towards high classification accuracy, as in all classification paradigms. To the best of our knowledge, there is no work dealing with entropy-based feature extraction methods for gene sequencing data. In this study, a new framework is proposed to classify hepatitis DNA trace files with SVM using extraction methods based on both statistical and entropy (i.e., PE and MPE) measures. The mathematical formulations of the two extraction methods are introduced. The proposed extraction methods are applied to the hepatitis DNA trace files, and the classification of the files as HBV or HCV is performed via SVM with three different kernel functions.
SVM models built with the median features have low accuracies compared to models with the other statistical-based features. In general, the SVM-RBF kernel classifier using the mean and all-statistics features outperforms the SVM models with the other statistical-based features. On the other hand, the SVM-RBF and SVM-Poly. kernel classifiers using all entropies achieve higher classification performance than the SVM-linear classifier for all training samples except 10%. SVM models using statistical and entropy-based features exhibit very similar classification performance in terms of accuracy.
When the best-performing SVM models for each training proportion are compared, the models with entropy-based features produce lower nSVs than the models with statistical-based features and consequently yield lower complexity in the decision process.
According to Table 5, the SVM-RBF kernel classifiers with entropy-based features have a higher percentage of nSVs compared with the SVM-linear and SVM-Poly. kernel classifiers for all training proportions, which could suggest an over-fitting problem. On the contrary, for each training proportion, the $\varepsilon_{diff}$ values are close to 0, the $\overline{Se}$ values are close to 1, and the $\overline{Sp}$ values are above 0.90. In addition, $\overline{Se}$ and $\overline{Sp}$ do not take complementary values. Thus, according to these values, an over-fitting problem is not expected to arise. On the other hand, the SVM models using all entropies have lower nSVs compared with the models using PE, MPE at $s$ = 2 and MPE at $s$ = 3 separately for training proportions from 30% to 50%. Thus, it can be concluded that fewer parameters are enough to define the hyperplanes for the complexity of the problem in the case of SVM using all entropies. Moreover, the cross-validation utilized in the training phase also contributes to overcoming the over-fitting problem.

5. Conclusions

The results demonstrate that the proposed framework produces remarkable classification performance based on both statistical and entropy features. By integrating this framework into DNA sequencing devices, the autonomous classification of DNA trace files, especially hepatitis DNA trace files that cannot be distinguished visually, can be achieved successfully.
The proposed framework, which offers two different feature extraction methods, demonstrates that SVM models with statistical-based features perform as well as models with entropy-based features. Hence, it is suggested that entropies can be effectively used to extract features from DNA trace files, which produce non-stationary, noisy and non-linear signals. This feature extraction method can be used either alone or combined with other extraction methods in order to obtain higher classification performance.
Although this study is designed for the classification of two classes of trace files (HBV and HCV), further studies can concentrate on multi-class trace files, such as the genotypes (sub-types) of hepatitis and the DNA trace files of other viruses and bacteria. Also, different supervised machine learning methods can be implemented and compared in terms of their classification ability.

Author Contributions

Conceptualization, E.Ö. and Ö.E.A.; Methodology, E.Ö. and Ö.E.A.; Software, E.Ö.; Writing—original draft, E.Ö. and Ö.E.A.; Writing—review & editing, E.Ö. and Ö.E.A.

Ethical Approval

T.C. Biruni University ethics committee; 03.29.2019, number: 2019/27-29.

Funding

There is no funding source for this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Furey, T.S.; Cristianini, N.; Duffy, N.; Bednarski, D.W.; Schummer, M.; Haussler, D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000, 16, 906–914.
  2. Lander, E.S.; Linton, L.M.; Birren, B.; Nusbaum, C.; Zody, M.C.; Baldwin, J.; Devon, K.; Dewar, K.; Doyle, M.; Fitzhugh, W.; et al. Erratum: Initial sequencing and analysis of the human genome: International Human Genome Sequencing Consortium. Nature 2001, 409, 860–921.
  3. Mateos, A.; Dopazo, J.; Jansen, R.; Tu, Y.; Gerstein, M.; Stolovitzky, G. Systematic learning of gene functional classes from DNA array expression data by using multilayer perceptrons. Genome Res. 2002, 12, 1703–1715.
  4. Öz, E.; Kaya, H. Support vector machines for quality control of DNA sequencing. J. Inequalities Appl. 2013, 2013, 85.
  5. Pincus, S.M. Approximate entropy as a measure of system complexity. Proc. Natl. Acad. Sci. USA 1991, 88, 2297–2301.
  6. Richman, J.S.; Moorman, J.R. Physiological time-series analysis using approximate entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. 2000, 278, H2039–H2049.
  7. Li, X.; Ouyang, G.; Richards, D.A. Predictability analysis of absence seizures with permutation entropy. Epilepsy Res. 2007, 77, 70–74.
  8. Bandt, C.; Pompe, B. Permutation entropy: A natural complexity measure for time series. Phys. Rev. Lett. 2002, 88, 174102.
  9. Costa, M.; Goldberger, A.L.; Peng, C.K. Multiscale entropy analysis of complex physiologic time series. Phys. Rev. Lett. 2002, 89, 068102.
  10. Costa, M.; Goldberger, A.L.; Peng, C.K. Multiscale entropy to distinguish physiologic and synthetic RR time series. Comput. Cardiol. 2002, 29, 137–140.
  11. Costa, M.; Goldberger, A.L.; Peng, C.K. Multiscale entropy analysis of biological signals. Phys. Rev. E 2005, 71, 021906.
  12. Costa, M.; Peng, C.K.; Goldberger, A.L.; Hausdorff, J.M. Multiscale entropy analysis of human gait dynamics. Phys. A 2003, 330, 53–60.
  13. Humeau-Heurtier, A. The multiscale entropy algorithm and its variants: A review. Entropy 2015, 17, 3110–3123.
  14. Nikulin, V.V.; Brismar, T. Comment on "Multiscale entropy analysis of complex physiologic time series". Phys. Rev. Lett. 2004, 92, 089803.
  15. Wu, S.D.; Wu, C.W.; Lee, K.Y.; Lin, S.G. Modified multiscale entropy for short-term time series analysis. Phys. A 2013, 392, 5865–5873.
  16. Aziz, W.; Arif, M. Multiscale permutation entropy of physiological time series. In Proceedings of the 9th International Multitopic Conference (INMIC '05), Karachi, Pakistan, 24–25 December 2005; pp. 1018–1021.
  17. Ravelo-García, A.; Navarro-Mesa, J.L.; Casanova-Blancas, U.; Martin-Gonzalez, S.; Quintana-Morales, P.; Guerra-Moreno, I.; Canino-Rodríguez, J.M.; Hernández-Pérez, E. Application of the permutation entropy over the heart rate variability for the improvement of electrocardiogram-based sleep breathing pause detection. Entropy 2015, 17, 914–927.
  18. Nalband, S.; Sundar, A.; Prince, A.A.; Agrawal, A. Feature selection and classification methodology for the detection of knee-joint disorders. Comput. Methods Progr. Biomed. 2016, 127, 94–104.
  19. Nalband, S.; Prince, A.A.; Agrawal, A. Entropy-based feature extraction and classification of vibroarthographic signal using complete ensemble empirical mode decomposition with adaptive noise. IET Sci. Meas. Technol. 2018, 12, 350–359.
  20. Nicolaou, N.; Georgiou, J. Detection of epileptic electroencephalogram based on permutation entropy and support vector machines. Expert Syst. Appl. 2012, 39, 202–209.
  21. Ocak, H. Optimal classification of epileptic seizures in EEG using wavelet analysis and genetic algorithm. Signal Process. 2008, 88, 1858–1867.
  22. Song, Y.; Lio, P. A new approach for epileptic seizure detection: Sample entropy based feature extraction and extreme learning machine. J. Biomed. Sci. Eng. 2010, 6, 556–567.
  23. Labate, D.; Palamara, I.; Mammone, N.; Morabito, G.; La Foresta, F.; Morabito, F.C. SVM classification of epileptic EEG recordings through multiscale permutation entropy. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Dallas, TX, USA, 4–9 August 2013; pp. 1–5.
  24. Wu, S.D.; Wu, P.H.; Wu, C.W.; Ding, J.J.; Wang, C.C. Bearing fault diagnosis based on multiscale permutation entropy and support vector machines. Entropy 2012, 14, 1343–1356.
  25. Fung, G.; Mangasarian, O.L.; Shavlik, J.W. Knowledge-based support vector machine classifiers. In Advances in Neural Information Processing Systems; Becker, S., Thrun, S., Obermayer, K., Eds.; MIT Press: Cambridge, MA, USA, 2003.
  26. Öz, E.; Kurt, S.; Asyalı, M.; Yücel, Y. Feature based quality assessment of DNA sequencing chromatograms. Appl. Soft Comput. 2016, 41, 420–427.
  27. Kurt, S.; Öz, E.; Aşkın, Ö.E.; Yücel, Y. Classification of nucleotide sequences for quality assessment using logistic regression and decision tree approaches. Neural Comput. Appl. 2018, 29, 251–261.
  28. Seo, T.K. Classification of nucleotide sequences using support vector machines. J. Mol. Evol. 2010, 71, 250–267.
  29. Cortes, C.; Vapnik, V. Support vector networks. Mach. Learn. 1995, 20, 273–297.
  30. Vapnik, V.N. Statistical Learning Theory; Wiley: New York, NY, USA, 1998.
  31. Bhat, H.F. Evaluating SVM algorithms for bioinformatic gene expression analysis. Int. J. Comp. Sci. Eng. 2017, 6, 42–52.
  32. Ewing, B.; Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8, 186–194.
  33. MATLAB, Version 9.2.0; The MathWorks Inc.: Natick, MA, USA, 2017.
  34. Zunino, L.; Olivares, F.; Scholkmann, F.; Rosso, O.A. Permutation entropy based time series analysis: Equalities in the input signal can lead to false conclusions. Phys. Lett. A 2017, 381, 1883–1892.
  35. Yan, R.; Liu, Y.; Gao, R.X. Permutation entropy: A nonlinear statistical measure for status characterization of rotary machines. Mech. Syst. Signal Proc. 2012, 29, 474–484.
  36. Riedl, M.; Müller, A.; Wessel, N. Practical considerations of permutation entropy. Eur. Phys. J. Spec. Top. 2013, 222, 249–262.
  37. Campbell, C.; Ying, Y. Learning with Support Vector Machines; Morgan & Claypool Publishers: San Rafael, CA, USA, 2011.
  38. Alpaydin, E. Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2004.
  39. Yue, S.; Li, P.; Hao, P. SVM classification: Its contents and challenges. Appl. Math. J. Chin. Univ. 2003, 18, 332–342.
  40. Hsu, C.W.; Chang, C.C.; Lin, C.J. A Practical Guide to Support Vector Classification; Technical Report; Department of Computer Science and Information Engineering, National Taiwan University: Taipei City, Taiwan, 2004; Available online: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf (accessed on 2 November 2019).
  41. Cherkassky, V.; Mulier, F.M. Learning from Data: Concepts, Theory, and Methods; Wiley-Interscience: New York, NY, USA, 1998.
  42. Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 2008, 28, 1–26.
  43. Karatzoglou, A.; Smola, A.; Hornik, K.; Zeileis, A. Kernlab—An S4 Package for Kernel Methods in R. J. Stat. Softw. 2004, 11, 1–20.
  44. Han, H.; Jiang, X. Overcome support vector machine diagnosis overfitting. Cancer Inform. 2014, 13, CIN-S13875.
  45. Amarantidis, L.C.; Abásolo, D. Interpretation of entropy algorithms in the context of biomedical signal analysis and their application to EEG analysis in epilepsy. Entropy 2019, 21, 840.
  46. Acharya, U.R.; Molinari, F.; Sree, S.V.; Chattopadhyay, S.; Ng, K.H.; Suri, J.S. Automated diagnosis of epileptic EEG using entropies. Biomed. Signal Process. Control 2012, 7, 401–408.
  47. Acharya, U.R.; Sree, S.V.; Ang, P.C.A.; Yanti, R.; Suri, J.S. Application of non-linear and wavelet based features for the automated identification of epileptic EEG signals. Int. J. Neural Syst. 2012, 22, 1250002.
  48. Sharma, R.; Pachori, R.B.; Acharya, U.R. Application of entropy measures on intrinsic mode functions for the automated identification of focal electroencephalogram signals. Entropy 2015, 17, 669–691.
  49. Arunkumar, N.; Ramkumar, K.; Venkatraman, V.; Abdulhay, E.; Fernandes, S.L.; Kadry, S.; Segal, S. Classification of focal and non focal EEG using entropies. Pattern Recognit. Lett. 2017, 94, 112–117.
  50. Acharya, U.R.; Fujita, H.; Sudarshan, V.K.; Bhat, S.; Koh, J.E. Application of entropies for automated diagnosis of epilepsy using EEG signals: A review. Knowl. Base Syst. 2015, 88, 85–96.
  51. Bhattacharyya, A.; Pachori, R.B.; Upadhyay, A.; Acharya, U.R. Tunable-Q wavelet transform based multiscale entropy measure for automated classification of epileptic EEG signals. Appl. Signal Process. Meth. Syst. Anal. Physiol. Health 2017, 7, 385.
  52. Tian, P.; Hu, J.; Qi, J.; Ye, X.; Che, D.; Ding, Y.; Peng, Y. A hierarchical classification method for automatic sleep scoring using multiscale entropy features and proportion information of sleep architecture. Biocybern. Biomed. Eng. 2017, 37, 263–271.
  53. Rodríguez-Sotelo, J.L.; Osorio-Forero, A.; Jiménez-Rodríguez, A.; Cuesta-Frau, D.; Cirugeda-Roldán, E.; Peluffo, D. Automatic sleep stages classification using EEG entropy features and unsupervised pattern analysis techniques. Entropy 2014, 16, 6573–6589.
  54. Zhao, D.; Wang, Y.; Wang, Q.; Wang, X. Comparative analysis of different characteristics of automatic sleep stages. Comput. Methods Programs Biomed. 2019, 175, 53–72.
  55. Michielli, N.; Acharya, U.R.; Molinari, F. Cascaded LSTM recurrent neural network for automated sleep stage classification using single-channel EEG signals. Comp. Biol. Med. 2019, 106, 71–81.
  56. Vimala, V.; Ramar, K.; Ettappan, M. An intelligent sleep apnea classification system based on EEG signals. J. Med. Syst. 2019, 43, 36.
  57. Wang, Q.; Zhao, D.; Wang, Y.; Hou, X. Ensemble learning algorithm based on multi-parameters for sleep staging. Med. Biol. Eng. Comput. 2019, 57, 1693–1707.
  58. Tzimourta, K.D.; Giannakeas, N.; Tzallas, A.T.; Astrakas, L.G.; Afrantou, T.; Ioannidis, P.; Grigoriadis, N.; Angelidis, P.; Tsalikakis, D.G.; Tsipouras, M.G. EEG window length evaluation for the detection of Alzheimer's disease over different brain regions. Brain Sci. 2019, 9, 81.
  59. Larrañaga, P.; Calvo, B.; Santana, R.; Bielza, C.; Galdiano, J.; Inza, I.; Lozano, J.A.; Armañanzas, R.; Santafé, G.; Pérez, A.; et al. Machine learning in bioinformatics. Brief Bioinform. 2006, 7, 86–112.
  60. Plewczynski, D.; Tkacz, A.; Wyrwicz, L.S.; Rychlewski, L.; Ginalski, K. AutoMotif server for prediction of phosphorylation sites in proteins using support vector machine: 2007 update. J. Mol. Modeling 2008, 14, 69–76.
Figure 1. A sample of Hepatitis B Virus (HBV) trace file.
Figure 2. A sample of Hepatitis C Virus (HCV) trace file.
Figure 3. A sample of HBV trace file.
Table 1. Kernel functions.

Kernel | $K(x_i, x_j)$
Linear | $x_i^{T}x_j$
Radial basis function | $\exp\left(-\gamma\|x_i - x_j\|^2\right)$, $\gamma > 0$
Polynomial | $\left(x_i^{T}x_j + 1\right)^d$
Table 2. Confusion matrix.

 | Predicted Positive | Predicted Negative
Actual Positive | True positives (TP) | False negatives (FN)
Actual Negative | False positives (FP) | True negatives (TN)
Table 3. Feature descriptions.

Method | Description | Adenine | Cytosine | Guanine | Thymine
Statistical-based | Mean | $\mu_A$ | $\mu_C$ | $\mu_G$ | $\mu_T$
Statistical-based | Median | $\mathrm{median}_A$ | $\mathrm{median}_C$ | $\mathrm{median}_G$ | $\mathrm{median}_T$
Statistical-based | Standard deviation | $\sigma_A$ | $\sigma_C$ | $\sigma_G$ | $\sigma_T$
Entropy-based | PE | $H_{NPE}^{A}$ | $H_{NPE}^{C}$ | $H_{NPE}^{G}$ | $H_{NPE}^{T}$
Entropy-based | MPE with $s$ = 2 | $MPE_A(2)$ | $MPE_C(2)$ | $MPE_G(2)$ | $MPE_T(2)$
Entropy-based | MPE with $s$ = 3 | $MPE_A(3)$ | $MPE_C(3)$ | $MPE_G(3)$ | $MPE_T(3)$
Table 4. Overall performance measures of classifications using statistical-based features (mean values over 10 runs; "tr" = training set, "te" = testing set).

Feature | SVM | Acc (tr) | κ (tr) | Se (tr) | Sp (tr) | nSV | Acc (te) | κ (te) | Se (te) | Sp (te) | ε_diff

Training (10%):
Mean | Linear | 0.983 | 0.966 | 0.969 | 1.000 | 10.9 | 0.961 | 0.923 | 0.927 | 0.999 | 0.022
Mean | Poly. | 0.970 | 0.940 | 0.945 | 1.000 | 11.8 | 0.960 | 0.921 | 0.924 | 1.000 | 0.010
Mean | RBF | 0.995 | 0.989 | 0.992 | 1.000 | 15.3 | 0.980 | 0.960 | 0.987 | 0.973 | 0.015
Median | Linear | 0.793 | 0.532 | 0.669 | 0.861 | 12.9 | 0.708 | 0.425 | 0.590 | 0.844 | 0.085
Median | Poly. | 0.905 | 0.782 | 0.761 | 1.000 | 10.2 | 0.772 | 0.555 | 0.637 | 0.928 | 0.133
Median | RBF | 0.825 | 0.618 | 0.740 | 0.891 | 17.6 | 0.718 | 0.442 | 0.613 | 0.836 | 0.107
Std. dev. | Linear | 0.958 | 0.902 | 0.937 | 0.969 | 10.7 | 0.903 | 0.809 | 0.854 | 0.958 | 0.055
Std. dev. | Poly. | 0.980 | 0.957 | 0.979 | 0.974 | 8.9 | 0.922 | 0.846 | 0.871 | 0.979 | 0.058
Std. dev. | RBF | 0.970 | 0.932 | 0.972 | 0.958 | 15.1 | 0.963 | 0.927 | 0.967 | 0.960 | 0.007
All statistics | Linear | 0.992 | 0.984 | 0.985 | 0.999 | 8.9 | 0.953 | 0.906 | 0.913 | 0.996 | 0.039
All statistics | Poly. | 0.985 | 0.969 | 0.975 | 1.000 | 10.7 | 0.938 | 0.878 | 0.885 | 0.996 | 0.047
All statistics | RBF | 0.990 | 0.979 | 1.000 | 0.977 | 15.5 | 0.972 | 0.945 | 0.994 | 0.948 | 0.018

Training (20%):
Mean | Linear | 0.981 | 0.961 | 0.963 | 1.000 | 18.4 | 0.967 | 0.935 | 0.938 | 0.999 | 0.014
Mean | Poly. | 0.997 | 0.995 | 0.995 | 1.000 | 19.4 | 0.964 | 0.928 | 0.934 | 0.996 | 0.033
Mean | RBF | 0.995 | 0.989 | 0.991 | 1.000 | 30.4 | 0.983 | 0.966 | 0.990 | 0.975 | 0.012
Median | Linear | 0.801 | 0.604 | 0.621 | 0.991 | 22.5 | 0.770 | 0.547 | 0.585 | 0.971 | 0.031
Median | Poly. | 0.905 | 0.796 | 0.892 | 0.890 | 19.2 | 0.834 | 0.671 | 0.812 | 0.863 | 0.071
Median | RBF | 0.832 | 0.661 | 0.674 | 0.982 | 27.2 | 0.773 | 0.554 | 0.620 | 0.945 | 0.059
Std. dev. | Linear | 0.955 | 0.908 | 0.932 | 0.978 | 17.7 | 0.931 | 0.863 | 0.884 | 0.982 | 0.024
Std. dev. | Poly. | 0.987 | 0.973 | 1.000 | 0.971 | 12.7 | 0.947 | 0.895 | 0.926 | 0.970 | 0.040
Std. dev. | RBF | 0.990 | 0.979 | 0.993 | 0.985 | 26.0 | 0.964 | 0.928 | 0.976 | 0.951 | 0.026
All statistics | Linear | 0.989 | 0.978 | 0.979 | 1.000 | 17.0 | 0.970 | 0.941 | 0.945 | 0.998 | 0.019
All statistics | Poly. | 0.990 | 0.979 | 0.979 | 1.000 | 16.3 | 0.970 | 0.941 | 0.945 | 0.998 | 0.020
All statistics | RBF | 0.997 | 0.994 | 1.000 | 0.993 | 30.1 | 0.975 | 0.950 | 0.995 | 0.953 | 0.022

Training (30%):
Mean | Linear | 0.986 | 0.972 | 0.974 | 1.000 | 22.7 | 0.968 | 0.937 | 0.939 | 1.000 | 0.018
Mean | Poly. | 0.995 | 0.989 | 0.990 | 1.000 | 16.8 | 0.976 | 0.952 | 0.958 | 0.995 | 0.019
Mean | RBF | 0.991 | 0.983 | 0.983 | 1.000 | 43.2 | 0.984 | 0.968 | 0.991 | 0.976 | 0.007
Median | Linear | 0.807 | 0.617 | 0.638 | 0.989 | 31.0 | 0.784 | 0.574 | 0.608 | 0.976 | 0.023
Median | Poly. | 0.921 | 0.839 | 0.897 | 0.934 | 20.4 | 0.839 | 0.681 | 0.809 | 0.879 | 0.082
Median | RBF | 0.846 | 0.696 | 0.719 | 0.990 | 36.6 | 0.764 | 0.533 | 0.564 | 0.977 | 0.082
Std. dev. | Linear | 0.954 | 0.908 | 0.925 | 0.984 | 25.5 | 0.936 | 0.872 | 0.897 | 0.978 | 0.018
Std. dev. | Poly. | 0.988 | 0.975 | 0.993 | 0.980 | 14.6 | 0.955 | 0.909 | 0.947 | 0.963 | 0.033
Std. dev. | RBF | 0.985 | 0.969 | 0.987 | 0.982 | 31.6 | 0.972 | 0.945 | 0.986 | 0.958 | 0.013
All statistics | Linear | 0.990 | 0.980 | 0.981 | 1.000 | 21.3 | 0.976 | 0.952 | 0.954 | 0.999 | 0.014
All statistics | Poly. | 0.995 | 0.989 | 0.989 | 1.000 | 22.1 | 0.972 | 0.945 | 0.952 | 0.995 | 0.023
All statistics | RBF | 0.996 | 0.993 | 0.996 | 0.996 | 41.5 | 0.980 | 0.959 | 0.995 | 0.962 | 0.016

Training (40%):
Mean | Linear | 0.984 | 0.969 | 0.970 | 1.000 | 25.0 | 0.970 | 0.941 | 0.944 | 1.000 | 0.014
Mean | Poly. | 0.990 | 0.979 | 0.985 | 0.994 | 29.1 | 0.980 | 0.960 | 0.963 | 0.998 | 0.010
Mean | RBF | 0.993 | 0.987 | 0.987 | 1.000 | 57.3 | 0.990 | 0.979 | 0.991 | 0.987 | 0.003
Median | Linear | 0.818 | 0.636 | 0.651 | 0.990 | 38.5 | 0.789 | 0.586 | 0.624 | 0.975 | 0.029
Median | Poly. | 0.916 | 0.829 | 0.931 | 0.894 | 29.0 | 0.840 | 0.682 | 0.855 | 0.829 | 0.076
Median | RBF | 0.828 | 0.661 | 0.684 | 0.989 | 48.0 | 0.827 | 0.656 | 0.678 | 0.984 | 0.001
Std. dev. | Linear | 0.954 | 0.907 | 0.923 | 0.987 | 33.2 | 0.932 | 0.865 | 0.890 | 0.979 | 0.022
Std. dev. | Poly. | 0.995 | 0.989 | 0.997 | 0.991 | 17.1 | 0.968 | 0.936 | 0.971 | 0.964 | 0.027
Std. dev. | RBF | 0.986 | 0.972 | 0.989 | 0.981 | 33.1 | 0.974 | 0.948 | 0.982 | 0.965 | 0.012
All statistics | Linear | 0.992 | 0.984 | 0.985 | 1.000 | 23.8 | 0.975 | 0.951 | 0.954 | 0.999 | 0.017
All statistics | Poly. | 0.995 | 0.989 | 0.990 | 1.000 | 21.0 | 0.981 | 0.963 | 0.967 | 0.996 | 0.014
All statistics | RBF | 0.996 | 0.992 | 0.995 | 0.997 | 41.5 | 0.984 | 0.968 | 0.983 | 0.984 | 0.012

Training (50%):
Mean | Linear | 0.987 | 0.974 | 0.975 | 1.000 | 25.4 | 0.971 | 0.941 | 0.943 | 1.000 | 0.016
Mean | Poly. | 0.992 | 0.983 | 0.986 | 0.998 | 29.8 | 0.980 | 0.959 | 0.968 | 0.993 | 0.012
Mean | RBF | 0.993 | 0.985 | 0.990 | 0.995 | 73.4 | 0.989 | 0.977 | 0.988 | 0.988 | 0.004
Median | Linear | 0.817 | 0.639 | 0.655 | 0.995 | 47.8 | 0.803 | 0.611 | 0.632 | 0.987 | 0.014
Median | Poly. | 0.913 | 0.823 | 0.959 | 0.857 | 34.4 | 0.867 | 0.735 | 0.900 | 0.837 | 0.046
Median | RBF | 0.841 | 0.679 | 0.677 | 1.000 | 54.4 | 0.802 | 0.614 | 0.643 | 0.992 | 0.039
Std. dev. | Linear | 0.949 | 0.898 | 0.919 | 0.981 | 41.1 | 0.937 | 0.874 | 0.900 | 0.977 | 0.012
Std. dev. | Poly. | 0.984 | 0.967 | 0.990 | 0.976 | 32.2 | 0.970 | 0.939 | 0.965 | 0.974 | 0.014
Std. dev. | RBF | 0.987 | 0.973 | 0.995 | 0.977 | 35.7 | 0.968 | 0.935 | 0.977 | 0.954 | 0.019
All statistics | Linear | 0.995 | 0.990 | 0.991 | 1.000 | 21.3 | 0.980 | 0.961 | 0.963 | 0.999 | 0.015
All statistics | Poly. | 0.998 | 0.995 | 0.996 | 1.000 | 18.0 | 0.979 | 0.957 | 0.966 | 0.992 | 0.019
All statistics | RBF | 0.996 | 0.991 | 0.996 | 0.995 | 37.8 | 0.991 | 0.981 | 0.989 | 0.990 | 0.005
Table 5. Overall performance measures of classifications using entropy-based features (mean values over 10 runs; "tr" = training set, "te" = testing set).

Feature | SVM | Acc (tr) | κ (tr) | Se (tr) | Sp (tr) | nSV | Acc (te) | κ (te) | Se (te) | Sp (te) | ε_diff

Training (10%):
PE | Linear | 0.944 | 0.880 | 0.984 | 0.894 | 10.8 | 0.933 | 0.867 | 0.957 | 0.909 | 0.011
PE | Poly. | 0.960 | 0.919 | 1.000 | 0.921 | 8.4 | 0.950 | 0.900 | 0.977 | 0.921 | 0.010
PE | RBF | 0.965 | 0.928 | 1.000 | 0.931 | 16.1 | 0.950 | 0.900 | 0.994 | 0.902 | 0.015
MPE (s = 2) | Linear | 0.954 | 0.904 | 0.995 | 0.911 | 10.8 | 0.941 | 0.882 | 0.973 | 0.905 | 0.013
MPE (s = 2) | Poly. | 0.995 | 0.990 | 1.000 | 0.990 | 9.7 | 0.945 | 0.890 | 0.954 | 0.935 | 0.050
MPE (s = 2) | RBF | 0.945 | 0.890 | 1.000 | 0.894 | 15.5 | 0.956 | 0.911 | 1.000 | 0.909 | 0.011
MPE (s = 3) | Linear | 0.949 | 0.894 | 0.984 | 0.909 | 11.3 | 0.937 | 0.874 | 0.963 | 0.909 | 0.012
MPE (s = 3) | Poly. | 0.980 | 0.959 | 1.000 | 0.958 | 8.5 | 0.935 | 0.871 | 0.934 | 0.937 | 0.045
MPE (s = 3) | RBF | 0.955 | 0.905 | 1.000 | 0.892 | 15.6 | 0.955 | 0.909 | 1.000 | 0.907 | 0.000
All entropies | Linear | 0.954 | 0.903 | 0.987 | 0.915 | 10.1 | 0.945 | 0.890 | 0.981 | 0.905 | 0.009
All entropies | Poly. | 0.980 | 0.956 | 1.000 | 0.950 | 11.0 | 0.954 | 0.908 | 0.987 | 0.919 | 0.026
All entropies | RBF | 0.970 | 0.938 | 1.000 | 0.940 | 15.7 | 0.955 | 0.909 | 0.994 | 0.911 | 0.015

Training (20%):
PE | Linear | 0.945 | 0.889 | 0.990 | 0.897 | 20.6 | 0.949 | 0.899 | 0.988 | 0.908 | 0.004
PE | Poly. | 0.980 | 0.959 | 1.000 | 0.956 | 12.7 | 0.946 | 0.893 | 0.953 | 0.940 | 0.034
PE | RBF | 0.955 | 0.906 | 1.000 | 0.896 | 31.4 | 0.953 | 0.907 | 1.000 | 0.905 | 0.002
MPE (s = 2) | Linear | 0.946 | 0.889 | 0.992 | 0.894 | 21.2 | 0.950 | 0.901 | 0.989 | 0.909 | 0.004
MPE (s = 2) | Poly. | 0.977 | 0.954 | 0.995 | 0.958 | 16.1 | 0.947 | 0.894 | 0.989 | 0.901 | 0.030
MPE (s = 2) | RBF | 0.955 | 0.909 | 1.000 | 0.907 | 32.9 | 0.955 | 0.909 | 1.000 | 0.906 | 0.000
MPE (s = 3) | Linear | 0.948 | 0.894 | 0.992 | 0.904 | 20.4 | 0.950 | 0.900 | 0.989 | 0.907 | 0.002
MPE (s = 3) | Poly. | 0.970 | 0.938 | 0.991 | 0.952 | 16.9 | 0.949 | 0.897 | 0.982 | 0.912 | 0.021
MPE (s = 3) | RBF | 0.965 | 0.928 | 0.993 | 0.933 | 31.2 | 0.941 | 0.883 | 0.972 | 0.910 | 0.024
All entropies | Linear | 0.952 | 0.902 | 0.989 | 0.911 | 19.8 | 0.950 | 0.900 | 0.988 | 0.909 | 0.002
All entropies | Poly. | 0.987 | 0.971 | 1.000 | 0.964 | 10.6 | 0.966 | 0.932 | 0.977 | 0.955 | 0.021
All entropies | RBF | 0.975 | 0.948 | 1.000 | 0.947 | 34.4 | 0.950 | 0.899 | 0.995 | 0.901 | 0.025

Training (30%):
PE | Linear | 0.949 | 0.897 | 0.989 | 0.906 | 29.4 | 0.948 | 0.896 | 0.987 | 0.906 | 0.001
PE | Poly. | 0.986 | 0.973 | 1.000 | 0.972 | 17.5 | 0.957 | 0.915 | 0.970 | 0.944 | 0.029
PE | RBF | 0.956 | 0.911 | 1.000 | 0.905 | 49.9 | 0.954 | 0.908 | 1.000 | 0.906 | 0.002
MPE (s = 2) | Linear | 0.953 | 0.905 | 0.993 | 0.909 | 28.6 | 0.949 | 0.897 | 0.990 | 0.905 | 0.004
MPE (s = 2) | Poly. | 0.996 | 0.993 | 0.996 | 0.995 | 11.7 | 0.964 | 0.928 | 0.970 | 0.959 | 0.032
MPE (s = 2) | RBF | 0.950 | 0.896 | 1.000 | 0.888 | 50.2 | 0.956 | 0.912 | 1.000 | 0.911 | 0.006
MPE (s = 3) | Linear | 0.950 | 0.899 | 0.993 | 0.904 | 30.7 | 0.950 | 0.899 | 0.989 | 0.907 | 0.000
MPE (s = 3) | Poly. | 0.981 | 0.962 | 0.989 | 0.972 | 15.8 | 0.937 | 0.874 | 0.918 | 0.961 | 0.044
MPE (s = 3) | RBF | 0.963 | 0.924 | 1.000 | 0.917 | 50.0 | 0.951 | 0.902 | 1.000 | 0.901 | 0.012
All entropies | Linear | 0.960 | 0.920 | 0.991 | 0.928 | 28.2 | 0.948 | 0.896 | 0.990 | 0.901 | 0.012
All entropies | Poly. | 0.996 | 0.996 | 1.000 | 0.992 | 10.6 | 0.983 | 0.967 | 0.983 | 0.983 | 0.013
All entropies | RBF | 0.983 | 0.965 | 1.000 | 0.963 | 36.1 | 0.970 | 0.939 | 1.000 | 0.937 | 0.013

Training (40%):
PE | Linear | 0.951 | 0.901 | 0.989 | 0.909 | 39.0 | 0.947 | 0.893 | 0.987 | 0.903 | 0.004
PE | Poly. | 0.990 | 0.979 | 0.995 | 0.982 | 13.5 | 0.969 | 0.938 | 0.968 | 0.970 | 0.021
PE | RBF | 0.948 | 0.896 | 1.000 | 0.891 | 70.0 | 0.959 | 0.917 | 1.000 | 0.914 | 0.011
MPE (s = 2) | Linear | 0.951 | 0.901 | 0.994 | 0.905 | 37.4 | 0.950 | 0.899 | 0.989 | 0.906 | 0.001
MPE (s = 2) | Poly. | 0.996 | 0.992 | 1.000 | 0.991 | 14.6 | 0.961 | 0.923 | 0.960 | 0.963 | 0.035
MPE (s = 2) | RBF | 0.948 | 0.897 | 1.000 | 0.898 | 68.8 | 0.959 | 0.917 | 1.000 | 0.912 | 0.011
MPE (s = 3) | Linear | 0.949 | 0.896 | 0.991 | 0.903 | 40.7 | 0.951 | 0.901 | 0.990 | 0.908 | 0.002
MPE (s = 3) | Poly. | 0.976 | 0.951 | 0.995 | 0.952 | 21.3 | 0.952 | 0.904 | 0.971 | 0.933 | 0.024
MPE (s = 3) | RBF | 0.956 | 0.911 | 1.000 | 0.904 | 68.4 | 0.954 | 0.908 | 1.000 | 0.907 | 0.002
All entropies | Linear | 0.964 | 0.927 | 0.994 | 0.928 | 30.3 | 0.952 | 0.904 | 0.990 | 0.912 | 0.012
All entropies | Poly. | 0.993 | 0.987 | 1.000 | 0.986 | 14.0 | 0.977 | 0.954 | 0.987 | 0.967 | 0.016
All entropies | RBF | 0.996 | 0.992 | 1.000 | 0.991 | 29.3 | 0.989 | 0.978 | 1.000 | 0.977 | 0.007

Training (50%):
PE | Linear | 0.948 | 0.895 | 0.990 | 0.903 | 49.5 | 0.951 | 0.901 | 0.989 | 0.909 | 0.003
PE | Poly. | 0.992 | 0.983 | 0.995 | 0.987 | 11.1 | 0.960 | 0.919 | 0.971 | 0.948 | 0.032
PE | RBF | 0.959 | 0.917 | 1.000 | 0.916 | 76.6 | 0.955 | 0.908 | 1.000 | 0.903 | 0.004
MPE (s = 2) | Linear | 0.952 | 0.904 | 0.996 | 0.905 | 41.4 | 0.950 | 0.900 | 0.989 | 0.907 | 0.002
MPE (s = 2) | Poly. | 0.996 | 0.991 | 0.996 | 0.995 | 13.4 | 0.972 | 0.943 | 0.966 | 0.977 | 0.024
MPE (s = 2) | RBF | 0.947 | 0.893 | 1.000 | 0.893 | 84.4 | 0.963 | 0.924 | 1.000 | 0.919 | 0.016
MPE (s = 3) | Linear | 0.951 | 0.902 | 0.992 | 0.907 | 46.6 | 0.949 | 0.898 | 0.991 | 0.904 | 0.002
MPE (s = 3) | Poly. | 0.984 | 0.967 | 0.996 | 0.969 | 16.0 | 0.946 | 0.891 | 0.960 | 0.931 | 0.038
MPE (s = 3) | RBF | 0.954 | 0.907 | 1.000 | 0.903 | 86.4 | 0.956 | 0.911 | 1.000 | 0.909 | 0.002
All entropies | Linear | 0.966 | 0.932 | 0.994 | 0.935 | 30.8 | 0.956 | 0.912 | 0.991 | 0.919 | 0.010
All entropies | Poly. | 0.999 | 0.997 | 1.000 | 0.997 | 14.6 | 0.981 | 0.962 | 0.991 | 0.971 | 0.018
All entropies | RBF | 0.988 | 0.975 | 1.000 | 0.975 | 41.9 | 0.979 | 0.957 | 1.000 | 0.953 | 0.009
