Article

Representation Learning for Class C G Protein-Coupled Receptors Classification

1 Computer Science Institute, Technological University of the Mixteca Region, 69000 Huajuapan, Oaxaca, Mexico
2 Laboratory of Molecular Neuropharmacology and Bioinformatics, Institut de Neurociències and Unitat de Bioestadística, Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain
3 Network Biomedical Research Center on Mental Health (CIBERSAM), Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain
* Authors to whom correspondence should be addressed.
Molecules 2018, 23(3), 690; https://doi.org/10.3390/molecules23030690
Submission received: 27 February 2018 / Revised: 14 March 2018 / Accepted: 15 March 2018 / Published: 19 March 2018
(This article belongs to the Special Issue Computational Analysis for Protein Structure and Interaction)

Abstract

G protein-coupled receptors (GPCRs) are integral cell membrane proteins of relevance for pharmacology. The complete tertiary structure including both extracellular and transmembrane domains has not been determined for any member of class C GPCRs. An alternative way to work on GPCR structural models is the investigation of their functionality through the analysis of their primary structure. For this, sequence representation is a key factor in the GPCR classification context, where feature engineering is usually carried out. In this paper, we propose the use of representation learning to acquire the features that best represent class C GPCR sequences and, at the same time, to obtain a model for classification automatically. Deep learning methods in conjunction with amino acid physicochemical property indices are used for this purpose. Experimental results assessed by classification accuracy, Matthews' correlation coefficient and the balanced error rate show that using a hydrophobicity index and a restricted Boltzmann machine (RBM) can achieve performance results (accuracy of 92.9%) similar to those reported in the literature. As a second proposal, we combine two or more physicochemical property indices instead of only one as the input for a deep architecture in order to add information from the sequences. Experimental results show that using a combination of three hydrophobicity-related indices improves the classification performance of an RBM (accuracy of 94.1%) beyond the results reported in the literature for class C GPCRs, without using feature selection methods.

1. Introduction

G protein-coupled receptors (GPCRs) are integral cell membrane proteins responsible for translating the molecular signals encoded in the chemical structure of hormones and neurotransmitters from outside to inside the cell. GPCRs share a common structure consisting of seven transmembrane helices (7TM), which are linked by three extracellular and three intracellular loops [1]. The binding of endogenous or synthetic agonists activates the receptor, resulting in conformational changes that allow the allosteric coupling of accessory proteins such as the G protein or β-arrestin at the intracellular part of the receptor [2,3]. Activation of these accessory proteins triggers the series of steps that constitute the signal transduction mechanism, which eventually leads to the observed physiological responses. The human GPCRs have been classified into five main families or classes (glutamate or class C, rhodopsin or class A, adhesion, frizzled or class F/taste2, and secretin or class B) by phylogenetic analysis [4]. Crystallographic determinations of a number of ligand-GPCR complexes have provided insights into the recognition determinants that discriminate between agonists (activators) and antagonists (inhibitors) [5], whereas other techniques such as nuclear magnetic resonance (NMR) [6], fluorescence approaches [7] and molecular dynamics (MD) [8] have led to mechanistic proposals for receptor activation and the allosteric transmission of the signal from the ligand binding site to the G protein or β-arrestin binding sites of the receptor.
GPCRs are at the center of current drug discovery programs: as of November 2017, approximately 35% of drugs approved in the United States or the European Union target GPCRs [9]. There are different criteria for therapeutic drug design. One is selectivity, as it seems appropriate that drugs act selectively through specific receptors. Another is receptor polypharmacology, in which a drug exerts a combination of positive effects by binding to different receptors [10]. Whichever approach is followed, the correct classification of receptors in public databases is fundamental for virtual screening studies and for the examination of receptor functionality in general. To this aim, machine learning methods have proven to be useful [11,12,13,14,15,16,17]. The standard procedure comprises a feature extraction stage, in which any of many ad hoc representations designed by domain experts can be used, followed by a classification stage. For the first stage, there are two main approaches to analyzing GPCR sequences in order to extract the inherent features of the original sequences: multiple alignment and alignment-free representations. Many methods of both kinds have been developed in the literature, achieving good representations, as confirmed by the corresponding classification results [11,12,13,14,18,19,20]. However, the extracted representations are domain-dependent and consider only certain factors (such as frequency or order) of the original sequences.
In recent years, the representation learning field has arisen as an alternative resource for learning representations of the data that make it easier to extract useful information when building classifiers [21]. That is, the main idea is to extract the relevant features (explanatory factors) from the observed data without using feature engineering methods. Following this idea and the good results presented in [22,23,24,25,26], in this paper we use a deep architecture to implicitly represent the explanatory factors of the protein sequences as much as possible and, at the same time, to obtain a model for classification. To this aim, we propose to use aligned GPCR sequences, which are translated into a numeric form by using an amino acid property index [27]. In a first stage, a hydrophobicity-related index is selected (because of its importance in determining the structure and function of GPCRs [14]) as the input for several deep architectures, in order to choose one of them and find its parameters. After that, the preprocessed amino acid index (AAindex) database [19] is used as the input for training the selected deep architecture so as to implicitly represent the explanatory factors of the protein sequences. Experimental results assessed by classification accuracy, Matthews' correlation coefficient (MCC) and the balanced error rate (BER) show that using hydrophobicity index number 531 and a restricted Boltzmann machine (RBM) can achieve performance results (accuracy of 92.9%) similar to those reported in the literature [12,20].
As a second proposal, we hypothesize that using combinations of two or more physicochemical property indices instead of only one might add information from the sequences that a deep architecture can extract and classify in a better way. Experimental results show that using a combination of three hydrophobicity-related indices improves the classification performance of an RBM (accuracy of 94.1%) beyond the results reported in the literature for class C GPCRs, without using feature selection methods. The class C subfamily has been chosen for the present study for structural, functional and therapeutic reasons [28].

2. Materials and Methods

2.1. Datasets

The current study focuses on class C GPCRs, which have become an increasingly important target for new therapies, particularly in areas such as fragile-X syndrome, schizophrenia, Alzheimer’s disease, Parkinson’s disease, epilepsy, L-DOPA-induced dyskinesias, generalized anxiety disorder, migraine, chronic pain, gastroesophageal reflux disorder, hyperparathyroidism, osteoporosis and drug addiction [29].
Given this focus, data were taken from GPCRdb (http://gpcrdb.org/) [30], which is defined as a molecular-class information system that collects, combines, validates and disseminates large amounts of heterogeneous data on GPCRs [31]. GPCRdb divides the GPCR superfamily into five families: class A rhodopsin-like, class B secretin-like, class C metabotropic glutamate/pheromone, vomeronasal receptors (V1R and V3R) and taste receptors (T2R).
Class C GPCRs were selected for analysis because of (i) their structural complexity, (ii) their high sequence length variability and (iii) their therapeutic relevance. Briefly, (i) whereas all GPCRs are characterized by sharing a common seven-transmembrane (7TM) domain, responsible for G protein/β-arrestin activation, most class C GPCRs include, in addition, a large extracellular domain, the Venus flytrap (VFT), and a cysteine-rich domain (CRD) connecting the two [28]. It was not until 2014 that the crystal structures of the 7TM domains of two class C receptors were solved [32,33]. (ii) The full or partial presence of the whole domain structure confers a high sequence length variability on this family. (iii) The involvement of class C GPCRs in many neurological disorders, as previously mentioned, makes this class an attractive target for drug discovery and development.
Class C is, in turn, subdivided into seven types: metabotropic glutamate (mG), calcium sensing (Cs), GABA_B (gB), vomeronasal (Vn), pheromone (Ph), odorant (Od) and taste (Ta). The investigated dataset is available in two forms: unaligned and aligned versions, which can be downloaded as Supplementary Material files. The former and the latter are distributed as shown in Table 1 and Table 2, respectively. The unaligned version is used for experimentation with alignment-free transformations, while the aligned one is used for experimentation with representation learning methods.
When the aligned version is used, each sequence is converted to a basic and numeric form by using an amino acid physicochemical property index taken from the amino acid index (AAindex) database [27]. This database contains three sections: AAindex1, AAindex2 and AAindex3 (Version 9), where AAindex1 contains 544 indices. For our experimentation, we used a preprocessed version of AAindex1, which contains 531 indices. All of them are available as Supporting Information in [19].

2.2. GPCR Representations

There are two main approaches to analyzing GPCR sequences through machine learning methods in order to capture the inherent features of the original sequences: (a) multiple alignment and (b) alignment-free representations. Both have been extensively utilized, depending on the final application or use. Many methods of both kinds have been developed in the literature, achieving good representations, as confirmed by the corresponding classification results [11,12,13,14,18,19,20]. However, most of them are manually designed ad hoc by specific domain experts as a pre-processing step that produces the fixed-length inputs for the classification methods. Therefore, the extracted representations are domain-dependent and consider only certain factors (such as frequency or order) of the original sequence. Consequently, the extracted features may or may not be relevant when they are used for different applications.

Multiple Sequence Alignment and Alignment-Free Representations

A very common preprocessing step for protein classification is multiple sequence alignment (MSA). The output of MSA is a set of sequences of the same length in the one-letter amino acid code. Several MSA methods and tools have been developed for studies of homology and evolutionary relationships between sequences [34,35,36]. In addition, the MSA output can be used as input for machine learning methods applied to classification tasks. Usually, the MSA output is used directly with natural language processing techniques (such as n-grams) or similarity matrix-related techniques. When MSA is used, the protein classification results strongly depend on the characteristics of the information provided by the alignment.
On the other hand, alignment-free protein representations have been defined in the literature in order to capture as much relevant information that might be conveyed by an amino acid sequence as possible. Among these, some rely on transformations based on the amino acid physicochemical characteristics, such as the auto-cross-covariance transformation [37,38].
In this paper, we consider one basic and three advanced alignment-free data transformations to obtain fixed-length vectors as input data for supervised classification algorithms. The corresponding transformed datasets are available as Supplementary Material files. The first and simplest one reflects the amino acid composition (AAcomp) of the primary sequence: the relative frequencies of occurrence of the 20 amino acids are calculated for each sequence, resulting in an $N \times 20$ matrix, where $N$ is the number of sequences in the dataset. This transformation does not take into account the relative positions of amino acids in the sequence.
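For concreteness, a minimal sketch of the AAcomp transformation is given below (toy sequences; the code and names are illustrative, not part of the original study):

```python
# Minimal sketch of the AAcomp transformation: each sequence maps to a
# 20-dimensional vector of relative amino acid frequencies. The sequences
# below are toy examples, not from the paper's dataset.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard one-letter codes

def aacomp(sequence: str) -> np.ndarray:
    """Relative frequency of each of the 20 amino acids in `sequence`."""
    counts = np.array([sequence.count(aa) for aa in AMINO_ACIDS], dtype=float)
    return counts / max(len(sequence), 1)

sequences = ["MKTAYIAKQR", "GAVLIPFMW"]         # toy unaligned sequences
X = np.vstack([aacomp(s) for s in sequences])   # N x 20 feature matrix
print(X.shape)                                  # (2, 20)
```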
The second and third are extensions of the AAcomp, which include sequence-order information. The second is known as pseudo-amino acid composition (PseAA) [39], while the third is formed by a hybrid feature vector, which combines multiscale energy (MSE) and PseAA representations. Both representations have shown a better GPCR classification performance than AAcomp [14,16].
For a GPCR sequence $S = R_1 R_2 \cdots R_L$, where $R_i$ represents the amino acid at position $i$ in the sequence $S$ of length $L$, the PseAA is defined as:

$$\mathrm{PseAA} = [P_1, P_2, \ldots, P_{20}, \ldots, P_{\Lambda}], \qquad (1)$$

where $\Lambda = 20 + n \times \lambda$ ($\lambda = 0, 1, \ldots, m$ is the number of levels used to compute the correlation factors of the amino acids in the sequence, and $n$ is the number of physicochemical properties used as relevant information for the GPCR sequences). Following [14,40,41], we set $\lambda = 21$ as the maximum level and $n = 2$ physicochemical properties (hydrophobicity and hydrophilicity). That is, the PseAA feature vector length is 62, where the first 20 elements are the relative frequencies of occurrence of the 20 amino acids (as in AAcomp), and the remaining elements are the first-level to $\lambda$-level correlation factors of the amino acid sequence for each physicochemical property. In our case, the PseAA transformation of the class C GPCRs was obtained by using the PseAAC server [42].
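As an illustration of the correlation factors, the following is a hedged sketch of an amphiphilic-style PseAA computation; the property standardization and the weight w are assumptions on our part, since the actual vectors were produced by the PseAAC server [42]:

```python
# Hedged sketch of an amphiphilic-style PseAA: 20 relative frequencies plus
# lam tiers of sequence-order correlation factors per property. The weight w
# and the standardization scheme are assumptions; the paper used the PseAAC
# web server. Assumes len(seq) > lam.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def standardize(prop):
    """Zero-mean, unit-variance version of a property dictionary."""
    vals = np.array(list(prop.values()))
    return {aa: (v - vals.mean()) / vals.std() for aa, v in prop.items()}

def pseaa(seq, props, lam=21, w=0.05):
    """Vector of length 20 + len(props) * lam (62 for two properties)."""
    props = [standardize(p) for p in props]
    freqs = np.array([seq.count(aa) for aa in AMINO_ACIDS], float) / len(seq)
    taus = []
    for j in range(1, lam + 1):            # tier-j correlation factors
        for p in props:                    # one factor per property
            taus.append(np.mean([p[seq[i]] * p[seq[i + j]]
                                 for i in range(len(seq) - j)]))
    taus = np.array(taus)
    denom = freqs.sum() + w * taus.sum()   # Chou-style normalization
    return np.concatenate([freqs / denom, w * taus / denom])
```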
The wavelet-based MSE representation of a sequence is defined as:

$$\mathrm{MSE}^{(k)} = [d_1^k, d_2^k, \ldots, d_m^k, a_m^k], \qquad (2)$$

where $k = 1, 2, \ldots, N$ ($N$ is the total number of GPCRs); $d_i^k$ is the root mean square energy of the wavelet detail coefficients at the $i$-th scale; and $a_m^k$ is the root mean square energy of the wavelet approximation coefficients at the $m$-th scale. For this transformation, the GPCR sequences are first converted into a numeric form by using hydrophobicity values taken from the FH scale [43]. The resulting numeric form plays the role of a digital signal to which the (Haar) wavelet transformation is applied. That is, the approximation ($a_m^k$) and detail ($d_i^k$) coefficients are computed, where the maximum decomposition level (scale) $m$ of a sequence is taken as $\log_2(L)$.
Finally, the MSE and PseAA vectors are concatenated to form a hybrid feature vector as follows:

$$\mathrm{PseAA\text{-}MSE} = [P_1, P_2, \ldots, P_{20}, \ldots, P_{\Lambda}, d_1^k, d_2^k, \ldots, d_m^k, a_m^k]. \qquad (3)$$

Further details on computing PseAA and MSE can be found in [14,16,40].
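A sketch of the MSE computation follows, assuming PyWavelets (pywt) for the Haar decomposition; the FH-scale entries shown are placeholders for a few residues only, not the full published scale [43]:

```python
# Hedged sketch of the wavelet-based MSE vector of Equation (2). PyWavelets
# performs the Haar decomposition; FH_SCALE below holds placeholder values
# for a handful of residues only.
import numpy as np
import pywt

FH_SCALE = {"A": 0.31, "L": 1.70, "K": -0.99}      # illustrative subset only

def mse_vector(seq, m=None):
    signal = np.array([FH_SCALE[aa] for aa in seq])
    if m is None:
        m = int(np.log2(len(signal)))               # decomposition level ~ log2(L)
    coeffs = pywt.wavedec(signal, "haar", level=m)  # [a_m, d_m, ..., d_1]
    rms = lambda c: np.sqrt(np.mean(np.square(c)))  # root mean square energy
    a_m, details = coeffs[0], coeffs[1:]
    # reorder as [d_1, ..., d_m, a_m] to match Equation (2)
    return np.array([rms(d) for d in reversed(details)] + [rms(a_m)])

seq = "ALKA" * 8                                    # toy 32-residue sequence
mse = mse_vector(seq)                               # length m + 1 = 6
# Hybrid PseAA-MSE vector of Equation (3), reusing pseaa() sketched above:
# pseaa_mse = np.concatenate([pseaa(seq, props), mse])
```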
The fourth representation, based on the descriptors derived in [44], is the ACC transformation [37,38]. Here, time series models are applied to the protein sequences in order to extract their sequential patterns; consequently, the extracted information is sequence-order dependent. This representation was originally developed in [38] and then applied and modified in [15,37].
The ACC transformation can be described as follows: each sequence is first translated into physicochemical descriptors by representing each amino acid with the five z-scales derived in [44], which are in turn obtained from 26 physicochemical properties. The auto-covariance (AC) and cross-covariance (CC) variables are then computed from the transformed sequences. The AC measures the correlation of the same descriptor, $d$, between two residues separated by a lag, $l$, along the sequence, and it can be calculated as:

$$AC_d(l) = \sum_{i=1}^{n-l} \frac{(v_{d,i} - \bar{v}_d)(v_{d,i+l} - \bar{v}_d)}{(n-l)^p}. \qquad (4)$$
The CC variable measures the correlation of two different descriptors, $d$ and $d'$, between two residues separated by a lag along the sequence, and it can be computed as:

$$CC_{dd'}(l) = \sum_{i=1}^{n-l} \frac{(v_{d,i} - \bar{v}_d)(v_{d',i+l} - \bar{v}_{d'})}{(n-l)^p}, \qquad (5)$$

where $l = 1, \ldots, Lag$, and $Lag$ is the maximal lag, which must be less than the length of the shortest sequence in the dataset; $n$ is the total number of amino acids in the sequence; $v_{d,i}$ is the value of descriptor $d = 1, \ldots, D$ ($D = 5$) of the amino acid at position $i$; $\bar{v}_d$ is the mean value of descriptor $d$ across all positions; and $p$ is the degree of normalization.
From these, the ACC fixed-length vectors are obtained: first, the AC and CC terms from Equations (4) and (5) are concatenated for each lag, $C(l) = [AC(l)\ CC(l)]$, and then the ACC vector for a maximum lag $Lag$ is obtained by concatenating the $C(l)$ terms, that is,

$$ACC(Lag) = [C(1), \ldots, C(Lag)]. \qquad (6)$$

Here, the length of an ACC feature vector is $(\mathrm{length}(AC) + \mathrm{length}(CC)) \times Lag = (5 + 20) \times Lag = 25 \times Lag$. Details of this procedure can be found in [15,37].
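The following hedged sketch implements Equations (4)–(6); a random matrix stands in for the five z-scales of a real sequence, whose published values come from Sandberg et al. [44]:

```python
# Sketch of the ACC transformation of Equations (4)-(6). With D = 5
# descriptors there are 5 AC terms and 20 CC terms per lag, i.e., 25 * Lag
# features in total. The z-scale matrix below is a random stand-in.
import numpy as np

def acc(z, max_lag=13, p=0.5):
    """z: (n_residues, 5) matrix of z-scale descriptors for one sequence."""
    n, D = z.shape
    v = z - z.mean(axis=0)                 # center each descriptor column
    feats = []
    for l in range(1, max_lag + 1):
        denom = (n - l) ** p
        for d in range(D):                 # AC_d(l): same descriptor
            feats.append(np.sum(v[:n - l, d] * v[l:, d]) / denom)
        for d in range(D):                 # CC_dd'(l): d != d'
            for d2 in range(D):
                if d != d2:
                    feats.append(np.sum(v[:n - l, d] * v[l:, d2]) / denom)
    return np.array(feats)

z = np.random.randn(120, 5)                # toy z-scale-encoded sequence
print(acc(z).shape)                        # (325,) for Lag = 13
```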

2.3. GPCR Feature Learning Proposal through the Deep Approach

In recent years, the representation learning field has arisen as an alternative resource for learning representations of the data that makes it easier to extract useful information when building classifiers [21]. That is, the main idea is to extract the relevant features (explanatory factors) from the observed data without using feature engineering methods.
When representation learning methods are applied to GPCR sequences, a fixed-length and as unprocessed as possible representation of them is needed as the input for these methods. For this reason, we take the aligned version (see Table 2) of the GPCR database described in [30], whose sequences have a fixed length of 259.
In our first proposal, each aligned sequence is converted to a basic, numeric form by using an amino acid property index taken from the preprocessed AAindex1 database [19,27]. That is, the sequence $S = R_1 R_2 \cdots R_L$ of length $L$ is now represented by:

$$S' = I_1^k, I_2^k, \ldots, I_L^k, \qquad (7)$$

where $I_i^k$ denotes the numeric value of the amino acid $R_i$ under the $k$-th amino acid property index. When a gap is present in a sequence, it is replaced by a zero value. From Equation (7), it can be seen that neither occurrence-frequency nor order information from $S$ is explicitly included in $S'$.
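A minimal sketch of this conversion (placeholder index values; gaps map to zero, as stated above):

```python
# Minimal sketch of Equation (7): map each aligned position to the value of
# one AAindex property; the gap character '-' maps to zero. The index
# entries below are placeholders, not a real AAindex entry.
import numpy as np

HYDRO_INDEX = {"A": 1.8, "R": -4.5, "N": -3.5}   # illustrative entries only

def encode(aligned_seq, index):
    return np.array([index.get(aa, 0.0) for aa in aligned_seq])  # gap -> 0.0

x = encode("AR-N", HYDRO_INDEX)   # array([ 1.8, -4.5,  0. , -3.5])
```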
In this way, for the class C GPCR dataset, we form $k = 1, 2, \ldots, K$ input datasets, where $K = 531$ is the total number of indices in the preprocessed [19] amino acid property index database [27]. Each $k$-th dataset is used as input for training a deep architecture in order to implicitly represent the explanatory factors of the protein sequences as much as possible and, at the same time, to obtain a model for classification. For illustration, Figure 1 shows how a sequence is used for training a deep architecture. As the figure suggests, different kinds of deep architectures can be used to represent a dataset. In this paper, we experiment with basic and functional architectures, namely (a) autoencoders, (b) convolutional neural networks (CNN) and (c) restricted Boltzmann machines (RBM), in a first stage, in order to select the architecture that best represents the original dataset. In this stage, a hydrophobicity-related index is selected because of its importance in determining the structure and function of GPCRs [14].
After the selection of a deep model, we proceed to find the right number of hidden layers and the number of neurons in each hidden layer by using a grid search over the ranges $[1, 2, \ldots, 10]$ and $[100, 200, 300, 500, 800]$, respectively. The number of neurons in a hidden layer is allowed to be smaller or larger than the number of input neurons in order to permit either compression or expansion of the information from the inputs.
Once the number of hidden layers and the number of neurons in each hidden layer are found, we search a neighborhood of the selected number of neurons per layer in order to refine and confirm the results. Next, we use the best setting (deep model, number of hidden layers and number of neurons per hidden layer) to train a model using each one of the 531 indices from the AAindex database. This process helps to select the physicochemical index that, in conjunction with the selected deep architecture, best represents the explanatory factors of the GPCR sequences.
Now, we hypothesize (as a second proposal) that using two or more physicochemical property indices instead of only one might add information from the sequences that a deep architecture can extract and classify in a better way. This is carried out by combining the physicochemical indices for each amino acid in a sequence. That is, if a GPCR sequence has a length of $L$ (259), after combination its length is $L \times n$, where $n$ is the number of indices. For $n = 2$, the sequence is represented by:

$$S' = I_1^j, I_1^k, I_2^j, I_2^k, \ldots, I_L^j, I_L^k, \qquad (8)$$

where $I_i^j, I_i^k$ denote the combination of the numeric values of the amino acid $R_i$ under the $j$-th and $k$-th amino acid property indices.
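A sketch of this combination, reusing the encode helper sketched after Equation (7):

```python
# Sketch of Equation (8): interleave per-residue values from n property
# indices, so an aligned length-L sequence becomes a length n*L vector.
import numpy as np

def encode_multi(aligned_seq, indices):
    cols = [encode(aligned_seq, idx) for idx in indices]  # n arrays, length L
    return np.column_stack(cols).ravel()  # I_1^j, I_1^k, I_2^j, I_2^k, ...

x2 = encode_multi("AR-N", [HYDRO_INDEX, HYDRO_INDEX])     # length 8 vector
```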

2.4. Deep and Conventional Supervised Learning Methods

In this paper, we experiment with basic and functional deep architectures, namely autoencoders, convolutional neural networks and restricted Boltzmann machines, in a first stage, in order to select the architecture that best represents the original dataset. These architectures help to discover complex structures in datasets, which are used to compute the representations in each layer. These distributed representations lead to improved generalization across different tasks.
Convolutional neural networks have been widely applied to the recognition of objects in digital images. The architecture of a typical deep CNN is structured as a series of convolutional layers and pooling (subsampling) layers. The role of a convolutional layer is to detect local conjunctions of features from the previous layer, whereas the role of a pooling layer is to merge semantically similar features into one [45].
A stacked autoencoder is used mainly to encode the inputs into some representation so that the inputs can be reconstructed from that representation. In practice, the output representation can also be used to initialize a deep neural network for multi-class classification. In this paper, a stochastic version of the autoencoder is used, namely the denoising autoencoder, which avoids learning the identity function [46,47].
A stacked RBM is a particular type of energy-based model with hidden variables, which has the restriction that its neurons must form a bipartite graph. An RBM is formed by a visible input layer and a hidden layer and connections between them, but not within a layer. Usually, the contrastive divergence algorithm is utilized as the unsupervised training procedure to detect features from the inputs [46,48].
For classification tasks, a deep belief network or simply a deep neural network can be constructed by stacking RBMs or autoencoders, where the top layer ($n$) is used as the classifier's output. In a first stage, a deep network of this kind is trained without supervision using $n - 1$ layers to detect the main features of the inputs. After this, the $n$-th layer is added to the network and trained with supervision through the error backpropagation algorithm to perform classification.
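For intuition, a hedged scikit-learn sketch of the stack-then-classify idea follows. scikit-learn only provides a Bernoulli RBM (binary visible units trained by contrastive divergence), not the Gaussian-Bernoulli variant used in this paper, and the pipeline below performs greedy unsupervised pretraining with a supervised top layer rather than full backpropagation fine-tuning of all layers:

```python
# Hedged sketch: two stacked RBMs pretrained greedily by contrastive
# divergence, then a supervised top layer. BernoulliRBM stands in for the
# paper's Gaussian-Bernoulli RBM, so inputs are min-max scaled to [0, 1];
# joint backprop fine-tuning of the whole stack is not shown.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.random((200, 259))           # toy stand-in for encoded sequences
y = rng.integers(0, 7, 200)          # toy labels for the 7 class C types

model = Pipeline([
    ("scale", MinMaxScaler()),
    ("rbm1", BernoulliRBM(n_components=500, learning_rate=0.01, n_iter=20)),
    ("rbm2", BernoulliRBM(n_components=500, learning_rate=0.01, n_iter=20)),
    ("top", LogisticRegression(max_iter=1000)),  # supervised output layer
])
model.fit(X, y)                      # greedy layer-wise pretraining + top layer
print(model.score(X, y))
```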
On the other hand, we also compare the obtained RBM results with some conventional classifiers such as k-nearest neighbor (k-NN), decision tree (DT), multilayer perceptron (MLP) and support vector machine (SVM).
k-NN is one of the simplest classifiers. It finds the k points in the training set that are nearest to the test input, counts how many members of each class are present in the corresponding neighborhood (formed by the k points) and returns the most common class label in that neighborhood.
Another basic classifier is a DT or classification tree. It partitions the feature space into hyperrectangles with sides parallel to the axes and then fits a simple model in each one. That is, the sequence of decisions is applied to individual features. In the resulting (tree-like) structure, an internal node represents a test on a variable or attribute, and a leaf node represents a class label.
MLP is a sophisticated feedforward neural network architecture, which can be trained in a supervised manner through the error backpropagation algorithm. The network contains layers of hidden neurons, which extract meaningful features from the input vectors. Each neuron in the network uses a nonlinear activation function, which helps to model non-linearities in its input-output relation.
A more sophisticated and widely-applied nonlinear classifier is the SVM. It separates the input data points by mapping them into a high-dimensional feature space where a hyperplane is constructed. This hyperplane creates a decision surface that has maximum distance to the nearest points in the feature space. That is, two key concepts are involved in the design of an SVM: large-margin separation and kernel functions. The former means that the constructed hyperplane should be placed as far as possible from the points of the different classes. The latter helps to calculate the similarity between points in the corresponding feature space, which allows an SVM to generate nonlinear decision boundaries [49,50]. In this paper, a radial basis function was used as the kernel (due to the good results presented in [12,20]), and a grid search was carried out to find the regularization parameter $C$ and the kernel width parameter $\gamma$.
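A sketch of such a grid search with scikit-learn (toy data; the search ranges mirror those used in Section 3.1):

```python
# Sketch of an RBF-kernel SVM with a grid search over C and gamma; the
# dataset here is a random stand-in for the transformed GPCR vectors.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((350, 325))                # e.g., ACC-transformed vectors
y = rng.integers(0, 7, 350)               # 7 class C subtypes

param_grid = {"C": list(range(1, 17)),
              "gamma": [2.0 ** e for e in range(-10, 6)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```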

Performance Assessment Measures

The performance measures used in the experiments are classification accuracy, MCC and BER. Accuracy is widely known and used as the proportion of correctly-classified cases. MCC and BER are commonly used as performance measures when the analyzed datasets are class-unbalanced. All of them can be naturally extended from the binary to the multi-class context [51].
Let us assume a classification problem with $S$ samples and $G$ classes, and two functions $tc, pc: S \to \{1, \ldots, G\}$, where $tc(s)$ and $pc(s)$ return the true and the predicted class of sample $s$, respectively. The corresponding square confusion matrix $C$ is:

$$C_{ij} = |\{s \in S : tc(s) = i \text{ and } pc(s) = j\}|, \qquad (9)$$

in which the $ij$-th entry of $C$ is the number of cases of true class $i$ that have been assigned to class $j$ by the classifier. Then, the confusion matrix notation can be used to define the accuracy, BER and MCC as:
$$\mathrm{accuracy} = \frac{\sum_{k=1}^{G} C_{kk}}{\sum_{i,j=1}^{G} C_{ij}}, \qquad (10)$$

$$\mathrm{BER} = \frac{1}{G} \sum_{i=1}^{G} \frac{\sum_{j=1, j \neq i}^{G} C_{ij}}{\sum_{j=1}^{G} C_{ij}}, \qquad (11)$$

$$\mathrm{MCC} = \frac{\displaystyle\sum_{k,l,m=1}^{G} C_{kk} C_{ml} - C_{lk} C_{km}}{\sqrt{\displaystyle\sum_{k=1}^{G} \Big(\sum_{l=1}^{G} C_{lk}\Big) \Big(\sum_{\substack{f,g=1 \\ f \neq k}}^{G} C_{gf}\Big)} \; \sqrt{\displaystyle\sum_{k=1}^{G} \Big(\sum_{l=1}^{G} C_{kl}\Big) \Big(\sum_{\substack{f,g=1 \\ f \neq k}}^{G} C_{fg}\Big)}}. \qquad (12)$$
BER is the average of the per-class errors and takes values in the interval $[0, 1]$: 0 means perfect classification, with no error contribution from any class, and 1 means an extreme misclassification case in which the items of every class are misclassified.
MCC is commonly used in the bioinformatics field and takes values in the interval $[-1, 1]$, where 1 means complete correlation (perfect classification), 0 means no correlation (e.g., all samples assigned to a single class) and $-1$ indicates negative correlation (an extreme misclassification case). MCC is recommended as an optimal tool for practical tasks, since it presents a good trade-off among discriminatory ability, consistency and coherent behavior with varying numbers of classes, unbalanced datasets and randomization [52].
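A compact sketch computing the three measures directly from a confusion matrix (rows index the true class, columns the predicted class); the multi-class MCC below uses the algebraically equivalent row/column-sum form of Equation (12):

```python
# Sketch of accuracy, BER and multi-class MCC from a confusion matrix C,
# following Equations (10)-(12); MCC uses Gorodkin's equivalent closed form.
import numpy as np

def accuracy(C):
    return np.trace(C) / C.sum()

def ber(C):
    per_class_error = 1.0 - np.diag(C) / C.sum(axis=1)  # error per true class
    return per_class_error.mean()

def mcc(C):
    t = C.sum(axis=1)                 # samples per true class
    p = C.sum(axis=0)                 # samples per predicted class
    s, c = C.sum(), np.trace(C)
    num = c * s - t @ p
    den = np.sqrt(s**2 - p @ p) * np.sqrt(s**2 - t @ t)
    return num / den if den else 0.0

C = np.array([[50, 2, 1], [3, 40, 5], [0, 4, 45]])      # toy 3-class matrix
print(accuracy(C), ber(C), mcc(C))
```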

3. Results and Discussion

The experimental results reported in this section aim to assess the ability of representation learning methods to extract the explanatory factors from the observed class C GPCR sequences without using feature engineering methods. For this purpose, two kinds of experimentation are designed.
Firstly, unaligned amino acid sequences are transformed according to the alignment-free transformations described in Section 2.2.1 in order to extract the relevant features that will help to gauge the classification performance of conventional supervised methods. Secondly, aligned amino acid sequences converted to a basic, numeric form are used as input for deep learning methods in order to implicitly represent the explanatory factors of the protein sequences. These models have the characteristic that, at the same time as the representation is extracted, a classification model is also obtained, which is assessed through classification performance.

3.1. Class C GPCRs Classification Using Alignment-Free Representations

The goal of the experiments in this subsection is two-fold. Firstly, we aimed to gauge the ability of the alignment-free amino acid sequence transformations to capture the inherent relevant features of class C GPCR subfamilies through supervised classification models. Secondly, we aimed to compare the performance of four conventional supervised models in terms of classification performance.
For the first set of experiments, the alignment-free transformations described in Section 2.2.1 are used in order to obtain the fixed-length feature vectors of the class C GPCRs unaligned dataset (see Table 1). This means that the AAcomp, PseAA, PseAA-MSE and ACC transformations are computed to obtain the corresponding four datasets as input for classification algorithms.
Following Section 2.2.1, a feature vector of the AAcomp dataset has a length of 20; for the PseAA dataset, the length is 62; and for PseAA-MSE, the length is 74, taking a maximum decomposition level of $m = 11$ ($\log_2(\max\{L_1, L_2, \ldots, L_N\}) \approx 11$, where $L_i$ is the length of sequence $i$). In the case of the ACC transformation, two parameters must be set to adequate values prior to classification: the maximum lag $Lag$ and the degree of normalization $p$. In this study, we set $Lag = 13$ and $p = 0.5$, since the unaligned dataset is almost the same as in [11,12]. Then, the length of an ACC-transformed feature vector is $25 \times Lag = 325$.
For the second set of experiments, we selected two baseline and two more sophisticated (non-linear) classifiers: k-nearest neighbor, decision tree, multilayer perceptron trained with the backpropagation algorithm, and support vector machines. For k-NN, different neighborhood sizes were tried in the range $k = 1, \ldots, 10$. For MLP, different settings for the number of hidden layers ($hl$) and the number of neurons per hidden layer ($nhl$) were used, with $hl = [1, 2, 3, 4, 5]$ and $nhl = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]$. In the case of the SVM classifier, the radial basis function kernel was used, which has two parameters that must be identified in order to accurately predict unknown data: $C$ and $\gamma$. For this, a grid search was carried out over the ranges $C = [1, 16]$ and $\gamma = [2^{-10}, 2^{5}]$, as in [12,20]. For these classifiers, only the parameters that led to the best classification performance are reported.
For all conventional classifiers, the corresponding implementation available in the Weka (Version 3.6) toolbox [53] was used. Weka also allows data preprocessing; here, normalization to the range $[0, 1]$ was carried out using the min-max technique. In order to estimate the average classification performance, 10-fold cross-validation was used.
The average classification accuracy results using the alignment-free representation datasets with the above-described classifiers are shown in Table 3. From these results, SVM is shown to outperform the rest of the classifiers in terms of accuracy, in line with results reported in the literature [12,20].
On the other hand, the alignment-free transformation that best captures relevant features through classifiers is ACC, except for decision trees. It is followed by the PseAA and PseAA-MSE transformations, which indicates the importance of adding sequence-order information to the transformed feature vectors [11,14,16].

3.2. Class C GPCRs Classification Using Representation Learning

The goal of the experiments in this subsection is two-fold. Firstly, we aimed to gauge the ability of representation learning methods to extract the explanatory factors directly from the observed data sequences through deep learning approaches. Secondly, we aimed to compare the performance of deep and conventional learning models in terms of classification performance.
As stated in the first proposal of Section 2.3, the aligned dataset (see Table 2) is converted to a numeric form by using an amino acid property index taken from the AAindex database [19,27]. From the 531 indices, in a first stage we selected a hydrophobicity-related index, because of its importance in determining the structure and function of GPCRs [14]; specifically, hydrophobicity index number 2 was chosen. The resulting dataset is named AAhydro.
Three common deep architectures were selected for experimentation: autoencoders, restricted Boltzmann machines and convolutional neural networks. In order to select the best architecture, to be tuned in a later step, a basic configuration was used for each: two hidden layers with 700 neurons per layer. To estimate the classification performance of the deep models, stratified 10-fold cross-validation was carried out.
The implementations of the deep architectures were taken from [54,55]. The average classification accuracy results of the different deep architectures are shown in Table 4. It is observed that the RBM outperforms the other deep architectures in terms of classification accuracy. It is worth noting that this RBM is modeled through a Gaussian-Bernoulli distribution, which naturally allows real-valued inputs. Although it is widely known that CNNs perform well in image pattern recognition (where large datasets are used), this is not the case for class C GPCR classification, where the amount of available data is not large enough.
From here on, the RBM is selected as the deep architecture, trained with the backpropagation algorithm in which gradient descent is accelerated by Nesterov's method [56].
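For reference, a standard formulation of Nesterov's accelerated gradient (notation ours, following the look-ahead form in [56]; $\mu$ is the momentum coefficient and $\varepsilon$ the learning rate) is:

$$v_{t+1} = \mu v_t - \varepsilon \, \nabla f(\theta_t + \mu v_t), \qquad \theta_{t+1} = \theta_t + v_{t+1},$$

so the gradient is evaluated after the momentum step $\mu v_t$ rather than at the current parameters $\theta_t$, as in classical momentum.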
In order to find the right configuration for the number of hidden layers and the number of neurons per layer of the RBM, an ad hoc, coarse grid search was carried out over the ranges $[1, 2, \ldots, 10]$ and $[100, 200, 300, 500, 800]$, respectively. The corresponding average classification accuracy results of this search are progressively shown in Table 5, Table 6, Table 7 and Table 8.
From Table 5 and Table 6, it is observed that the right number of neurons for the first and second layers is around 500, and the same holds for the third and fourth layers. Table 7 and Table 8 show that no improvement is achieved when more hidden layers are added. We also tried five hidden layers, but the results were worse than those in the previous tables and are therefore not reported.
Since the best results are obtained using 500 neurons in each of two hidden layers, we proceed with a fine grid search around 500 neurons per layer, trying the range [400, 450, 500, 550, 600] for each layer. The average classification accuracy results for this search are shown in Table 9. Again, the best results are obtained with 500 neurons for the first and second layers.
Based on the previously obtained results, we selected two hidden layers with 500 neurons each as the right configuration for the RBM. We now train the selected RBM architecture using each one of the 531 indices from the preprocessed AAindex database [19]. This process helps us to select the amino acid physicochemical property index that, in conjunction with the RBM, best represents the explanatory factors of the class C GPCR sequences.
The average classification results of the 12 amino acid physicochemical property indices with the highest classification accuracy are shown in Table 10. Since the resulting datasets are unbalanced (see Table 2), the MCC and BER measures are also presented in order to compare them with accuracy results.
It is observed from Table 10 that amino acid property index number 531 in conjunction with the RBM represents the explanatory factors of the class C GPCR sequences better than the initial hydrophobicity index number 2. Although indices 531 and 485 have similar accuracy results, the BER measure favors hydrophobicity index number 531, which indicates a lower mean per-class misclassification. Furthermore, this result is similar to those reported in the literature using feature engineering methods with an SVM classifier [12,20], but, in contrast, an RBM learns representations directly from the observed data sequences.
In order to find out to what extent each of the seven class C GPCR types described in Section 2.1 can be discriminated from the rest, and how each of them influences the overall classification performance, Figure 2 presents, for all these types, the four highest accuracy results represented by their corresponding amino acid property indices. The overall pattern of supervised classification is quite stable across amino acid property indices, except for index 166. The tendency is that the odorant and pheromone subfamilies are those that contribute least to the overall classification, a pattern similar to that obtained in [11,12], with a difference (in favor of the RBM) in the vomeronasal subfamily results. In this figure, five out of seven subfamilies (including vomeronasal) show high classification performance. The exception to this pattern is the results for index 166, which indicate that the RBM cannot extract the explanatory factors of calcium sensing receptors but, in contrast, achieves the highest recognition rate for the subfamily that is most difficult to discriminate (odorant).
The results in Figure 2 suggest that if an RBM is fed with information from two or more amino acid property indices instead of one, it could probably extract and represent more of the inherent and hidden information in the GPCR sequences and consequently improve classification performance. Therefore, as a second proposal, we combine two or more amino acid property indices as inputs for the previously selected RBM architecture.
Providing two amino acid property indices to an RBM means that the input sequence is first converted to a numeric form as $I_1^j, I_1^k, I_2^j, I_2^k, \ldots, I_L^j, I_L^k$, where $I_i^j, I_i^k$ denote the combination of the numeric values of the amino acid at position $i$ under the $j$-th and $k$-th amino acid property indices. For the next experiments, we combine pairs of indices from Table 10 in order to reduce the search space.
The average classification results of the five amino acid property index combinations with the highest classification accuracy are shown in Table 11. From this table, it is observed that the combination of indices 65 and 205 in conjunction with the RBM represents the class C GPCR types better than a single index; classification performance thus improves, as confirmed by all the performance measures. This result outperforms the one obtained in [12] using feature engineering methods and is similar to that of [20], obtained with feature selection methods, with the difference that we did not resort to these kinds of methods.
As in the previous experiment, we investigate the class-specific contribution to overall classification for the class C GPCR types, shown in Figure 3. The tendency and pattern described by these results are very similar to those obtained using only one amino acid property index, but this time, the recognition rate of the most difficult subfamily to discriminate is improved.
Next, we proceed with combinations of three or more amino acid property indices. Since the results did not improve when using four or more indices, we only present the performance results for three-index combinations, in Table 12 and Figure 4. The classification results in Table 12 slightly improve on the highest result obtained using a combination of two indices. In particular, the combination of indices 485, 247 and 193 is better than the combination of 65 and 205 in terms of the MCC and BER measures, but the rest of the combinations are not better than those shown in Table 11.
From Figure 4, it is observed that the same pattern described in Figure 2 and Figure 3 is found, including the recognition rate improvement of the odorant type.
A summary of the highest classification performance using one-, two- and three-index combinations is presented in Figure 5. Here, it is observed that the highest results are driven by the ability to recognize (discriminate) the odorant and pheromone subfamilies. According to [11,12], the subfamilies related to the odor function, such as vomeronasal, pheromone and odorant, are the most difficult to discriminate. However, Figure 5 shows that an RBM using one, two or three amino acid property indices can perfectly discriminate the vomeronasal type from the rest. Moreover, an RBM using the 485-247-193 index combination can also recognize the pheromone and odorant subfamilies with high accuracy. These results reveal the important contribution of hydrophobicity-related index combinations to correct amino acid sequence classification. This is not an unexpected result considering that GPCRs are membrane proteins; hydrophobic residues are therefore highly present along the sequence and important for both receptor structure and function [14].
Finally, we compare the highest classification accuracy results of the RBM using one, two and three amino acid property index combinations with conventional supervised classification methods. For this purpose, the datasets obtained with one-, two- and three-index combinations are used as input for classification methods such as SVM, k-NN and DT. The corresponding parameters of SVM and k-NN were set as in Section 3.1, and the best average classification accuracy results are reported in Table 13.
From Table 13, it can be observed that the RBM extracts and represents the inherent and hidden information of class C GPCRs better than conventional classification methods, as confirmed by the accuracy, MCC and BER results. These results outperform those reported in the literature [11,12,20] for class C GPCR classification without using feature selection methods.

4. Conclusions

Given the interest in class C receptors in pharmacology and the absence of much knowledge regarding their complete 3D crystal structures, the investigation of their functionality can be approached through the analysis of their primary structure in the form of amino acid sequences. Many works reported in the literature [11,13,14,16,19,20,37] concur that sequence representation is a key factor for the GPCR classification task. Following this idea, and in contrast to the standard procedure of applying feature engineering methods for sequence representation, this paper proposes the use of the representation learning approach to automatically acquire the features that best represent class C GPCR sequences. That is, the AAindex database is used as the input for training a stacked RBM in order to implicitly represent the explanatory factors of the protein sequences. Experimental results assessed by classification accuracy, MCC and BER show that using hydrophobicity index number 531 in conjunction with an RBM can achieve performance results similar to those reported in the literature. Furthermore, it is also shown that using a combination of three hydrophobicity-related indices improves the classification performance of an RBM beyond the results reported in the literature for class C GPCRs, without using feature selection methods.
In addition, type-specific classification results have shown that the discriminative and representative ability of the stacked RBM for each type varies with the provided amino acid property index combination, while keeping, in general, a stable and consistent classification pattern across all index combinations. Moreover, and importantly for the problem of recognizing the subfamilies related to the odor function, the experimental results indicate that an RBM in conjunction with any of the amino acid physicochemical property index combinations can quite accurately represent and discriminate the vomeronasal type and, specifically with the 485-247-193 index combination, can also recognize the pheromone and odorant subfamilies with high accuracy.
Motivated by the fact that relevant features of two class C GPCR subfamilies (related to the odor function) are difficult to represent and classifiers confuse them, a multi-label learning approach that allows an instance to belong to different classes is considered as future work. Furthermore, a pertinent evaluation of the three hydrophobicity-related index combinations found in this work should be carried out at a biochemical level.

Supplementary Materials

The following are available online, Files S1–S26: datasets used in the experiments and described in the main text.

Acknowledgments

This work was partially supported by the Mexican National Council for Science and Technology (Consejo Nacional de Ciencia y Tecnología (CONACyT)) under the Cátedras Program, Number 1170. Furthermore, it was partially supported by the Spanish Ministerio de Economía y Competitividad under the project SAF2014-5839.

Author Contributions

R.C.-B. and E.-G.R.-P. conceived of and designed the experiments. E.-G.R.-P. performed the experiments. R.C.-B. and J.G. analyzed the data. J.G. and R.C.-B. wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Katritch, V.; Cherezov, V.; Stevens, R.C. Structure-Function of the G Protein–Coupled Receptor Superfamily. Annu. Rev. Pharmacol. Toxicol. 2013, 53, 531–556. [Google Scholar] [CrossRef] [PubMed]
  2. DeVree, B.T.; Mahoney, J.P.; Vélez-Ruiz, G.A.; Rasmussen, S.G.F.; Kuszak, A.J.; Edwald, E.; Fung, J.J.; Manglik, A.; Masureel, M.; Du, Y.; et al. Allosteric coupling from G protein to the agonist-binding pocket in GPCRs. Nature 2016, 535, 182–186. [Google Scholar] [CrossRef] [PubMed]
  3. Cahill, T.J.; Thomsen, A.R.B.; Tarrasch, J.T.; Plouffe, B.; Nguyen, A.H.; Yang, F.; Huang, L.Y.; Kahsai, A.W.; Bassoni, D.L.; Gavino, B.J.; et al. Distinct conformations of GPCR—β-arrestin complexes mediate desensitization, signaling, and endocytosis. Proc. Natl. Acad. Sci. USA 2017, 114, 2562–2567. [Google Scholar]
  4. Fredriksson, R.; Lagerström, M.C.; Lundin, L.G.; Schiöth, H.B. The G-Protein-Coupled Receptors in the Human Genome Form Five Main Families. Phylogenetic Analysis, Paralogon Groups, and Fingerprints. Mol. Pharmacol. 2003, 63, 1256–1272. [Google Scholar] [CrossRef] [PubMed]
  5. Cooke, R.M.; Brown, A.J.; Marshall, F.H.; Mason, J.S. Structures of G protein-coupled receptors reveal new opportunities for drug discovery. Drug Discov. Today 2015, 20, 1355–1364. [Google Scholar] [CrossRef] [PubMed]
  6. Eddy, M.T.; Lee, M.Y.; Gao, Z.G.; White, K.L.; Didenko, T.; Horst, R.; Audet, M.; Stanczak, P.; McClary, K.M.; Han, G.W.; et al. Allosteric Coupling of Drug Binding and Intracellular Signaling in the A2A Adenosine Receptor. Cell 2018, 172, 68–80. [Google Scholar] [CrossRef] [PubMed]
  7. Hill, S.J.; Watson, S.P. Fluorescence Approaches Unravel Spatial and Temporal Aspects of GPCR Organisation, Location, and Intracellular Signalling. Trends Pharmacol. Sci. 2018, 39, 91–92. [Google Scholar] [CrossRef] [PubMed]
  8. Hertig, S.; Latorraca, N.R.; Dror, R.O. Revealing Atomic-Level Mechanisms of Protein Allostery with Molecular Dynamics Simulations. PLoS Comput. Biol. 2016, 12, e1004746. [Google Scholar] [CrossRef] [PubMed]
  9. Sriram, K.; Insel, P.A. GPCRs as targets for approved drugs: How many targets and how many drugs? Mol. Pharmacol. 2018. [Google Scholar] [CrossRef] [PubMed]
  10. Peng, Y.; McCorvy, J.G.; Harpsøe, K.; Lansu, K.; Yuan, S.; Popov, P.; Qu, L.; Pu, M.; Che, T.; Nikolajsen, L.F.; et al. 5-HT2C Receptor Structures Reveal the Structural Basis of GPCR Polypharmacology. Cell 2018, 172, 719–730. [Google Scholar]
  11. Cruz-Barbosa, R.; Vellido, A.; Giraldo, J. The influence of alignment-free sequence representations on the semi-supervised classification of class C G protein-coupled receptors. Med. Biol. Eng. Comput. 2015, 53, 137–149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  12. König, C.; Cruz-Barbosa, R.; Alquézar, R.; Vellido, A. SVM-Based Classification of Class C GPCRs from Alignment-Free Physicochemical Transformations of Their Sequences. In Proceedings of the 17th New Trends in Image Analysis and Processing; Springer: Berlin/Heidelberg, Germany, 2013; pp. 336–343. [Google Scholar]
  13. Karchin, R.; Karplus, K.; Haussler, D. Classifying G-protein coupled receptors with support vector machines. Bioinformatics 2002, 18, 147–159. [Google Scholar] [CrossRef] [PubMed]
  14. Rehman, Z.U.; Khan, A. G-protein-coupled receptor prediction using pseudo-amino-acid composition and multiscale energy representation of different physiochemical properties. Anal. Biochem. 2011, 412, 173–182. [Google Scholar] [CrossRef] [PubMed]
  15. Otaki, J.M.; Mori, A.; Itoh, Y.; Nakayama, T.; Yamamoto, H. Alignment-Free Classification of G-Protein-Coupled Receptors Using Self-Organizing Maps. J. Chem. Inf. Model. 2006, 46, 1479–1490. [Google Scholar] [CrossRef] [PubMed]
  16. Qiu, J.D.; Huang, J.H.; Liang, R.P.; Lu, X.Q. Prediction of G-protein-coupled receptor classes based on the concept of Chou’s pseudo amino acid composition: An approach from discrete wavelet transform. Anal. Biochem. 2009, 390, 68–73. [Google Scholar] [CrossRef] [PubMed]
  17. Liao, Z.; Ju, Y.; Zou, Q. Prediction of G Protein-Coupled Receptors with SVM-Prot Features and Random Forest. Scientifica 2016, 2016, 8309253. [Google Scholar] [CrossRef] [PubMed]
  18. Yang, Y.; Lu, B.; Yang, W. Classification of protein sequences based on word segmentation methods. In Proceedings of the 6th Asia-Pacific Bioinformatics Conference, Kyoto, Japan, 14–17 January 2008; pp. 177–186. [Google Scholar]
  19. Liu, B.; Wang, X.; Chen, Q.; Dong, Q.; Lan, X. Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection. PLoS ONE 2012, 7, e46633. [Google Scholar] [CrossRef] [PubMed]
  20. König, C.; Alquézar, R.; Vellido, A.; Giraldo, J. Reducing the n-gram feature space of class C GPCRs to subtype-discriminating patterns. J. Integr. Bioinform. 2014, 11, 99–115. [Google Scholar] [CrossRef] [PubMed]
  21. Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef] [PubMed]
  22. Lin, Z.; Lanchantin, J.; Qi, Y. MUST-CNN: A Multilayer Shift-and-Stitch Deep Convolutional Architecture for Sequence-based Protein Structure Prediction. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), Phoenix, AZ, USA, 12–17 February 2016; pp. 27–34. [Google Scholar]
  23. Wei, L.; Ding, Y.; Su, R.; Tang, J.; Zou, Q. Prediction of human protein subcellular localization using deep learning. J. Parallel Distrib. Comput. 2017. [Google Scholar] [CrossRef]
  24. Mohamed, A.; Dahl, G.E.; Hinton, G. Acoustic Modeling Using Deep Belief Networks. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 14–22. [Google Scholar] [CrossRef]
  25. Cadieu, C.F.; Hong, H.; Yamins, D.L.K.; Pinto, N.; Ardila, D.; Solomon, E.A.; Majaj, N.J.; DiCarlo, J.J. Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. PLoS Comput. Biol. 2014, 10, e1003963. [Google Scholar] [CrossRef] [PubMed]
  26. Cireşan, D.; Meier, U.; Masci, J.; Schmidhuber, J. Multi-column deep neural network for traffic sign classification. Neural Netw. 2012, 32, 333–338. [Google Scholar] [CrossRef] [PubMed]
  27. Kawashima, S.; Kanehisa, M. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res. 2008, 36, D202–D205. [Google Scholar] [CrossRef] [PubMed]
  28. Pin, J.P.; Galvez, T.; Prézeau, L. Evolution, Structure, and Activation Mechanism of Family 3/C G-protein-coupled receptors. Pharmacol. Ther. 2003, 98, 325–354. [Google Scholar] [CrossRef]
  29. Kniazeff, J.; Prézeau, L.; Rondard, P.; Pin, J.P.; Goudet, C. Dimers and beyond: The functional puzzles of class C GPCRs. Pharmacol. Ther. 2011, 130, 9–25. [Google Scholar] [CrossRef] [PubMed]
  30. Isberg, V.; Vroling, B.; van der Kant, R.; Li, K.; Vriend, G.; Gloriam, D. GPCRDB: An information system for G protein-coupled receptors. Nucleic Acids Res. 2014, 42, D422–D425. [Google Scholar] [CrossRef] [PubMed]
  31. Vroling, B.; Sanders, M.; Baakman, C.; Borrmann, A.; Verhoeven, S.; Klomp, J.; Oliveira, L.; de Vlieg, J.; Vriend, G. GPCRDB: Information system for G protein-coupled receptors. Nucleic Acids Res. 2011, 39 (Suppl. 1), D309–D319. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  32. Wu, H.; Wang, C.; Gregory, K.J.; Han, G.W.; Cho, H.P.; Xia, Y.; Niswender, C.M.; Katritch, V.; Meiler, J.; Cherezov, V.; Conn, P.J.; Stevens, R.C. Structure of a class C GPCR Metabotropic Glutamate Receptor 1 bound to an allosteric modulator. Science 2014, 344, 58–64. [Google Scholar] [CrossRef] [PubMed]
  33. Doré, A.S.; Okrasa, K.; Patel, J.C.; Serrano-Vega, M.; Bennett, K.; Cooke, R.M.; Errey, J.C.; Jazayeri, A.; Khan, S.; Tehan, B.; et al. Structure of class C GPCR metabotropic glutamate receptor 5 transmembrane domain. Nature 2014, 511, 557–562. [Google Scholar] [CrossRef] [PubMed]
  34. Edgar, R.C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32, 1792–1797. [Google Scholar]
  35. Sievers, F.; Wilm, A.; Dineen, D.; Gibson, T.J.; Karplus, K.; Li, W.; Lopez, R.; McWilliam, H.; Remmert, M.; Söding, J.; et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol. 2011, 7, 539. [Google Scholar] [CrossRef] [PubMed]
  36. Notredame, C.; Higgins, D.G.; Heringa, J. T-coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 2000, 302, 205–217. [Google Scholar] [CrossRef] [PubMed]
  37. Lapinsh, M.; Gutcaits, A.; Prusis, P.; Post, C.; Lundstedt, T.; Wikberg, J.E. Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences. Protein Sci. 2002, 11, 795–805. [Google Scholar] [CrossRef] [PubMed]
  38. Wold, S.; Jonsson, J.; Sjörström, M.; Sandberg, M.; Rännar, S. DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least-squares projections to latent structures. Anal. Chim. Acta 1993, 277, 239–253. [Google Scholar] [CrossRef]
  39. Chou, K.C. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 2001, 44, 60. [Google Scholar] [CrossRef]
40. Chou, K.C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 2005, 21, 10–19.
41. Chou, K.C.; Cai, Y.D. Prediction of membrane protein types by incorporating amphipathic effects. J. Chem. Inf. Model. 2005, 45, 407–413.
42. Shen, H.B.; Chou, K.C. PseAAC: A flexible web server for generating various kinds of protein pseudo amino acid composition. Anal. Biochem. 2008, 373, 386–388.
43. Fauchère, J.L.; Pliska, V. Hydrophobic parameters of amino-acid side chains from the partitioning of N-acetyl-amino-acid amides. Eur. J. Med. Chem. 1983, 18, 369–375.
44. Sandberg, M.; Eriksson, L.; Jonsson, J.; Sjöström, M.; Wold, S. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J. Med. Chem. 1998, 41, 2481–2491.
45. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
46. Bengio, Y. Learning deep architectures for AI. Found. Trends Mach. Learn. 2009, 2, 1–127.
47. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P. Extracting and composing robust features with denoising autoencoders. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML'08), Helsinki, Finland, 5–9 July 2008; Cohen, W., McCallum, A., Roweis, S., Eds.; ACM: New York, NY, USA, 2008; pp. 1096–1103.
48. Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554.
49. Vapnik, V.N. Statistical Learning Theory; J. Wiley and Sons: New York, NY, USA, 1998.
50. Ben-Hur, A.; Ong, C.S.; Sonnenburg, S.; Schölkopf, B.; Rätsch, G. Support vector machines and kernels for computational biology. PLoS Comput. Biol. 2008, 4, e1000173.
51. Gorodkin, J. Comparing two K-category assignments by a K-category correlation coefficient. Comput. Biol. Chem. 2004, 28, 367–374.
52. Jurman, G.; Riccadonna, S.; Furlanello, C. A comparison of MCC and CEN error measures in multi-class prediction. PLoS ONE 2012, 7, e41882.
53. Witten, I.H.; Frank, E.; Hall, M.A. Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2011.
54. Rong, X. deepnet: Deep Learning Toolkit in R. Available online: https://cran.r-project.org/web/packages/deepnet/index.html (accessed on 20 December 2017).
55. Apache Software Foundation. MXNet-R API. Available online: https://mxnet.incubator.apache.org/api/r/index.html (accessed on 20 December 2017).
56. Sutskever, I. Training Recurrent Neural Networks. Ph.D. Thesis, Department of Computer Science, University of Toronto, Toronto, ON, Canada, 2013.
Figure 1. Scheme of the proposed deep architecture training approach.
Figure 2. Class-specific percentage of contribution to overall classification for the four amino acid property indices with the highest classification accuracy. Abbreviations: metabotropic glutamate (mG), calcium sensing (Cs), GABA_B (gB), vomeronasal (Vn), pheromone (Ph), odorant (Od) and taste (Ta).
Figure 3. Class-specific percentage of contribution to overall classification for the two-index combinations of amino acid properties with the highest classification accuracy. The tendency of the index-combination results is similar to that shown in Figure 2, but here the Ph and Od subfamilies are better discriminated.
Figure 4. Class-specific percentage of contribution to overall classification for the three-index combinations of amino acid properties with the highest classification accuracy. As in Figure 2 and Figure 3, the tendency of index combination 485-247-193 is similar, but the recognition rate of the Od subfamily is greatly improved.
Figure 5. Class-specific percentage of contribution to overall classification for the best-performing one-, two- and three-index combinations of amino acid properties.
Table 1. Distribution of the unaligned class C GPCRs.

Type                      Number of seq.
Calcium sensing           46
GABA_B                    193
Metabotropic glutamate    321
Odorant                   91
Pheromone                 372
Taste                     65
Vomeronasal               304
Total                     1392
Table 2. Distribution of the aligned class C GPCRs.

Type                      Number of seq.
Calcium sensing           36
GABA_B                    139
Metabotropic glutamate    296
Odorant                   82
Pheromone                 356
Taste                     60
Vomeronasal               230
Total                     1199
Table 3. Average classification accuracy (%) results of four classifiers using alignment-free representation datasets. AAcomp, amino acid composition; PseAA, pseudo-amino acid composition.

Transformation    MLP      SVM      DT       k-NN
AAcomp            83.98    87.64    72.13    84.99
ACC               89.08    91.67    61.35    87.79
PseAA             88.22    88.86    74.21    87.43
PseAA-MSE         87.57    88.51    72.49    88.00
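As a concrete reference point for the transformations in Table 3, the simplest one, AAcomp, maps each sequence to its 20-dimensional relative amino acid composition. The following is a minimal R sketch; the helper name `aa_composition` and the example input are illustrative, not taken from the original pipeline.

```r
# Minimal sketch of the AAcomp transformation from Table 3: the relative
# frequency of each of the 20 standard amino acids in a sequence.
aa_composition <- function(seq) {
  aa <- strsplit(seq, "")[[1]]
  alphabet <- c("A","R","N","D","C","Q","E","G","H","I",
                "L","K","M","F","P","S","T","W","Y","V")
  counts <- table(factor(aa, levels = alphabet))  # non-standard symbols are dropped
  as.numeric(counts) / sum(counts)                # normalize to relative frequencies
}

aa_composition("MKVLLAAG")  # 20-element feature vector for a toy sequence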
Table 4. Average classification results using the amino acid hydrophobicity-related index (AAhydro) set.

Deep architecture               Accuracy (%)
Autoencoder                     71.98
Convolutional neural network    71.68
Restricted Boltzmann machine    86.68
Table 5. Average classification results for an RBM with one hidden layer using the AAhydro set.

Layer 1 (#neurons)    Accuracy (%)
100                   88.11
200                   88.48
300                   88.45
500                   89.33
800                   87.98
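For reproducibility, configurations like those in Table 5 can be approximated with the deepnet R package cited in [54]. The following is a minimal sketch, not the authors' exact setup: `X` is a hypothetical numeric matrix of property-index-encoded, aligned sequences (one row per receptor, values scaled to [0, 1]) and `Y` a hypothetical one-hot matrix of the seven subfamily labels; the hyperparameter values are placeholders.

```r
library(deepnet)

# RBM-pretrained network with one hidden layer of 500 units, matching the
# best-performing row of Table 5.
set.seed(1)
model <- dbn.dnn.train(
  x = X, y = Y,
  hidden = c(500),          # one hidden layer, 500 neurons
  activationfun = "sigm",   # logistic units, as in a binary RBM
  output = "softmax",       # seven-way subfamily classification
  learningrate = 0.1,
  numepochs = 50,
  batchsize = 32,
  cd = 1                    # one contrastive-divergence step for pretraining
)

pred <- nn.predict(model, X)            # class posteriors, one row per sequence
accuracy <- mean(max.col(pred) == max.col(Y))
```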
Table 6. Average classification accuracy results for an RBM with two hidden layers using the AAhydro set.

           Layer 2
Layer 1    100      200      300      500      800
100        90.62    88.24    88.61    87.90    88.45
200        88.24    88.24    88.03    87.98    88.28
300        88.61    88.03    88.19    87.82    88.03
500        87.90    87.98    87.82    89.71    87.98
800        88.45    88.28    88.03    87.98    87.82
Table 7. Average classification accuracy results for an RBM with three hidden layers using the AAhydro set.

                    Layer 3
Layer 1, Layer 2    300      500      800
300, 300            79.06    77.23    78.32
300, 500            77.96    77.27    78.32
500, 300            76.50    78.69    77.23
500, 500            77.96    81.24    77.96
300, 800            83.07    83.07    81.24
800, 300            83.43    82.70    83.06
800, 800            78.69    79.42    79.42
500, 800            81.97    82.70    81.97
800, 500            80.15    78.32    83.07
Table 8. Average classification results for an RBM with four hidden layers using the AAhydro set.

Layer 1, Layer 2, Layer 3, Layer 4    Accuracy (%)
300, 300, 300, 300                    82.70
500, 500, 500, 500                    82.34
800, 800, 800, 800                    81.24
Table 9. Average classification accuracy results for an RBM with two hidden layers using a fine grid search around 500 neurons and the AAhydro set.

           Layer 2
Layer 1    400      450      500      550      600
400        81.24    81.97    81.61    81.60    81.60
450        84.16    77.96    79.41    81.60    80.51
500        83.80    80.51    89.71    81.60    79.41
550        79.78    79.05    79.78    81.24    82.70
600        82.70    79.78    79.05    79.05    81.24
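A grid search like the one summarized in Tables 6 and 9 can be scripted mechanically by looping over candidate layer sizes. A minimal sketch, continuing the previous example under the same assumptions (hypothetical `X` and `Y`; resubstitution accuracy is shown for brevity, whereas the tables report averages over repeated train/test splits):

```r
# Counterpart of Table 9's fine grid search over two hidden-layer sizes.
sizes <- c(400, 450, 500, 550, 600)
grid <- matrix(NA, length(sizes), length(sizes),
               dimnames = list(Layer1 = sizes, Layer2 = sizes))
for (l1 in sizes) {
  for (l2 in sizes) {
    m <- dbn.dnn.train(X, Y, hidden = c(l1, l2), output = "softmax",
                       learningrate = 0.1, numepochs = 50, batchsize = 32)
    grid[as.character(l1), as.character(l2)] <-
      mean(max.col(nn.predict(m, X)) == max.col(Y))
  }
}
grid  # accuracy matrix analogous to Table 9
```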
Table 10. Highest average accuracy results of the restricted Boltzmann machine (RBM) with two hidden layers over 531 amino acid property indices. MCC, Matthews' correlation coefficient; BER, balanced error rate.

Name                                                                     Index    Accuracy (%)    MCC      BER
Hydrophobicity index                                                     531      92.86           91.12    7.71
Principal eigenvector of contact matrices and hydrophobicity profiles    485      92.82           91.02    10.33
Frequency of occurrence in beta-bends                                    166      92.40           90.41    10.69
Distinct character in hydrophobicity of the amino acid composition       193      91.93           89.83    9.88
Weights for coil at the window position of −2                            288      91.81           89.68    11.78
NMR chemical shift of the alpha-carbon                                   84       91.76           89.68    13.34
AA composition of EXT2 of single-spanning proteins                       205      91.72           89.64    11.69
Relative mutability                                                      65       91.60           89.50    10.52
Protein surface amino acid compositions                                  471      91.51           89.39    11.39
Hydrophobic packing and spatial arrangement of amino acids               247      90.97           88.90    9.89
Proportion of residues 95% buried                                        35       90.42           88.26    9.70
Normalized van der Waals volume                                          80       90.29           87.96    10.60
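The property indices in Table 10 come from the AAindex database [27], and encoding a sequence amounts to replacing each residue with its tabulated value. A minimal sketch using the Fauchère–Pliska hydrophobicity scale [43] follows; the numeric values are as commonly tabulated in AAindex entry FAUJ830101 and should be verified against the database, and the helper name `encode_sequence` is illustrative.

```r
# Hedged sketch: numeric encoding of an amino acid sequence with one AAindex
# property (Fauchere-Pliska hydrophobicity, AAindex entry FAUJ830101; verify
# values against the database [27] before use).
hydro <- c(A =  0.31, R = -1.01, N = -0.60, D = -0.77, C =  1.54,
           Q = -0.22, E = -0.64, G =  0.00, H =  0.13, I =  1.80,
           L =  1.70, K = -0.99, M =  1.23, F =  1.79, P =  0.72,
           S = -0.04, T =  0.26, W =  2.25, Y =  0.96, V =  1.22)

encode_sequence <- function(seq, index = hydro, gap = 0) {
  aa <- strsplit(seq, "")[[1]]
  vals <- unname(index[aa])
  vals[is.na(vals)] <- gap   # alignment gaps ('-') and unknown symbols map to 0
  vals
}

encode_sequence("MK-LV")  # e.g., 1.23 -0.99 0.00 1.70 1.22
```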
Table 11. Highest accuracy results of the RBM using the amino acid property two-index combinations from Table 10.

Index      Accuracy (%)    MCC      BER
65-205     93.91           92.34    8.28
35-65      93.53           91.88    9.12
84-193     93.45           91.77    6.96
247-166    93.11           91.32    9.52
84-205     93.03           91.28    11.55
Table 12. Highest accuracy results of the RBM using the amino acid property three-index combinations from Table 10.

Index          Accuracy (%)    MCC      BER
485-247-193    94.08           92.67    5.18
247-80-166     92.82           91.11    9.48
485-65-166     92.73           90.94    11.16
247-166-471    92.69           90.86    8.18
35-80-471      91.64           89.73    7.87
Table 13. Comparison of RBM results with conventional classification methods using one-, two- and three-index combinations of amino acid properties.

                     SVM                      DT                       k-NN                     RBM
Index combination    Accu     MCC     BER     Accu     MCC     BER     Accu     MCC     BER     Accu     MCC     BER
531                  90.99    87.80   11.77   87.16    82.07   15.79   89.41    86.25   11.43   92.86    91.12   7.71
65-205               91.24    88.04   11.73   88.57    83.78   13.65   90.49    87.71   10.30   93.91    92.34   8.28
485-247-193          90.74    87.34   12.96   88.32    83.78   12.97   90.33    87.65   10.33   94.08    92.67   5.18
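The MCC values reported throughout Tables 10–13 follow Gorodkin's multiclass generalization of Matthews' correlation coefficient [51] (scaled by 100 in the tables), and BER is the mean per-class error rate. A minimal sketch computing both from a confusion matrix (rows = true subfamilies, columns = predictions); the function names and the commented example with `y_true` and `y_pred` are hypothetical:

```r
# Multiclass MCC (Gorodkin [51]) and balanced error rate from a confusion
# matrix cm of true-by-predicted counts.
multiclass_mcc <- function(cm) {
  s <- sum(cm); correct <- sum(diag(cm))  # total samples, correct predictions
  t <- rowSums(cm); p <- colSums(cm)      # per-class true and predicted counts
  num <- correct * s - sum(p * t)
  den <- sqrt((s^2 - sum(p^2)) * (s^2 - sum(t^2)))
  if (den == 0) 0 else num / den
}

balanced_error_rate <- function(cm) {
  mean(1 - diag(cm) / rowSums(cm))        # average of per-class error rates
}

# Example usage with hypothetical label vectors for the seven subfamilies:
# cm <- table(true = y_true, predicted = y_pred)
# 100 * multiclass_mcc(cm); 100 * balanced_error_rate(cm)
```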
