**1. Introduction**

Plasmodium falciparum are a unicellular protozoan organisms and toxic species that cause malaria in humans. It degrades hemoglobin in the acidic environment provided by the food vacuole [1]. When a female anopheles mosquito attacks human, malaria infection begins in the form of sporozoites into the bloodstream and its life cycle adopts many different stages [2]. These sporozoites are then quickly passed into the human liver where they upsurge exponentially into their cells to form merozoites. Merozoites attack red blood cells (erythrocytes) and again multiply until the cells burst to become trophozoites, schizonts, and gametocytes, during the last three stages of the anopheles life cycle as shown in Figure 1.

**Figure 1.** When a mosquito bites a human (host for malarial parasite) it causes infection by injecting sporozoites into the body, where it adversely affects hepatocyte's shape. Sporozoites grow rapidly in hepatocytes to become merozoites, while merozoites grow rapidly causing hepatocytes to burst and infect neighboring hepatocytes. When a mosquito bites the malaria patient the gametocytes that are produced from merozoites are taken by a mosquito. For the next 10 to 14 days the gametocytes produce sporozoites which are transferred to the saliva gland waiting for a mosquito to bite a healthy person and cause infection.

In eukaryotic cells, a mitochondrion is a membrane-bound organelle found in the cytoplasm which acts as a powerhouse of the cell responsible for cellular respiration and the production of adenosine triphosphate [3]. The inner membrane of cytoplasm comprises of different proteins such as enzymes, which are required for biochemical reactions. The mitochondrion has its own DNA (deoxyribonucleic acid) and ribosomes, which are 70 percent as that of prokaryote cells. In a cell, the mitochondrion is one of the important organelles which controls cellular metabolism and produces energy. Biologists have revealed that there are no significant similarities between mitochondrial proteins and human homologs [4].

Considering the constructive role of mitochondrial protein sequences in bioinformatics, proteomics, and cellular biology, many researcher's interest has been redirected to identify these biological sequences, but still it has been a challenging problem for them. With the invention of modern sequencing technologies, the number of these proteins has increased with rapid acceleration in the protein databanks. In 1990, only 3939 protein sequences were reported in the Uniprot database. According to the recent release statistics of protein databank, this number reached 550,000 in 2019 (11 December) [5].

There are two main approaches followed by researchers in the protein sequence prediction domain. The first category include machine learning-based approaches, where they employ features extractions methods in order to extract various patterns from biological sequences. Most of the existing literatures followed this approach. The second one is deep learning-based approach which extracts deep features and contextual information from proteins, which improves the prediction accuracy significantly. We briefly discuss the related works for both the mentioned approaches.

#### *1.1. Machine Learning Approach towards Mitochondria Proteins Identification*

In past decades, numerous machine learning algorithms and computational biological techniques are proposed for the categorization of mitochondrial and non-mitochondrial proteins via complex sequences. Bender et al. [6] evaluated MPPF by principal component analysis, statistical methods, and supervised neural network. They developed a model PlasMit based on extracting new composition patterns from proteins which efficiently predicted the mitochondrial proteins. R Verma et al. [7] combined two feature descriptors i.e., split amino acid and position specific scoring metrics, in order to predict mitochondrial proteins accurately. Jia et al. [8] considered mitochondrial proteins as attractive targets for anti-malarial drugs, but manual identification of these proteins is a difficult and time-consuming task. Therefore, they used two proteins encoding approaches such as bi-profile Bayes and split amino acid composition in order to extract specific pattern features from the amino acid chain. Authors trained a support vector machine classifier on these statistical features for final prediction. Afridi et al. [9] proposed genetic programming and an ensemble approach based on the feature extraction method for mitochondrial protein classification. Ding et al. [10] analyzed the variance for the accurate prediction of mitochondrial proteins. They suggested that combining more and more features is not a reliable approach because it takes more time in execution and contained redundant values which degraded the performance of the model. Therefore, to reduce the dimensionality and select the optimal features, researchers in this article used analysis of variance (ANOVA). Due to the complexity of the Plasmodium falciparum genome, the prediction of MPPF is more difficult than other species. Chen et al. [11] proposed an n-peptide composition of reduced amino acid alphabet which is obtained from a structural alphabet named protein as a feature parameter; the increment of diversity was firstly proposed to predict mitochondrial proteins. For instance, Cai et al. [12] applied support vector machine (SVM) based algorithm to train a predictor on three types of feature descriptors technique including transition (T), composition (C), and distribution (D) for obtaining additional physicochemical properties of different amino acids. R Kumar et al. [13] proposed a two-level model named as SubMitoPred. In the first level, authors predicted mitochondrial proteins while in the second level, they forecasted the sub-classes of mitochondrial localization. The whole model was based on the combination of SVM and the Pfam information domain. For further improvement, C Savojardo et al. [14] developed the deep learning model DeepMito. They trained and tested the model on a new high-quality dataset. Furthermore, they also developed a webserver for predicting mitochondrial and sub-mitochondrial localization. DNA-binding proteins (DBP's) can be used for the regulation of transcription and gene expression along with the identification of particular nucleotides. Therefore, for accurate and precise predictions of DBP's, Waris et al. [15] used evolutionary profiles position-specific scoring matrix for sequence encoding and a support vector machine for classification. Most of the available drugs are prepared to target the membrane proteins. Discriminating these proteins via computer vision techniques is an effective and timesaving as well. Therefore, Hayat et al. [16] through efforts of a computational method, accurately predicted the membrane proteins via different machine learning algorithms.

#### *1.2. Deep Learning Approach towards Mitochondria Proteins Identification*

Some researchers have utilized the full benefits of deep learning techniques and employed them to predict different types of protein sequences [17–19]. Delong et al. [20] attempted for the first time, to generate the original idea in deep learning for the discrimination of DNA binding proteins and non-DNA binding proteins. Qinhu et al. [21] improved the inherent weak supervision biological information prediction by proposing a new procedure established on CNN features with sequence-based learning to classify the DNA binding proteins. Recently, for sequencing learning, Qu et al. [17] applied a sequence learning network known as a recurrent neural network (RNN) with CNNs to predict those proteins which are attached to DNA. Compared to the traditional methods, deep learning techniques enhance the flexibility of extracted optimal features from sequences. It is not only the selection of a large number of proteins that are made possible for model training, but the process of speedy and accurate prediction is also enhanced. From earlier scholar's work, in addition to protein features, contextual information has also been observed as a valuable feature [22]. With an inspiration of this concept, if an amino acid sequence also suppresses contextual features, then it might increase the prediction score.

In this article, a CNN model is proposed by employing MBD-LSTM for Plasmodium mitochondria protein identification. In the first encoding layer, we assign an integer value to each amino acid in order to encode protein sequences and find the maximum length of the protein, and the next layer is the embedded layer where each word is converted to fix the length vector. Further, these matrices

are passed through the CNN model containing three convolutional layers followed by max pooling layers. Finally, these deep features are fed into the MBD-LSTM layer for sequence learning and the last dense layer generates the optimal output. The proposed novel framework makes the following main contributions for the identification of Plasmodium falciparum parasite mitochondrial proteins.


The remaining article is divided into three further sections; Section 2 briefly explains the proposed system. Results and discussion are explained in Section 3 and in Section 4 we present the conclusions and future directions.
