*3.1. Datasets*

The benchmark dataset of Plasmodium mitochondrial protein sequences was obtained from [5]. The total number of proteins in dataset is 175, which contains 40 positive (mitochondrial proteins) and 135 negative (non-mitochondrial proteins) samples. We represent this dataset by Plasmodium falciparum 175 (PF175). The second raw dataset (PF2095) was obtained from the universal protein resource (Uniprot) which contains 890 positive samples and 1205 negative samples.

Similarly, a third dataset was downloaded from [34] which contained 499 positive samples. In this paper, we denote this dataset by (MPD) as shown in Table 2. Actually, 2833 proteins were obtained from the protein databank site known as Swiss-Prot by searching the keyword mitochondrial. Afterward, those proteins were then excluded with ambiguous words, such as SIMILARITY, POTENTIAL, or PROBABLE and FRAGMENTS. Furthermore, 681 proteins were collected belonging to locations other than mitochondrial site.


**Table 2.** Plasmodium falciparum mitochondria protein datasets.

By applying the preprocessing and eliminating the ambiguous data, we selected 250 proteins as non-mitochondrial.
