1. Introduction
Network-Intrusion-Detection Systems (NIDSs) play a pivotal role in safeguarding computer networks against a variety of attacks, ensuring the security of vital network infrastructure. These systems are typically classified into two main types. Misuse NIDSs employ signature-based techniques that identify established attack patterns, resulting in rapid Detection Rates (DRs) and minimal False Positive Rates (FPRs) [1]. This makes them well suited for real-time detection scenarios. Anomaly NIDSs, on the other hand, have a higher false positive rate but excel at identifying attacks by detecting unusual user behavior or patterns. They identify deviations from the normal network traffic profile and flag them as potential attacks, making them adept at identifying novel threats across networks [2].
This paper introduces a novel type of Anomaly NIDS based on encoding network traffic as Amino acid chains. Plain text is transformed into Amino acid representations, aiming for more-efficient pattern recognition within network transactions. The proposed method harnesses the bio-inspired operations and structural properties of Amino acid chains to enable the identification of suspicious network transactions and potential attack signatures. The hypothesis is that using Amino acid encoding and structural features in NIDSs is beneficial for the following reasons:
Unique properties: Amino acids have special qualities, and their sequences can help us find complex patterns and connections in data.
Compact and efficient: Amino acid sequences are a space-efficient way to represent data; unlike traditional numbers or binary codes, they provide a concise representation.
Long-lived and secure: By making our Amino acid sequences similar to natural ones, we can even turn them into DNA sequences, exploiting DNA's resilience against environmental changes and secure storage options. This future direction opens up the possibility of leveraging the remarkable data-storage capabilities of natural DNA.
Based on these hypotheses, we examined whether the structural features of Amino acid sequences for encoded network transactions can accurately identify potential attacks. The proposed method outperformed other methods from the literature, achieving 99% accuracy on the same network transaction dataset. We compared our work to similar approaches from the literature (as discussed in Section 2) with the aim of addressing their shortcomings and obtaining better results.
The suggested method involves constructing a database of Amino acid signatures for well-known attacks. There are 20 standard Amino acids (Figure 1) in nature. Each Amino acid has a unique chemical structure that gives it specific properties [3]. The Amino acid signatures are encoded instances of network transactions, produced by mapping the ASCII codes of feature values to standard Amino acid labels through a Vigesimal (base-20) numbering system. A Neural Network model was then created and trained on our Amino acid database, learning common patterns akin to attack signatures. This was achieved by capitalizing on similarities in Amino acid sequence structures and functional domain areas to cluster similar attacks. The end result is a trained model capable of identifying suspicious network transactions in real time with minimal processing time and an acceptable false positive rate. The construction of this attack classification model encompasses the following steps:
Pre-process network transactions to prepare them for Amino acid mapping.
Map the processed data into Amino acids.
Generate structural properties for each Amino acid signature.
Train the deep classifier model using the training data subset.
Validate the correctness and accuracy of the model by applying it to the testing data subset.
The rest of the paper is organized as follows:
Section 2 delves into the related work, offering insights into prior research in the field.
Section 3 is dedicated to introducing the foundation of our study. It begins by presenting the source dataset (
Section 3.1) and describing the data preprocessing steps in
Section 3.2, followed by an exploration of Amino acid mapping (
Section 3.3). Additionally, the structural properties of Amino acid sequences are discussed in
Section 3.4.
The presented research methodology further encompasses the exploration of Neural Networks (
Section 3.5) and a thorough explanation of the deep learning model utilized (
Section 3.5.1). Furthermore, insights into the hyperparameters of the designed deep learning model are provided in
Section 3.5.2.
Section 4 showcases the outcomes of the experiments performed, while
Section 5 offers a comprehensive discussion of the results. In conclusion,
Section 6 summarizes the findings and outlines potential avenues for future research.
2. Related Work
Several research studies have proposed diverse Deoxyribonucleic-Acid (DNA)-encoding techniques, aiming to address various challenges in digital computing and network domains, particularly in the context of network intrusion detection. The genetic information within DNA is structured as a code formed by four chemical bases: adenine, cytosine, guanine, and thymine, denoted as A, C, G, and T. DNA encoding involves the conversion of regular text into a DNA sequence, which can further be translated into a representation of Amino acids.
Suyehira [
4] proposed an encoding and decoding algorithm tailored for a DNA-based data-storage system. The proposed encoding method offers a means to convert binary data into sequences that represent DNA strands. Importantly, this method takes into consideration the inherent biological limitations. The researcher introduced a mapping framework and translation process that translates hexadecimal data into codons, factoring in biological intricacies such as the exclusion of start codons, the prevention of repetitive nucleotides, and the avoidance of lengthy repetitive sequences. In this schema, each hexadecimal value is translated into a codon, which is a sequence of three nucleotides that encodes a specific Amino acid. This approach is inspired by nature, mirroring the way Ribonucleic Acid (RNA) strands are parsed by ribosomes in groups of three nucleotides. The use of hexadecimal characters grants the algorithm the versatility to encode various data forms, leveraging their binary representation. The researcher hypothesized that each hexadecimal character can be flexibly associated with one of several possible codon choices (chosen at random); this randomized assignment is a distinctive technique within the proposed encoding algorithm. In cases where the algorithm encounters difficulty in finding a valid codon to continue a sequence, a technique called backtracking is suggested, which reassigns a new codon for the previous hexadecimal character.
What sets this research apart is its novel DNA-encoding approach, characterized by two significant modifications. First, it employs hexadecimal representation instead of text. Second, the introduced mapping scheme takes into account biological constraints, which ultimately aids in the subsequent translation process, guiding the conversion of codons into Amino acids.
Rashid [
5] introduced a method aimed at encoding various network packet attributes with utmost efficiency, using the fewest-possible characters. This method employs a four-character DNA-encoding strategy, referred to as DEM4all, which comprehensively represents all 41 attributes. The DEM4all-encoding technique utilizes four randomly chosen DNA characters to correspond to the complete spectrum of 96 potential values for network traffic attributes. The selection of four characters proves optimal for handling and symbolizing the full range of conceivable values. Importantly, due to its random nature, every execution generates distinct DNA sequences to depict each value. The choice of these four letters is strategic, as they have been carefully selected to encompass the widest range of attribute values. It is worth noting that the DNA representation is generated randomly during each execution, without drawing from any biological basis.
A new DNA encoding for misuse IDS based on the UNSW-NB15 dataset was proposed in [
6]. The proposed system is built by constructing a DNA encoding for all values of the 49 attributes. Then, attack keys (based on attack signatures) are extracted, and finally, the Raita algorithm is applied to classify records as either attack or normal based on the extracted keys. The Raita algorithm, published by Timo Raita in 1991, is a string-searching algorithm that improves the performance of the Boyer–Moore–Horspool algorithm. For nominal attributes with 151 unique values (corresponding to the total number of values of the protocol, service, and state attributes), four DNA characters are used, which can handle all these values. For numerical attributes with 11 values (the digits 0 to 9 and a fraction point), two DNA characters are used to represent each digit separately. This paper used random DNA representations, with no biological basis, to encode the network packet attributes.
Cho et al. [
7] delved into the potential adaptation and enhancement of a sequence-alignment technique prevalent in bioinformatics. The authors presented an advanced rendition of the Needleman–Wunsch algorithm, a tool for global sequence alignment, tailored to identify intrusions within the IoT landscape. This novel method incorporates a positional Weight Matrix to bolster its capabilities. A Weight Matrix is a strategy used to search for patterns within biological sequences; the assigned weights correspond to the frequency of occurrence of specific sequence elements at distinct positions.
Rashid et al. [
8] introduced two distinct DNA-encoding techniques known as DEM3sel and DEMdif, each possessing unique characteristics pertaining to the length of the DNA sequence and the manner in which network traffic is represented. Specifically, DEM3sel employs three characters to signify all 41 network attributes, while employing a single predetermined character to differentiate between nominal and numerical attributes. In contrast, DEMdif adopts diverse characters to represent network attributes based on their values and likewise utilizes a solitary pre-established character to distinguish between nominal and numerical attributes.
The study made use of the KDDCup’99 and NSL-KDD datasets, both of which are computer network traffic datasets. The attributes of the network traffic records were categorized into two groups: Nominal attributes (Protocol, Services, Flag) and Numerical attributes. Although both methods encompass fixed and dynamic elements, it is important to note that the choice of DNA representation is made randomly and is not guided by any biological reference.
Rashid et al. [
9] introduced a hybrid Network-Intrusion-Detection System (NIDS) that harnesses both DNA encoding and clustering techniques. In their proposed DNA encoding scheme, attributes from the UNSW-NB15 database (comprising network traffic records) are categorized into four groups: State, Protocol, Service, and Digits for the remaining attributes. The dataset encompasses various types of attacks and regular activities, with each entry comprising 49 features. These attributes are subsequently translated into DNA sequences via DNA encoding. Notably, each protocol attribute value is represented using a combination of four DNA elements. In contrast, the values of the State, Service, and Digit attributes are encoded using two DNA characters each. Following this encoding step, a clustering approach is employed to classify records into either attack or normal clusters. It is important to highlight that the chosen DNA sequences for attribute values exhibit varying sizes based on volume considerations and are tailored specifically without invoking biological references.
Cevallos et al. [
10] elucidated the steps involved in utilizing DNA as a storage medium, particularly focusing on the process of encoding digital information into DNA sequences. This process entails the conversion of a text file into a DNA sequence. The transformation happens in three distinct stages. Initially, each character in the text file is substituted with its corresponding ASCII value, subsequently represented in binary format. Finally, every two bits of the binary representation are further replaced by the corresponding DNA bases: A = 00, C = 01, G = 10, and T = 11. Furthermore, this study delved into the intricate relationship of nucleotide bonds within a biological context, providing a comprehensive mapping table that facilitates the encoding of Amino acids via binary representation.
In recent years, the field of bioinformatics has witnessed a shift towards utilizing Amino acid mapping in text encoding, departing from the more-traditional approach of DNA mapping [
11]. Several key advantages have emerged from this shift, demonstrating the effectiveness and versatility of Amino acid mapping in various applications.
Some advantages of Amino acid encoding over DNA encoding are as follows:
Increased information density: Amino acid mapping encodes text sequences with greater information density compared to DNA mapping. This is due to the fact that Amino acids achieve a more-compact representation than DNA encodings [
12].
Enhanced similarity measures: Amino acid mapping often results in improved similarity measures for text sequences. By preserving functional and structural similarities, it enables more-accurate comparisons between sequences, which is valuable in tasks such as sequence alignment and similarity searching [
13].
Robustness to mutations: Amino acid mapping is more robust to minor variations or mutations in text sequences. DNA mapping, on the other hand, can be highly sensitive to single-character changes, making Amino acid mapping a preferred choice in scenarios where text data may be noisy or subject to minor alterations [14]. For example, proline is represented by four different codons (CCU, CCC, CCA, and CCG), so many single-nucleotide changes leave the encoded Amino acid unchanged.
Compatibility with Amino acid sequence analysis tools: Amino acid mapping facilitates seamless integration with a wide range of bioinformatics tools designed for biological analysis. This compatibility enables researchers to leverage existing resources and tools in their text analysis tasks [
15].
Support for structural information: Amino acid mapping can be particularly advantageous in applications since Amino acids are more relevant to biological processes than DNA nucleotides [16]. DNA nucleotides are the building blocks of genes, but genes do not directly determine the structure and function of Amino acid sequences; the sequence of Amino acids itself determines its structure and function. This is the main reason to use Amino acid mapping in our research.
Reduced sequence length: Amino acid mapping often results in shorter encoded sequences compared to DNA mapping. This reduction in sequence length can lead to computational efficiency improvements in various text-processing tasks besides a reduction in the data storage needed [
17].
Alignment flexibility: Amino acid mapping allows for more flexible sequence alignments. It accommodates gapped alignments, which are essential in scenarios where text sequences may have insertions or deletions [
18].
These advantages collectively demonstrate the utility of Amino acid mapping in handling text sequences, making it a valuable choice in various bioinformatics and text analysis applications compared to DNA mapping. Those advantages have not yet been deployed for NIDSs, where DNA encoding was predominantly used. We, therefore, examine the use of Amino acid encoding for NIDSs in this paper.
3. Materials and Methods
The proposed Network-Intrusion-Detection System (NIDS) is divided into two phases: (1) training and validation and (2) production. Its implementation consists of four main steps: data preparation, feature generation, machine learning, and model evaluation and production.
Figure 2 shows the Network-Intrusion-Detection System (NIDS) in more detail. Initially, we under-sampled the UNSW-NB15 dataset. The next step was to extract features from the network traffic records based on the generated Amino acid signatures. The extracted features were then used to train a machine learning model. Once the machine learning model was trained, it was evaluated on a test dataset. The evaluation results (evaluation not shown in the diagram) were used to determine the accuracy of the model and to identify any areas where the model can be improved. The final step was to deploy the NIDS in the network.
3.1. Dataset
To devise an efficient technique for detecting network intrusions, it is crucial to have access to comprehensive and contemporary network flow datasets encompassing both regular and anomalous network traffic instances. These datasets should accurately represent real-world scenarios, focusing on pertinent attributes, to enable effective detection of network attacks. Such datasets should encompass diverse instances of attacks and intrusions encountered in network operations. However, a gap persists in the availability of authentic, up-to-date datasets that encompass the full spectrum of network traffic patterns associated with modern technology.
Traditional benchmark datasets like KDDCup’99 [
19] and NSL-KDD [
20] have been extensively utilized to assess the accuracy of network attack detection. However, these datasets have become outdated due to the swift evolution of network technologies and the emergence of novel cyber security threats and types of network attacks.
A network intrusion dataset comparison by Robertas et al. [
21] showed that UNSW-NB15 [
22] is one of the most-suitable datasets for our experiments. It is one of the most-recent datasets and has a large number of records with a variety of attack types compared to other datasets (
Table 1). UNSW-NB15 originates from the labs of the University of New South Wales (UNSW) Canberra, Australia. This dataset was generated utilizing the IXIA PerfectStorm tool within a confined network environment featuring only 45 unique IP addresses. The data span a concise period of 31 h and encompass a blend of both genuine typical activities and simulated attack behaviors, resulting in 175,341 training records and 82,332 testing records. The IXIA tool simulated nine diverse attack types. The dataset comprises 49 attributes available for analysis, encompassing fundamental features, content-related attributes (derived from packet content), time-based attributes (derived from time-related packet flow characteristics), and additional generated attributes based on statistical connection characteristics.
The choice of using classical datasets is primarily driven by their established reputation and widespread acceptance in the cybersecurity community. These datasets have been extensively used and validated in numerous studies, providing a benchmark for the comparison and evaluation of methods in this field. They continue to be relevant due to the fundamental and timeless nature of the network behaviors they capture, providing a diverse range of network traffic scenarios, including various types of attacks that are still prevalent today. Moreover, the encoding of network data into Amino acid sequences is a novel approach; applying this technique to these classical datasets allows new methods to demonstrate their effectiveness and versatility across different contexts and conditions. Using these well-known datasets (including UNSW-NB15) enables other researchers to replicate our experiments and validate the produced findings easily, promoting transparency and reproducibility in this research. We acknowledge that newer datasets could provide additional insights because they reflect recent attack techniques. However, for the scope of this paper, we believe that the chosen dataset sufficiently serves our research objectives.
3.2. Data Preprocessing
We employed the UNSW-NB15 DB dataset, which underwent a thorough examination by Moustafa [
22], as our labeled source for network transactions. Our particular focus rested on the disparity between attacks and normal instances. In this dataset, attacks constitute merely 13% of the records, while the remaining 87% are labeled as normal. This imbalance poses a challenge to the classifier model’s ability to generalize effectively during the training phase, particularly if we were to use the complete dataset. Furthermore, this imbalance could introduce a bias towards normal labels, consequently affecting the model’s predictive performance on unseen data.
First, we removed duplicate records, i.e., those having the same sequence, ending up with nearly 100,000 attack-labeled records. Then, we under-sampled the normal transactions to a 50/50 ratio with the attack labels, producing nearly 200,000 records in total.
Under-sampling is a technique that reduces the size of the majority class in an imbalanced dataset by randomly removing some of its examples. This can help to balance the class distribution and make the learning algorithm more sensitive to the minority class [
23,
24,
25].
The major advantages and disadvantages of under-sampling are as follows:
Reduced computational complexity: Under-sampling can lead to reduced computational complexity, as it reduces the overall size of the dataset. Smaller datasets can be processed more efficiently, potentially resulting in faster model training and evaluation [
26,
27].
Improved model performance: Balancing the class distribution through under-sampling can lead to improved model performance, as it prevents the model from being biased towards the majority class. This can result in better classification accuracy and a more reliable evaluation of model performance [
24,
28].
Reduced over-fitting: Under-sampling can help reduce over-fitting (failure to generalize the model) in machine learning models. When a dataset is imbalanced, models may over-fit the majority class. Under-sampling the majority class helps prevent this issue, leading to models that generalize better to unseen data [
29].
Faster training: With a balanced dataset obtained through under-sampling, models often converge faster during training. This is because each class contributes an equal number of samples, allowing the model to learn from each class more effectively [
30].
Loss of information: Under-sampling can lead to a significant loss of information, as it involves removing instances from the dataset. This could potentially exclude important or relevant instances, leading to a model that is less accurate or representative.
Increased bias: While under-sampling can help address issues related to class imbalance, it can also introduce its own biases. Removing instances from the majority class could potentially bias the model towards the minority class. This could result in a model that performs poorly on new instances of the majority class.
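The random under-sampling step described above can be sketched as follows; the record format (a list of feature/label pairs) is a hypothetical stand-in for the actual dataset representation:

```python
import random

# Minimal random under-sampling sketch; `records` is a list of
# (features, label) pairs with label 1 = attack and 0 = normal.
def undersample(records, seed=42):
    attacks = [r for r in records if r[1] == 1]
    normals = [r for r in records if r[1] == 0]
    rng = random.Random(seed)
    # Keep all attack records; draw an equal number of normal records
    # at random to reach a 50/50 class ratio.
    normals_down = rng.sample(normals, k=len(attacks))
    balanced = attacks + normals_down
    rng.shuffle(balanced)
    return balanced

demo = [((i,), 1) for i in range(3)] + [((i,), 0) for i in range(10)]
balanced = undersample(demo)
print(len(balanced))   # 6 records: 3 attacks, 3 normals
```

Fixing the random seed makes the removal reproducible, which matters when reporting results on the reduced dataset.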
To evaluate generalization and avoid model over-fitting, we split our modeling dataset (after under-sampling) into training and testing samples. The training part is used for model training and the (unseen) testing part to evaluate model performance. We split the dataset using the Holdout method, dividing it into two partitions: 80% for training and 20% for testing. We used K-fold Cross-Validation during the training phase of our classifier model to avoid over-fitting. Finally, we validated the correctness and accuracy of the model on the testing data by comparing the labels generated by the model's prediction method against the original ones.
As part of the data preprocessing, we standardized the dataset by centering each numerical value according to the mean and standard deviation of each feature column.
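The holdout split and standardization steps can be sketched as follows; this is a pure-Python stand-in for the usual library utilities (e.g., scikit-learn's train_test_split and StandardScaler), shown only to make the two operations concrete:

```python
import random
import statistics

# 80/20 holdout split: shuffle once, then cut the row list.
def holdout_split(rows, test_frac=0.2, seed=42):
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

# Standardize one numerical feature column: center by its mean and
# scale by its standard deviation.
def standardize(column):
    mean = statistics.mean(column)
    std = statistics.pstdev(column) or 1.0   # guard against zero variance
    return [(x - mean) / std for x in column]

rows = list(range(100))
train, test = holdout_split(rows)
print(len(train), len(test))        # 80 20
z = standardize([2.0, 4.0, 6.0])    # centered column, mean 0
```

In practice, the mean and standard deviation must be computed on the training partition only and then applied to the testing partition, to avoid leaking test information into training.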
3.3. Amino Acid Mapping
Since our target domain had 20 Amino acids (as in
Table 2), we chose to use the Vigesimal numbering system as the intermediate mapping between ASCII representations of feature values and Amino acids. The Vigesimal system, also known as base-20, is a numeral system where the base is 20. This means that the system uses 20 distinct symbols (digits and letters) to represent numbers. In the Vigesimal system, the first 20 digits are represented as follows:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, G, H, I, J
In this system, after reaching digit 9, the symbol “A” represents the number 10, “B” represents 11, and so on, up to “J”, which represents the number 19. The number 20 is represented as “10” in Vigesimal, where “1” stands for 1 times 20 and “0” denotes 0 additional units.
As an example, the following counts from 1 to 31 in the Vigesimal system:
1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, G, H, I, J, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1A, 1B
Table 2 shows the vigesimal values corresponding to each Amino acid.
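The base-20 conversion described above can be sketched as repeated division by 20, collecting the remainder digits:

```python
# Vigesimal (base-20) conversion using the digit set described above.
DIGITS = "0123456789ABCDEFGHIJ"

def to_vigesimal(n: int) -> str:
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, 20)      # peel off the least significant base-20 digit
        out.append(DIGITS[r])
    return "".join(reversed(out))

print(to_vigesimal(10))   # 'A'
print(to_vigesimal(20))   # '10'  (1 x 20 + 0)
print(to_vigesimal(30))   # '1A'  (1 x 20 + 10)
```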
The steps in the encoding process are as follows:
Replace missing values with NA;
Capitalize all letters (e.g., "tcp" → "TCP", "dns" → "DNS");
Convert each literal into its ASCII value (e.g., "A" → 65);
Convert the ASCII value into its assigned Amino acid using the modulo 20 operation.
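The encoding steps can be sketched as follows. Note that the actual index-to-Amino-acid assignment follows Table 2; the alphabetical one-letter ordering used below is an illustrative placeholder only:

```python
# Placeholder alphabet: 20 one-letter Amino acid codes in alphabetical
# order (the paper's real assignment is given in Table 2).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_value(value) -> str:
    # Steps 1-2: replace missing values with "NA" and capitalize.
    text = "NA" if value is None else str(value).upper()
    # Steps 3-4: ASCII value, then modulo 20 picks an Amino acid label.
    return "".join(AMINO_ACIDS[ord(ch) % 20] for ch in text)

def row_signature(values) -> str:
    # Append the Amino acids of every column into one row signature.
    return "".join(encode_value(v) for v in values)

print(encode_value("tcp"))
print(row_signature(["tcp", "dns"]))
```

With this placeholder alphabet, "tcp" becomes "FIA" (T=84→4, C=67→7, P=80→0); the real sequences differ only by the Table 2 assignment.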
In Table 3, examples of the process are given. Each step showcases the transformation of network-related data into an Amino-acid-based representation, facilitating comprehension of the encoding process. Finally, all Amino acids resulting from one row of values are appended into one sequence that represents the row's Amino acid signature (Table 4).
Each row within
Table 4 showcases the distinctive signature produced by concatenating the encoded Amino acid representations of the respective columns. From those examples, it is evident that the encoding process encapsulates the underlying patterns and characteristics of the input data in a condensed manner. This table, thus, exemplifies how the representation of data using Amino acids can yield succinct, yet information-rich signatures, enabling efficient and effective pattern recognition within network transactions. These artificial Amino acid sequences, while valuable for testing and research purposes, often differ from their natural counterparts. In
Table 5, we delve into the contrasts between these two types, exploring how our artificially generated Amino acid sequences differ from the intricate structures and functions of natural sequences.
Encoding network transactions into Amino acid sequences can greatly impact both the efficiency and effectiveness of analytical processes. Amino acids possess unique properties, enabling them to capture intricate patterns and relationships within data. They offer a highly compact and efficient way to represent data: unlike traditional numerical values or binary encoding, which can be verbose and require significant storage space, Amino acid sequences provide a concise representation. By adjusting the encoding method that produces artificial Amino acid sequences, we can create sequences with structures and properties similar to natural Amino acids. Furthermore, such sequences can be translated into DNA sequences, leveraging natural DNA's exceptional data-storage capabilities, which also offer robust data security.
3.4. Feature Transformation: Structural Properties of Amino Acid Sequences
Most network transactional datasets have a large number of features to represent network and packet status [
31]. Since we have transformed the network data into an Amino acid sequence, we now use the Amino acid’s characteristics to describe the network data. In biology, different characteristics related to Amino acid structure are important in understanding their shape, function, and behavior. The structural properties of Amino acids are used as an analytical tool to understand how Amino acid sequences behave. Amino acids have a variety of features, including size, shape, and charge. These features affect how Amino acids interact with each other and how they fit together to change their structure. Some Amino acids are larger in shape than others, which affects how they fit together in a sequence. Amino acids also have different shapes such as curved, straight, or twisted. In addition, the charge of each Amino acid may be positive, negative, or neutral. Considering all these properties of Amino acids would give each sequencea unique identity to categorize it as attack or benign.
We can use these unique traits of Amino acids to identify potential issues or threats in network traffic. For example, if the Amino acid sequence of a network transaction has many twists (a high turn value), it might be a sign of unusual network activity. Similarly, if the sequence has a specific Amino acid pattern (a distinctive strand value), it could be a sign of a known attack with a similar, known sequence pattern. For each sequence of Amino acids, we calculated ten numerical structural properties that reveal different aspects of the Amino acid sequence's structure and behavior. The structural properties of Amino acids reflect how the sequence folds, interacts, and performs its biological role. Using these properties is part of this research's contribution, since they transform all network transactional features, regardless of their count and data type, into numerical features while preserving the original data relationships. More information on how these features are calculated can be found in the ssbio documentation [32]. The structural properties used were as follows:
turn: A turn is like a U-turn in the Amino acid sequence structure that affects the sequence's functionality. The turn value is the percentage of Amino acids able to form that type of turn. It is calculated as:

$\text{turn} = \frac{t}{n} \times 100$

where:
t is the count of turn-forming Amino acids (N, P, G, S) in the sequence.
n is the total number of Amino acids in the sequence.
strand: A strand is a pattern of the Amino acid sequence that is arranged in a flat, extended shape, almost like a ruler. The strand value is the percentage of Amino acids able to form that type of strand. It is calculated as:

$\text{strand} = \frac{s}{n} \times 100$

where:
s is the count of strand-forming Amino acids (E, M, A, L) in the sequence.
n is the total number of Amino acids in the sequence.
Molecular weight: The molecular weight of the Amino acid sequence offers valuable clues about its dimensions and composition. The molecular weight has an impact on physical attributes and interactions. It is calculated as:

$MW = \sum_{i=1}^{n} w_{a_i} - (n - 1) \times 18.02$

where:
$w_{a_i}$ is the molecular weight of the ith Amino acid in the sequence.
18.02 is the molecular weight of the water molecule released when each peptide bond forms.
n is the total number of Amino acids in the sequence.
Aromaticity: Aromaticity indicates the occurrence of aromatic Amino acids in the sequence. These aromatic building blocks play a role in maintaining Amino acid sequence stability and frequently engage in binding interactions. It is calculated as:

$\text{aromaticity} = \frac{q}{n}$

where:
q is the count of Amino acids in the sequence that are aromatic (the group of Amino acids that possess a characteristic ring-like structure like F, W, or Y).
n is the total number of Amino acids in the sequence.
Instability index: The instability index measures how prone an Amino acid sequence is to denaturation or clustering. A lower instability index indicates higher stability. It is calculated as:

$II = \frac{10}{n} \sum_{i=1}^{n-1} \text{instability\_index\_matrix}(a_i, a_{i+1})$

where:
n is the total number of Amino acids in the sequence.
$a_i$ represents the ith Amino acid in the sequence.
instability_index_matrix is the instability value associated with the dipeptide (two Amino acids linked together by a molecular bond) formed by Amino acids $a_i$ and $a_{i+1}$.
Isoelectric point: The isoelectric point corresponds to the pH where the Amino acid sequence holds no overall electrical charge. It is calculated as:

$pI = \frac{1}{f} \sum_{i=1}^{f} pK_{a_i}$

where:
f is the total number of ionizable groups in the sequence.
$pK_a = -\log_{10}(K_a)$.
$K_a$ is a measure of the acid strength of the Amino acid in a solution.
$pK_{a_i}$ is the pKa value of the ith ionizable group.
The ionizable group in an Amino acid sequence refers to specific functional groups within the sequence that can either accept or donate a proton (positive charge) depending on the pH (level of acidity) of the surrounding environment.
helix: The helical characteristic signifies the abundance of α-helical structural components within an Amino acid sequence. Its value is the percentage of Amino acids known to contribute to helical structures within the sequence; these Amino acids are "VIYFWL". It is calculated as:

$\text{helix} = \frac{m}{n} \times 100$

where:
m is the count of the Amino acids V, I, Y, F, W, and L in the sequence.
n is the total number of Amino acids in the sequence.
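The helix fraction reduces to counting the residues V, I, Y, F, W, and L; a minimal sketch with a made-up sequence:

```python
HELIX_RESIDUES = set("VIYFWL")  # residues that favor alpha-helix formation


def helix_fraction(seq):
    """Fraction m / n of helix-favoring residues in the sequence."""
    m = sum(1 for aa in seq if aa in HELIX_RESIDUES)
    return m / len(seq)


print(helix_fraction("VIYFWLAAAA"))  # 6 of 10 residues -> 0.6
```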
Reduced cysteines: This property evaluates the number of cysteine residues in the Amino acid sequence. It is also known as the molar extinction coefficient with reduced cysteines and is calculated as the weighted sum of the molar extinction coefficients of specific Amino acids in the sequence:

ε_reduced = Σ_i (p_i × ε_i)

where:
p_i is the percentage of the specified Amino acid in the sequence.
ε_i is the molar extinction coefficient of the specified Amino acid. It measures how strongly the Amino acid interacts with light due to its unique chemical features.
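A sketch of this weighted sum, using the commonly tabulated 280 nm coefficients for tryptophan (5500) and tyrosine (1490) and treating the weights as residue fractions; this is an illustration, not the exact library computation:

```python
# Molar extinction coefficients at 280 nm (M^-1 cm^-1) for the two
# residues that dominate absorbance; reduced cysteines contribute ~0.
EXTINCTION = {"W": 5500.0, "Y": 1490.0}


def reduced_cysteines_coefficient(seq):
    """Fraction-weighted sum of per-residue extinction coefficients."""
    n = len(seq)
    return sum((seq.count(aa) / n) * eps for aa, eps in EXTINCTION.items())


print(reduced_cysteines_coefficient("WY"))  # 0.5*5500 + 0.5*1490 -> 3495.0
```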
Disulfide bridges: Disulfide bridges depict the chemical links established between pairs of cysteine residues, playing a role in stabilizing Amino acid sequence configurations, especially in extracellular surroundings. It is calculated as:

disulfide bridges = Σ_{i=1}^{N−1} Σ_{j=i+1}^{N} f(d_{ij})

where:
N is the total number of cysteine residues in the Amino acid sequence.
d_{ij} is the distance between cysteine residue i and cysteine residue j.
f(d) is a function that returns 1 if d is less than or equal to a threshold distance D (indicating that the cysteine residues are close enough to form a disulfide bond) and 0 otherwise.
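Counting disulfide bridges can be sketched as a thresholded pair count; the distances and the 2.05 Å threshold below are illustrative assumptions:

```python
def count_disulfide_bridges(distances, threshold=2.05):
    """Count cysteine pairs whose distance is within the bonding threshold.

    distances: dict mapping cysteine index pairs (i, j), with i < j, to the
    distance between the two residues. The 2.05 default is a typical S-S
    bond length in angstroms, used here purely for illustration.
    """
    f = lambda d: 1 if d <= threshold else 0  # indicator f(d) from the text
    return sum(f(d) for d in distances.values())


print(count_disulfide_bridges({(0, 3): 2.0, (0, 7): 8.4, (3, 7): 2.03}))
```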
GRAVY: The Grand Average of Hydropathy (GRAVY) index measures the general hydrophobic or hydrophilic character of an Amino acid sequence. It is calculated as:

GRAVY = (1 / n) × Σ_{i=1}^{n} h_i

where:
h_i is the hydropathy value of the ith Amino acid in the sequence.
n is the total number of Amino acids in the sequence.
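GRAVY is simply the mean per-residue hydropathy; a sketch using a subset of the Kyte-Doolittle hydropathy table (a full table covers all 20 residues):

```python
# Kyte-Doolittle hydropathy values for a subset of residues;
# positive = hydrophobic, negative = hydrophilic.
KD = {"A": 1.8, "I": 4.5, "L": 3.8, "V": 4.2, "G": -0.4, "K": -3.9, "R": -4.5}


def gravy(seq):
    """Mean hydropathy over the sequence: sum of h_i divided by n."""
    return sum(KD[aa] for aa in seq) / len(seq)


print(round(gravy("KRG"), 2))  # hydrophilic tripeptide -> -2.93 (negative)
```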
These ten structural characteristics offer a multi-dimensional view of an Amino acid sequence.
Table 6 shows an example of this. Therefore, we converted the Amino acid representation into features using those ten characteristics. The following shows the role of each structural property in specifying information about sequence shape:
β-turn and β-strand capture the secondary structure of the Amino acid sequence, which is determined by the hydrogen bonding patterns between the backbone atoms. These properties indicate how the sequence bends and twists into different shapes, such as loops, coils, and sheets.
Molecular weight and aromaticity capture the size and composition of the Amino acid sequence, which affect its physical attributes and interactions. These properties indicate how heavy and complex the sequence is and how likely it is to contain aromatic rings that can participate in binding interactions.
Instability index and isoelectric point capture the stability and charge of the Amino acid sequence, which influence its solubility and denaturation. These properties indicate how prone the sequence is to unfold or aggregate and at which pH value it becomes neutral of charge. The pH value is a measure of how acidic or alkaline a substance is.
α-helix and reduced cysteines reveal important details about how the Amino acid sequence is shaped. The helix fraction shows how often the sequence forms spiral-like structures, which influence the overall structure and stability of the Amino acid sequence.
Disulfide bridges and GRAVY capture the quaternary structure of the Amino acid sequence: more bridges can greatly influence the overall shape and stability, while GRAVY gives us clues about whether the sequence prefers to stay in water. Both shed light on how readily the Amino acid sequence reacts and changes its structure.
By using these 10 structural properties as the inputs for the Neural Network, we can represent the Amino acid sequence in a multi-dimensional space that captures its structural diversity and complexity. This allows us to train the Neural Network on the resulting structural property values and predict the class of any new Amino acid sequence based on its structural properties.
We used BioPython [
33] and ssbio [
34] to calculate those structural properties. BioPython is a widely used Python library for bioinformatics and computational biology. It provides tools for parsing, analyzing, and manipulating biological data, including DNA and Amino acid sequences. ssbio is a Python package that provides a collection of structural systems biology tools. In our research, we used BioPython and ssbio as biological data analysis tools to calculate the structural properties of Amino acid sequences. This is an advantage because:
BioPython and ssbio are designed with a user-friendly interface, making them accessible to researchers with programming skills, but limited expertise in biology. This ease of use can facilitate collaboration between computer scientists and biologists.
BioPython and ssbio offer a robust and well-documented set of functions tailored to biological data analysis.
BioPython and ssbio also have modules for parsing and manipulating sequence data in addition to sequence structural analysis. These modules can help convert biological sequences into numerical values, which can be used as inputs for Neural Networks.
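Assuming BioPython is installed, the structural properties above can be obtained from its ProtParam module; the sequence below is a made-up example, not a record from our dataset:

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

seq = "MKFWYALVIG"  # hypothetical amino acid sequence
pa = ProteinAnalysis(seq)

print(pa.molecular_weight())
print(pa.aromaticity())            # fraction of F, W, Y residues
print(pa.instability_index())
print(pa.isoelectric_point())
helix, turn, sheet = pa.secondary_structure_fraction()
reduced, cystines = pa.molar_extinction_coefficient()
print(pa.gravy())
```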
We can interpret the structural values in
Table 6 in the context of Amino acid sequence structure. A β-turn value of 0.0 suggests that the sequence is unlikely to form this type of turn, indicating the sequence may be more linear or may have other types of turns or structures. The absence of β-turns could potentially affect the overall structure and functionality that this sequence forms.
The value 0.05 for strand indicates that the Amino acid sequence is 5% likely to form a strand structure. In simpler terms, a strand is like a flat, straight ribbon in the structure of an Amino acid sequence. It is one of the ways that the chain of Amino acids can fold and twist to form a biological structure. The value 0.05 suggests that there is a small chance that the sequence will form this flat, straight ribbon-like structure. In other words, most of the time, the sequence might prefer to fold and twist in other ways.
The molecular weight of the Amino acid sequence is 10,703.86, referring to the total weight of all Amino acids in that sequence. This gives us an idea about how many Amino acids are in the sequence and what types they are. Regarding aromaticity, each Amino acid can have different characteristics. Some are aliphatic, while others are aromatic, meaning they have a special structure that makes them stand out. A value of 0.15 for aromaticity means that about 15% of the Amino acids in our sequence are these special aromatic ones. Aromatic Amino acids help maintain the stability of their sequence structure and often participate in interactions with other molecules.
An Amino acid sequence with a high instability index (like 0.60 in this case) is more prone to changes or denaturation (the disruption or alteration of the natural shape and structure of the Amino acid sequence), which means it can easily lose its shape and function under certain conditions. The isoelectric point specifies the pH value at which the sequence has an equal number of positive and negative charges, making the Amino acid sequence overall electrically neutral. This could potentially affect how it interacts with negatively charged molecules or surfaces.
For the α-helix, the value of 0.26 suggests that about 26% of the Amino acids in our sequence are likely to be part of a spring-like coil shape. This is a significant portion and indicates that the α-helical structure is a key feature of this sequence. These coiled regions help give the overall sequence its shape and can also play an important role in its function, such as providing sites for other molecules to bind, which increases the probability of conformational change in the sequence. Reduced cysteines refers to the number of cysteine residues in our Amino acid sequence that are free and not engaged in bonds. A high value like 5210 (as in our case) means a high possibility for those free residues to engage in new bonds with other sequences, resulting in a major change to the sequence structure.
For disulfide bridges, the value 5960 means a high number of bridges in the Amino acid sequence, which likely contribute to a highly stable and rigid structure, as each bridge helps to lock the structure into a specific shape. A negative GRAVY score, like −1.82 in our case, indicates that the sequence is highly hydrophilic, meaning it has a high probability of interacting with water molecules by which its structure can be changed dramatically.
We applied this process to the whole preprocessed UNSW-NB15 DB dataset, replacing each network traffic record with the Amino acid sequence as in
Table 4, the ten numeric features as in
Table 6, and the attack label, which designates whether the transactions are normal or classified as attacks. This compilation constitutes our updated dataset.
Table 7 shows the steps for how to transform the original network transaction values into structural features using the BioPython and ssbio libraries to analyze the Amino acid sequence of each network transaction row.
3.5. Neural Network Learning Model
We designed a Feed-Forward Neural Network with the following topology: 10 input nodes, 1 hidden layer with 10 nodes, and 1 output node. The input features are the structural properties of the Amino acid sequences (
Table 6). Training and evaluation for each network took nearly 4 h.
Table 8 shows the Neural Network parameters used. The activation function, chosen as “Sigmoid”, introduces non-linearity, crucial for capturing intricate patterns. Employing the “binary cross-entropy” loss function quantifies prediction errors, aiding the model’s optimization during training. With 100 Epochs, the network undergoes iterative learning on the dataset. The Batch Size of 5 improves the efficiency by updating parameters in smaller subsets. These choices collectively enhance the Neural Network’s ability to accurately classify network transactions.
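The topology and parameters described above can be sketched in Keras as follows; the Adam optimizer is our assumption for illustration, since the text specifies only the activation, loss, Epochs, and Batch Size:

```python
from tensorflow import keras

# 10 inputs (structural features) -> 10 sigmoid hidden nodes -> 1 sigmoid output
model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(10, activation="sigmoid"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",              # assumed; not stated in the text
              loss="binary_crossentropy",
              metrics=["accuracy"])

# model.fit(X_train, y_train, epochs=100, batch_size=5)
```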
Our Neural Network architecture can be depicted as the composition of the input, hidden, and output layers, along with the respective number of nodes in each layer with the directed connectivity (
Figure 3).
The Neural Network design with one hidden layer in
Figure 3 is a simple example of a Feed-Forward Neural Network. It can be used to solve a variety of problems, such as classification, regression, and clustering. The Neural Network is trained by adjusting the weights of the connections between the neurons. The weights are adjusted so that the network minimizes the error between its predictions and the labels in the training data.
Data standardization techniques impact the performance of the experimental model (
Table 9). Each method’s effectiveness is measured by its resulting mean accuracy and standard deviation. When training the Neural Network from
Figure 3, the first method, which does not involve any standardization, yielded a baseline accuracy of 55% with a standard deviation of 8%. In contrast, applying the “Standard Scaler” technique, which normalizes data by setting its mean to 0 and the standard deviation to 1, significantly improved the accuracy to 96.7% with a standard deviation of 6%. Similarly, the “Normal Scaler” approach, adjusting data to match the mean and standard deviation of its column values, achieved an accuracy of 96.9% with a smaller standard deviation of 1%. These results underscore the importance of data standardization in enhancing model accuracy and stability during the training process.
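For reference, the “Standard Scaler” step corresponds to scikit-learn’s StandardScaler; the toy feature matrix below is purely illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two columns on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # each column's mean is now ~0
print(X_scaled.std(axis=0))   # each column's standard deviation is now 1
```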
In order to make sure that a Neural Network is indeed the best choice for the problem [
35], and taking into consideration the challenge of an insufficient training dataset [
36], we examined a number of other machine learning techniques [
37]. Results for those are provided in
Supplementary Information (SI) Section S1.
3.5.1. Deep Learning Model
We added new hidden layers to our existing Neural Network to make it deeper and to enhance its data pattern recognition ability.
Table 10 shows the accuracies achieved with different numbers of hidden layers. For all Deep Neural Network experiments, we scaled the data using the Normal Scaler, since it showed the best accuracy among the Scalers in the previous Neural Network experiments.
Conducting multiple iterations of the same Neural Network while maintaining consistent input and output layers, as well as employing identical parameters (as shown in
Table 8), yet varying the configurations of the hidden layers, can yield diverse performance outcomes (
Table 10). “Deep Model 1” features a Neural Network structure with two hidden layers, containing 10 and 5 nodes, respectively. This configuration achieved an average accuracy of 97.3% with a standard deviation of 30%, demonstrating a competitive performance. In contrast, “Deep Model 2” employs a more-complex architecture with four hidden layers (10 × 15 × 15 × 10 nodes). This increased depth and width resulted in a higher average accuracy of 98.3%, while the standard deviation dropped to 15%, showcasing improved stability. This table underscores the trade-off between network complexity and accuracy, highlighting how a deeper and wider architecture, as demonstrated by “Deep Model 2”, can lead to enhanced performance.
3.5.2. Hyperparameter Tuning
Neural Network hyperparameters are predetermined settings that are established prior to training a model and cannot be learned from the data themselves. These parameters play a pivotal role in shaping the model’s performance and its ability to generalize well. In light of the accuracy outcomes presented in
Table 10, we conducted experiments to fine-tune the hyperparameters for Deep Model 2.
To identify the optimal combination of hyperparameters that yields the highest accuracy, we employed hyperparameter-tuning techniques. Specifically, we used Grid Search, a mechanism that automates the process of systematically exploring various hyperparameter values from a predefined set. This exhaustive search allowed us to determine the configuration that best enhances the accuracy of Deep Model 2. The experimental hyperparameter ranges are shown in
Table 11. Detailed results for different numbers of folds can be found in
Supplementary Information (SI) Section S2.
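The Grid Search mechanism can be sketched with scikit-learn’s GridSearchCV; the MLPClassifier here is a stand-in for our Keras model, and the candidate values are illustrative, not the exact Table 11 ranges:

```python
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.neural_network import MLPClassifier

param_grid = {
    "batch_size": [50, 100, 200],   # Batch Size candidates
    "max_iter": [100, 250, 500],    # stand-in for No. of Epochs
}

search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(10, 15, 15, 10)),  # Deep Model 2 shape
    param_grid,
    cv=5,  # No. of Folds in K-Fold Cross-Validation
)

# search.fit(X_train, y_train); best settings in search.best_params_
print(len(list(ParameterGrid(param_grid))))  # 3 x 3 = 9 combinations tried
```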
The hyperparameters examined encompass the “No. of Folds in K-Fold Cross-Validation”, “Batch Size”, and “No. of Epochs”, each playing a significant role in influencing the model’s performance. Notably, the “No. of Folds in K-Fold Cross-Validation” reflects the number of subsets in which the training dataset is partitioned for training and validation. The model’s accuracy varies with different values of k, revealing how a more-comprehensive evaluation across varying subsets can lead to refined performance, as observed in the 20-fold validation, where the accuracy remarkably reached 99.0%.
The “Batch Size” and “No. of Epochs” parameters are crucial in guiding the model’s convergence and generalization capabilities. Higher “Batch Size” values tend to expedite convergence; yet, the experimentation showcased that values around 100 yielded favorable outcomes, indicating a balanced convergence rate while avoiding potential over-fitting. Similarly, “No. of Epochs” impacts the model’s ability to adapt to the dataset, with results indicating that an Epoch count of around 500 attains substantial accuracy without over-fitting concerns.
Furthermore, the effect of these hyperparameters became more-pronounced when considering the scale of the dataset. It is noteworthy that the results were measured on an under-sampled version of the UNSW-NB15 dataset with a 50/50 ratio of attack labels and a total of 200,000 records. This context underscores the model’s practical utility and potential for real-world applications. Overall, this comprehensive analysis of the experimental hyperparameters underlines their intricate interplay in enhancing the deep learning model’s accuracy and robustness in detecting network intrusion activities.
3.6. Software and Hardware
Our results were generated using Ubuntu 22.04.1 LTS (Jammy Jellyfish) running on an eight-core Linux machine with an 11th Gen Intel(R) Core(TM) i9-11900K @ 3.50 GHz and an Asus GeForce RTX 3090 ROG graphics card. Below, we provide an overview of the experimental development platform and tools used for conducting our research. These tools and libraries were instrumental in the successful execution of our experiments, ensuring the accuracy and reliability of our results.
Experimental development platform:
Python: Version 3.7.15;
Integrated Development Environment (IDE): Spyder Version 5.3.3;
Database: SQLite Version 3.40;
Dataset: UNSW-NB15 dataset files were used in Excel format.
Python development platform:
External libraries:
BioPython: Version 1.77;
scikit-learn (sklearn) [
39]: Version 1.3;
Keras: Version 2.6;
TensorFlow: Version 2.6.
4. Results
Table 12 provides a summary of key performance metrics for our Neural Network model’s performance in classifying between benign (0) and attack (1) instances during the validation phase using the testing data, which were 20% of the total data. It shows that the model achieved a good precision, recall, and F1-Score for both classes, with an overall accuracy of 97%, a precision of 98%, a recall of 97%, and an F1-Score of 97%. We used 199,286 data rows in total, with a 50/50 ratio between the benign and attack rows and an 80% training/20% testing ratio. The weighted average for the precision, recall, and F1-Score was 97%.
Table 13 provides a comprehensive overview of the performance metrics, including the precision, recall, F1-Score, support, accuracy, macro-average, and weighted average, used to evaluate the effectiveness of the model across different classes and overall accuracy.
Table 14 shows the counts from the confusion matrix. There were 97,308 instances (97.66%) that were correctly predicted as benign (true negatives), 2335 instances (2.34%) that were benign, but were incorrectly predicted as an attack (false positives), 3039 instances (3.05%) that represented attacks, but were incorrectly predicted as benign (false negatives), and 96,604 instances (96.95%) that were correctly predicted as attacks (true positives).
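As a quick arithmetic check, the headline metrics can be recomputed from these four counts:

```python
tn, fp, fn, tp = 97308, 2335, 3039, 96604  # counts from Table 14

accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~97%
precision = tp / (tp + fp)                   # ~98%
recall = tp / (tp + fn)                      # ~97%
f1 = 2 * precision * recall / (precision + recall)  # ~97%

print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
      f"recall={recall:.2%} f1={f1:.2%}")
```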
The best results using the deep classifier model that we defined in
Table 10, using different settings for the hyperparameters from
Table 11, are reported in
Table 15. The best accuracy was 99%, with Batch Size = 100 and No. of Epochs = 500. With this combination of hyperparameters, the standard deviation of the accuracy was 27%. The other combinations of hyperparameters achieved mean accuracies ranging from 98.84% to 98.95% with standard deviations ranging from 17% to 37%. The lowest mean accuracy of 98.84% was achieved with a 10-fold cross-validation, a Batch Size of 200, and 500 Epochs. With this combination of hyperparameters, the standard deviation of the accuracy was 23%.
In general, a higher accuracy was achieved with a larger Batch Size and a higher No. of Epochs. This is because a larger Batch Size allows the model to learn more from each training iteration, and a higher No. of Epochs allows the model to train for a longer period of time. However, there is a point of diminishing returns where increasing the Batch Size or the No. of Epochs does not significantly improve the accuracy of the model. This is because the model can only learn so much from the data, and after a certain point, the additional training will not yield any significant improvement. The relationship between the accuracy and hyperparameters can vary depending on the specific dataset and the model architecture. This is because different datasets and model architectures have different characteristics, which can affect the optimal hyperparameter settings.
The highest accuracy of 99.0% was achieved with a Batch Size of 100 and 500 Epochs. This suggests that these hyperparameter settings are optimal for the Deep Model 2 architecture and the dataset used in this experiment. The accuracy was still relatively high (98.9%) with a Batch Size of 50 and 500 Epochs. This suggests that a smaller Batch Size can still be effective if the No. of Epochs is increased. The accuracy decreased slightly with a Batch Size of 200 and 500 Epochs. This suggests that a larger Batch Size may not be necessary for the Deep Model 2 architecture and the dataset used in this experiment.