The proposed methodology follows a systematic and structured pipeline designed to ensure both robustness and reproducibility in intrusion detection. As illustrated in Figure 1, the process begins with exploratory data analysis (EDA), where the datasets are examined to identify patterns, class imbalances, feature distributions, and potential anomalies. This step provides a critical foundation for understanding the underlying structure of the TON_IoT and CICIDS2017 datasets, which contain heterogeneous and large-scale traffic instances. Following EDA, a comprehensive preprocessing phase refines the data and prepares it for the deep learning models. This includes handling missing values, normalizing numeric features, encoding categorical attributes, and applying class balancing techniques to mitigate skewness between normal and attack samples.

After preprocessing, the refined dataset is fed into the modeling stage, which represents the core contribution of our work. Three hybrid deep learning architectures are explored: CNN-LSTM, CNN-GRU, and our proposed CNN-LSTM-GRU fusion model. The CNN-LSTM model leverages convolutional layers for spatial feature extraction followed by LSTM layers to capture long-term temporal dependencies, whereas the CNN-GRU model integrates GRU layers to efficiently handle shorter-term dependencies with reduced computational overhead. In contrast, the proposed CNN-LSTM-GRU architecture introduces a parallel feature extraction and fusion mechanism, in which the LSTM and GRU branches operate concurrently on the CNN's output, thereby capturing complementary representations of spatial, long-term, and short-term patterns. The outputs of these branches are integrated through a dedicated fusion layer, allowing the model to achieve faster convergence and improved feature extraction efficiency compared to sequential hybrids.

Finally, the evaluation stage measures the performance of each architecture using multiple metrics, including accuracy, precision, recall, and F1-score, across both binary and multiclass classification tasks. This rigorous evaluation provides empirical evidence of the effectiveness of the proposed methodology, highlighting the advantages of the parallel CNN-LSTM-GRU model over conventional hybrid architectures and supporting its applicability to real-world IoT and network intrusion scenarios.
3.2. CICIDS2017 Dataset
The CICIDS2017 dataset is a comprehensive benchmark for intrusion detection research, designed to closely resemble real-world traffic by incorporating both benign flows and a diverse set of contemporary cyberattacks. Collected over a five-day period in July 2017, it integrates naturalistic background traffic generated through the B-Profile system, which simulates realistic user behavior across multiple protocols (HTTP, HTTPS, FTP, SSH, and email). The dataset includes a wide range of attack scenarios such as brute force, DoS/DDoS, Heartbleed, infiltration, web-based threats, and botnets, executed in controlled yet realistic environments. Network flows were extracted using CICFlowMeter, providing over 80 detailed features per flow and ensuring precise labeling based on timestamps, IP addresses, ports, and protocols. Its richness lies in fulfilling critical dataset design criteria, covering complete network configurations, traffic diversity, heterogeneity, interaction realism, and attack variety, making it one of the most reliable and widely used benchmarks for evaluating intrusion detection systems.
3.3. Exploratory Data Analysis and Preprocessing
In order to prepare the datasets for robust training and evaluation, a rigorous process of exploratory data analysis (EDA) and preprocessing was conducted on both TON_IoT and CICIDS2017. The TON_IoT dataset initially consisted of 225,745 rows and 79 features and was used for binary classification; its label distribution was heavily skewed, containing approximately 300,000 normal instances (label 0) and 161,043 attack instances (label 1). This imbalance is illustrated in Figure 2, where the prevalence of benign samples significantly outweighs attack records. To mitigate it, we applied an undersampling strategy, randomly sampling 50,000 records from each class to create a balanced dataset of 100,000 instances, as shown in Figure 3. Following this, features were explicitly typed into categorical and numerical groups, missing values were imputed using median (numerical) or most-frequent (categorical) strategies, categorical attributes were transformed via one-hot encoding, and numerical features were standardized using z-score normalization. Stratified train–validation–test splits preserved label proportions across subsets, while undersampling was applied only to the training set to prevent data leakage.
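To make this step concrete, the following is a minimal scikit-learn sketch of the described pipeline (median/most-frequent imputation, one-hot encoding, z-score standardization, stratified splitting, and training-set-only undersampling). The file name, the "label" column name, and the 80/20 split ratio are illustrative assumptions rather than the exact configuration used.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.read_csv("ton_iot_binary.csv")           # hypothetical file name
feature_cols = df.columns.drop("label")          # 'label' column assumed
numeric_cols = df[feature_cols].select_dtypes(include="number").columns
categorical_cols = df[feature_cols].select_dtypes(exclude="number").columns

# Median imputation + z-score scaling for numerical features;
# most-frequent imputation + one-hot encoding for categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), list(numeric_cols)),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     list(categorical_cols)),
])

# Stratified split preserves label proportions; ratio is assumed.
train_df, test_df = train_test_split(df, test_size=0.2,
                                     stratify=df["label"], random_state=42)

# Undersample only the training set (50,000 records per class) so the
# test set keeps its original distribution and no leakage occurs.
train_bal = train_df.groupby("label").sample(n=50_000, random_state=42)

# Fit the transformer on training data only, then apply to the test set.
X_train = preprocess.fit_transform(train_bal[feature_cols])
y_train = train_bal["label"].to_numpy()
X_test = preprocess.transform(test_df[feature_cols])
y_test = test_df["label"].to_numpy()
```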
In the case of the TON_IoT dataset, which originally contains nine different attack categories along with normal traffic, one of the major challenges lies in the severe class imbalance. As illustrated in Figure 4, the distribution of attack classes before balancing shows that normal traffic dominates the dataset with nearly 300,000 samples, while the attack types (scanning, DoS, DDoS, injection, password, XSS, ransomware, backdoor, and MITM) are significantly underrepresented, each with far fewer instances. This imbalance can bias the training process, causing the model to prioritize majority classes while neglecting minority but equally critical ones, reducing detection capability for rare yet severe attack scenarios. To overcome this limitation and ensure fair learning across all classes, we applied a downsampling strategy that equalized the class distribution. After balancing, as depicted in Figure 5, each class, including normal traffic and all nine attack categories, was adjusted to approximately 10,000 samples, yielding a balanced dataset for multiclass classification. This preprocessing step ensures that the proposed hybrid deep learning architecture can learn representative patterns across all attack types, enhancing its generalization and detection accuracy. By balancing the dataset, we eliminated bias toward majority classes and created a more robust foundation for evaluating the effectiveness of the CNN-LSTM-GRU model in detecting diverse IoT cyber threats.
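A minimal pandas sketch of this downsampling step follows; the file name and the "type" label column are illustrative assumptions, and classes smaller than the 10,000-sample cap (e.g., MITM) are kept whole rather than oversampled.

```python
import pandas as pd

df = pd.read_csv("ton_iot_multiclass.csv")   # hypothetical file name
CAP = 10_000

# Downsample each of the ten classes (normal + nine attack types) to at
# most ~10,000 records; minority classes below the cap are left intact.
balanced = (df.groupby("type", group_keys=False)
              .apply(lambda g: g.sample(n=min(len(g), CAP),
                                        random_state=42)))
print(balanced["type"].value_counts())
```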
For the CICIDS2017 dataset, which captures realistic network flows comprising both benign and malicious traffic, a rigorous preprocessing pipeline was applied to ensure data quality and suitability for deep learning models. The initial dataset contained over 225,000 network flows, with a highly imbalanced distribution of 128,027 attack instances (DDoS and other attack types) and 97,714 benign flows. Quality control involved the removal of missing values, infinite entries, and 2633 duplicate rows, thereby improving the reliability of the dataset. Numerical attributes were scaled to a uniform range using Min–Max normalization, while categorical variables were transformed via one-hot encoding, producing a consistent input representation. Binary classification labels were encoded with BENIGN mapped to 0 and all attack categories mapped to 1. Following preprocessing, the data were partitioned into stratified training and testing sets, with 178,465 samples (78 features) allocated for training and 44,617 for testing. However, a class imbalance persisted in the training set, where attack flows (102,411) significantly outnumbered benign flows (76,054). To address this, we applied the Synthetic Minority Oversampling Technique (SMOTE), which generated synthetic benign samples to equalize the distribution. After balancing, both classes in the training set contained 102,411 instances, as shown in Table 2. This adjustment ensured that the deep learning models were trained on balanced data while the held-out test set remained unaltered to preserve real-world proportions. The final preprocessing outputs consisted of balanced, normalized, and leakage-free feature matrices with one-hot encoded labels, strengthening the experimental design and supporting a fair evaluation of the proposed hybrid models.
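As a sketch of this balancing step (assuming the imbalanced-learn library and an already scaled, stratified training split; the X_train/y_train variable names are illustrative):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Resample only the training split; the held-out test set is untouched
# so that evaluation reflects real-world class proportions.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

# Both classes should now match the majority count (102,411 per Table 2).
print(Counter(y_train_bal))
```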
3.4. Hybrid Deep Learning Models
The CNN-LSTM model combines convolutional neural networks (CNNs) and long short-term memory (LSTM) networks to capture both spatial and temporal information. At the same time, recent research shows that intrusion detectors in federated IoT environments are vulnerable to adversarial and data-poisoning attacks, which highlights the importance of architectures that remain robust under distributed training and attack-aware validation [43,44]. As a feedforward deep learning architecture, the CNN is particularly well suited to extracting localized spatial features and patterns from high-dimensional input data such as network traffic. By convolving over the input matrices, it identifies packet-level characteristics, such as header fields and flow properties, that are highly informative for detecting malicious activity in network traffic [45]. Nonetheless, a CNN alone cannot capture the temporal ordering of sequential data [46]. This limitation is addressed by feeding the extracted spatial features into an LSTM layer, which models the time sequence and the long-term dependencies in the input. A key advantage of the LSTM, a variant of the recurrent neural network (RNN), is its three gates (input, forget, and output), which regulate information flow and mitigate the vanishing-gradient problem, making it well suited to learning network data in the order it is generated. The CNN-LSTM architecture can therefore learn both the spatial characteristics of IoT traffic and the temporal characteristics needed to detect diverse intrusion patterns [47].

The CNN-GRU model replaces the LSTM unit with Gated Recurrent Unit (GRU) components, while spatial feature extraction is still performed by the convolutional layers. Like the LSTM, the GRU is a recurrent neural network designed for sequential data, but it has a less computationally expensive architecture [48]. In the GRU, the gating mechanism is reduced to an update gate, which controls how much information is carried forward, and a reset gate, which determines how much past information to forget. This reduces the number of parameters and yields faster training and convergence with lower computational requirements than the LSTM, without sacrificing the ability to capture temporal dependencies. The CNN and GRU are combined such that the CNN extracts spatial features and the GRU captures short- to medium-term temporal dependencies, making this model suitable for real-time analysis where computational time is paramount. The CNN-GRU model thus offers moderate time complexity with competitive accuracy, which makes it suitable for intrusion detection in IoT systems with relatively limited storage and computational power.
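The parameter savings can be verified directly. The snippet below (TensorFlow/Keras assumed, with an illustrative input of 10 time steps, 78 features, and 64 hidden units) compares the two layer types:

```python
import tensorflow as tf

# Shared input: (time steps, features) chosen for illustration only.
inp = tf.keras.Input(shape=(10, 78))
lstm_out = tf.keras.layers.LSTM(64)(inp)
gru_out = tf.keras.layers.GRU(64)(inp)

# Each model counts only the layers on its own path.
lstm_model = tf.keras.Model(inp, lstm_out)
gru_model = tf.keras.Model(inp, gru_out)
print("LSTM params:", lstm_model.count_params())
print("GRU params: ", gru_model.count_params())
# Expected: LSTM = 4*(78*64 + 64*64 + 64) = 36,608 (four gate weight sets);
# GRU (Keras default reset_after=True) = 3*(78*64 + 64*64 + 2*64) = 27,648.
```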
In the proposed CNN-LSTM-GRU model, convolutional layers extract multiple spatial features, while the LSTM and GRU together capture sequential information quickly and effectively. The CNN component first obtains the spatial structure of the input IoT network traffic data, identifying local patterns and correlations. These extracted features are then forwarded to both LSTM and GRU layers to exploit the strengths of each: the LSTM's strong memory retains long-term dependencies and salient sequential features, while the GRU is more compact, computationally more efficient, and faster to converge. Overall, the CNN-LSTM model learns spatial features and long-term dependencies, the CNN-GRU model learns short- and medium-term dependencies efficiently, and the CNN-LSTM-GRU model combines these components into a unified framework that balances accuracy, efficiency, and scalability. Together, these hybrid architectures address the difficulties of IoT traffic analysis and provide better means of detecting cyber threats in such contexts.
All models were trained with the Adam optimizer, a learning rate of 0.001, a batch size of 64, and a maximum of 50 epochs. Early stopping was applied with a patience of 5 based on validation loss. The CNN-LSTM-GRU model required an average of 42 min of training on an NVIDIA Tesla V100 GPU (NVIDIA Corporation, Santa Clara, CA, USA), whereas the CNN-GRU and CNN-LSTM models trained in 28 and 34 min, respectively. These settings were selected through preliminary hyperparameter tuning to balance convergence speed and performance stability.
A. Architectural Design and Model Parameters
To provide more detail regarding the architecture, we describe the internal structure and arrangement of the designed CNN-LSTM-GRU hybrid model. The network takes as input a two-dimensional tensor of shape (T, F), where T is the number of time steps and F is the number of features per instance. Spatial features are first extracted by a 1D convolutional block consisting of a convolutional layer with 64 filters and a kernel size of 3, followed by a max-pooling layer (pool size = 2) and a batch normalization layer to stabilize learning and promote convergence.
The CNN block feeds its output in parallel to two temporal learning branches: a Long Short-Term Memory (LSTM) layer and a Gated Recurrent Unit (GRU) layer. Both recurrent layers are configured with 64 hidden units, return_sequences = True, and dropout = 0.3 to reduce overfitting. This parallel arrangement enables the model to learn long-term dependencies via the LSTM and short- to medium-term dependencies via the GRU. The respective outputs are then combined through a concatenation operation, which forms the feature fusion mechanism of the architecture.
The fused temporal representation is passed through two fully connected layers of 128 and 64 neurons, respectively, each activated by the ReLU function and separated by a dropout layer (rate = 0.4) to improve generalization. The output layer depends on the classification task: a single sigmoid-activated neuron is used for binary classification, while a softmax-activated dense layer with C output neurons is used for multiclass classification, where C is the number of attack classes.
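For concreteness, a minimal Keras sketch of this topology is given below (TensorFlow/Keras assumed). Since concatenating two return_sequences = True branches yields a sequence, the pooling step that collapses the time dimension before the dense head is our assumption; the layer choices otherwise follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(T, F, n_classes=2):
    inp = layers.Input(shape=(T, F))

    # 1D convolutional block: spatial feature extraction.
    x = layers.Conv1D(64, kernel_size=3, padding="same",
                      activation="relu")(inp)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.BatchNormalization()(x)

    # Parallel temporal branches over the shared CNN representation.
    lstm = layers.LSTM(64, return_sequences=True, dropout=0.3)(x)
    gru = layers.GRU(64, return_sequences=True, dropout=0.3)(x)

    # Feature fusion: concatenate branch outputs along the feature axis,
    # then collapse the time dimension (assumed pooling step).
    fused = layers.Concatenate()([lstm, gru])
    fused = layers.GlobalMaxPooling1D()(fused)

    # Dense head with dropout for generalization.
    h = layers.Dense(128, activation="relu")(fused)
    h = layers.Dropout(0.4)(h)
    h = layers.Dense(64, activation="relu")(h)

    # Task-dependent output: sigmoid for binary, softmax for multiclass.
    if n_classes == 2:
        out = layers.Dense(1, activation="sigmoid")(h)
    else:
        out = layers.Dense(n_classes, activation="softmax")(h)
    return tf.keras.Model(inp, out)
```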
The model is optimized with Adam at a learning rate of 0.001 and a batch size of 64. Training runs for at most 50 epochs, with early stopping (patience = 5) used to avoid overfitting. A learning rate schedule (ReduceLROnPlateau) is also employed to automatically reduce the learning rate when the validation loss stagnates (factor = 0.5, patience = 3).
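A sketch of the corresponding training configuration, reusing the build_model function from the previous listing (the restore_best_weights flag, the input shape, and the data variables are our assumptions):

```python
import tensorflow as tf

model = build_model(T=10, F=78, n_classes=2)   # illustrative input shape
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # Stop when validation loss stops improving for 5 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Halve the learning rate after 3 stagnant epochs.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=3),
]
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    batch_size=64, epochs=50, callbacks=callbacks)
```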
The network is trained with the binary cross-entropy loss for binary classification:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$

and the categorical cross-entropy loss for multiclass tasks:

$$\mathcal{L}_{\mathrm{CCE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij}$$

where $y_{ij}$ and $\hat{y}_{ij}$ denote the ground truth and predicted probability for class $j$ of sample $i$, respectively.
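For reference, the two losses transcribe directly into NumPy (a sanity-check sketch; the eps clipping guard against log(0) is ours):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true in {0, 1}; y_pred holds predicted probabilities, shape (N,).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot, shape (N, C); y_pred holds class probabilities.
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```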
This architecture not only ensures comprehensive spatial–temporal modeling but also maintains computational efficiency, making it particularly suitable for real-time intrusion detection in resource-constrained IoT and IIoT environments. The modular design enables effective learning from imbalanced, sequential, and high-dimensional network traffic data.
The CNN-LSTM-GRU pipeline can be formally described as

$$\hat{y} = \mathrm{Dense}\left(\mathrm{Concat}\left[\mathrm{LSTM}(h_c),\; \mathrm{GRU}(h_c)\right]\right), \quad \text{where } h_c = \mathrm{CNN}(x).$$

Here, $x$ represents the input features, $h_c$ is the spatial representation produced by the CNN, and the LSTM and GRU branches extract parallel temporal features, which are concatenated and passed through the dense layers for final classification.
The detailed configuration of the proposed CNN–LSTM–GRU architecture is presented in Table 3, which systematically outlines each layer of the network and its respective role in the overall detection pipeline. The table highlights the progressive transformation of the input features as they pass through convolutional, pooling, recurrent, and fully connected stages. The convolutional layers serve as the initial feature extractors, identifying spatial–local dependencies in the raw flow data, while max-pooling layers reduce dimensionality and enhance computational efficiency. The recurrent stack, composed of LSTM and GRU layers, captures both long-term temporal dependencies and short-term sequential patterns, offering complementary advantages in terms of memory retention and convergence speed. Dropout layers are strategically integrated throughout the architecture to minimize overfitting and improve model generalization. Finally, the dense layers refine the extracted representations into highly discriminative features, culminating in a softmax output that yields probabilistic classification across target classes. By explicitly enumerating the architecture in this structured form, the table not only enhances reproducibility but also clarifies how the integration of CNN, LSTM, and GRU contributes to the novelty of the proposed hybrid approach compared to conventional models such as CNN–LSTM or BLSTM–GRU.