This section presents the experimental results in three stages. The first part shows the outcomes of the user profiling process based on clustering applied to normalized traffic data. The second part presents the results of merging the user profiling embeddings with the UNSW-NB15 and CIC-IDS2017 datasets using adversarial training. The third part includes the classification results obtained by training a deep feedforward neural network on the combined dataset. Each stage is evaluated separately to examine its contribution to the overall anomaly detection performance.
4.1. Behavior Profiling
The ultimate goal of this stage is to construct a behavioral dataset from the raw interaction logs collected with Charles Proxy. Since the proxy outputs low-level records of web service usage, cookies, headers, and connection events, these data must be transformed into higher-level user profiles that summarize recurring behavioral patterns. Clustering provides a principled way to group users with similar interaction habits, enabling the generation of representative embeddings that can later enrich the UNSW-NB15 traffic records. To ensure that these clusters capture genuine behavioral regularities rather than artifacts of the algorithms, we rely on several complementary quality metrics. The silhouette score indicates whether individual users are well matched to their assigned cluster compared to neighboring ones, reflecting cohesion and separation. The Calinski–Harabasz index measures the tradeoff between intra-cluster compactness and inter-cluster dispersion, with higher values supporting well-separated user archetypes. The Davies–Bouldin index evaluates the average similarity between clusters, where lower values imply greater distinction among behavioral groups. Finally, the Dunn index assesses the balance between the tightness of each cluster and the distance between clusters, where higher values reflect more robust separation. Taken together, these metrics guide the selection of the clustering configuration, ensuring that the resulting behavioral dataset encodes consistent user profiles that can be meaningfully integrated into anomaly detection experiments.
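For concreteness, the sketch below computes the four quality metrics on a clustering result. Scikit-learn provides the first three directly, while the Dunn index is hand-rolled since scikit-learn does not include it; the feature matrix and labels are placeholders standing in for the normalized Charles Proxy behavioral data.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

def dunn_index(X, labels):
    """Smallest inter-cluster distance divided by largest cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    max_diameter = max(cdist(c, c).max() for c in clusters)
    min_separation = min(cdist(a, b).min()
                         for i, a in enumerate(clusters)
                         for b in clusters[i + 1:])
    return min_separation / max_diameter

def cluster_quality(X, labels):
    return {
        "silhouette": silhouette_score(X, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "dunn": dunn_index(X, labels),                            # higher is better
    }

# Placeholder data standing in for the normalized behavioral features.
X = np.random.rand(200, 8)
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
print(cluster_quality(X, labels))
```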
The clustering results from traditional methods applied without the hybrid model (autoencoder followed by agglomerative clustering) provide a baseline for evaluating K-means, agglomerative clustering, and DBSCAN. Figure 2 presents the silhouette scores for each method across a range of cluster numbers. K-means clustering yields silhouette scores between 0.4159 and 0.5048, indicating a moderate to weak clustering structure. A higher silhouette score generally suggests that the clusters are well separated, while a lower score implies that they are less distinct. The scores decrease as the number of clusters increases, but the performance does not degrade drastically, suggesting relatively stable clustering behavior across different values of k.
The silhouette scores for agglomerative clustering range from 0.3913 to 0.4739, following a trend comparable to K-means. The scores are slightly lower on average than those for K-means and do not improve significantly as the number of clusters increases, indicating that agglomerative clustering does not partition the original data more effectively than K-means. In contrast, DBSCAN produces scores from 0.4286 down to −1, including negative values for some parameter choices. The negative silhouette scores indicate that DBSCAN fails to cluster some data points correctly, likely because it cannot identify meaningful clusters or because it marks a high number of points as noise. This suggests that DBSCAN is less reliable in this case, possibly due to the difficulty of determining an appropriate density threshold for the data.
When using the hybrid model, the autoencoder reduces the dimensionality of the data to a 2D latent space, allowing the clustering algorithm to focus on the essential features of the data. Agglomerative clustering on these latent features achieves much higher and more consistent silhouette scores, ranging from 0.7861 to 0.8632, a substantial improvement over the traditional methods. The higher scores indicate that the clusters are better defined, with the hybrid model effectively capturing the underlying structure of the data.
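A minimal sketch of this hybrid pipeline is given below: an autoencoder compresses the behavioral features to a 2D bottleneck, and agglomerative clustering runs on the latent codes. The layer widths, training settings, and cluster count are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from tensorflow import keras
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

n_features = 20                                         # assumed behavioral vector width
X = np.random.rand(500, n_features).astype("float32")   # placeholder data

# Symmetric autoencoder with a 2-D bottleneck.
inputs = keras.Input(shape=(n_features,))
h = keras.layers.Dense(16, activation="relu")(inputs)
latent = keras.layers.Dense(2, name="latent")(h)        # 2-D latent space
h = keras.layers.Dense(16, activation="relu")(latent)
outputs = keras.layers.Dense(n_features)(h)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, latent)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=100, batch_size=32, verbose=0)  # reconstruction objective

# Cluster in the latent space instead of the original feature space.
Z = encoder.predict(X, verbose=0)
labels = AgglomerativeClustering(n_clusters=4).fit_predict(Z)
print("latent-space silhouette:", silhouette_score(Z, labels))
```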
The results demonstrate that the reduced latent representation generated by the autoencoder enables more effective cluster separation. The clustering performance is more stable across different numbers of clusters, reaching its highest silhouette score at the optimal cluster count. This suggests that dimensionality reduction helps uncover the inherent structure of the data, enabling more distinct and meaningful clusters.
The hybrid model using an autoencoder followed by agglomerative clustering outperforms K-means, DBSCAN, and agglomerative clustering without dimensionality reduction. The autoencoder extracts a low-dimensional representation of the input that improves group separation when used as clustering input, and the model achieves higher silhouette scores across all tested configurations. Additional clustering metrics support this result. The Calinski–Harabasz index reaches 195.30, indicating strong between-cluster separation relative to within-cluster dispersion and confirming that the latent features highlight distinctive behavioral patterns. The Davies–Bouldin index reaches 0.5775, showing limited overlap between groups; it is low but not minimal because clusters corresponding to users with intermediate behavior patterns partially overlap. For example, some students in the mechatronics group exhibit both programming and networking activities, creating latent embeddings that lie between the distinct behavioral groups and increasing intra-cluster distance relative to inter-cluster separation. The Dunn index reaches 0.0729, suggesting that although clusters are well separated, intra-cluster compactness could be improved, reflecting variability among users within the same behavioral group. Together, these metrics reflect consistent clustering performance and provide a basis for comparison across models.
Overall, the hybrid autoencoder–agglomerative clustering approach demonstrates superior clustering quality compared to traditional methods and PCA baselines. While the silhouette score indicates well-defined clusters, the moderate Davies–Bouldin and Dunn indices reflect the presence of users with mixed or transitional behaviors, which naturally reduces perfect cluster separation. These results confirm that dimensionality reduction via autoencoders captures the non-linear structures in user behavior, enabling more meaningful profiling for downstream anomaly detection.
The clustering analysis validates the feasibility of transforming raw Charles Proxy logs into a structured behavioral dataset suitable for cross-domain enrichment. Traditional methods such as K-means and DBSCAN showed limited separation capacity, highlighting the challenge of capturing complex user interaction patterns directly from high-dimensional data. By contrast, the hybrid autoencoder–agglomerative clustering approach achieved substantially higher and more consistent performance across all evaluation metrics, confirming that dimensionality reduction uncovers latent behavioral structures otherwise hidden in the raw traffic. The combination of silhouette, Calinski–Harabasz, Davies–Bouldin, and Dunn indices provided a balanced view of both cohesion and separation, guiding the selection of the optimal configuration for dataset construction. Although some overlap remains among transitional user groups, the overall results demonstrate that the hybrid model produces stable and interpretable clusters that serve as reliable behavioral embeddings. These embeddings form the foundation of the enriched dataset, ensuring that anomaly detection experiments incorporate context-aware profiles rather than relying solely on low-level traffic parameters.
4.2. Dataset Enrichment
The enrichment stage aims to align traffic-based features from UNSW-NB15 with behavioral embeddings extracted from Charles Proxy logs, creating a joint representation space where both sources contribute complementary information. To assess whether this cross-domain alignment is effective, we rely on representation-level metrics and visualization techniques. Traditional clustering indices are less informative here, since the goal is not to form static user groups but to evaluate how well the adversarial training integrates heterogeneous domains while preserving their internal structures. Therefore, t-distributed Stochastic Neighbor Embedding (t-SNE) is employed to project high-dimensional embeddings into two dimensions, making latent patterns interpretable. This technique is widely used to verify whether embeddings preserve neighborhood relations, expose separable clusters, and reduce domain overlap. In our context, t-SNE allows us to (i) confirm that UNSW traffic and Charles behavioral data remain distinguishable yet aligned in a shared latent space, (ii) inspect whether enrichment tightens the geometry of normal traffic patterns and refines the separation of anomalous samples, and (iii) visually validate that the learned embeddings do not collapse into indistinguishable clusters, which would indicate poor domain alignment. In short, these metrics and visualizations are necessary to ensure that enrichment produces meaningful latent representations rather than simply concatenating features without structural coherence.
To interpret how the enrichment reshapes the latent space, we apply t-SNE to the learned embeddings [58]. The projection preserves local neighborhood structure while reducing the representations to two dimensions, making cluster formation, overlap, and domain separation visually apparent. This visualization allows us to (i) verify the domain separation achieved by the adversarial alignment between the network traffic datasets (UNSW-NB15 and CIC-IDS2017) and the Charles behavioral data, and (ii) inspect how the enrichment modifies the geometry of network embeddings by tightening normal patterns and exposing finer-grained anomalous structures. In other words, the plots provide an intuitive view of how behavioral context changes the arrangement of samples in the latent space, clarifying the mechanism by which the enriched features improve separability for downstream anomaly detection.
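A minimal sketch of this inspection is shown below, assuming the aligned embeddings are available as NumPy arrays; the random arrays are placeholders for the adversarially aligned outputs of each domain.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholders standing in for the adversarially aligned embeddings.
unsw_emb = np.random.randn(300, 64)     # UNSW-NB15 traffic embeddings
charles_emb = np.random.randn(100, 64)  # Charles behavioral embeddings

combined = np.vstack([unsw_emb, charles_emb])
proj = TSNE(n_components=2, perplexity=30,
            random_state=0).fit_transform(combined)

n = len(unsw_emb)
plt.scatter(proj[:n, 0], proj[:n, 1], c="tab:blue", s=8, label="UNSW-NB15")
plt.scatter(proj[n:, 0], proj[n:, 1], c="tab:red", s=8, label="Charles")
plt.legend()
plt.title("t-SNE projection of aligned embeddings")
plt.show()
```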
Figure 3 presents the t-SNE embeddings obtained from the UNSW-NB15 and Charles datasets. In the left panel, the UNSW embeddings (blue) form several clusters that correspond to distinct categories of network traffic, including normal activity and multiple attack types. The Charles embeddings (red) appear clearly separated from the network traffic, with one dominant cluster representing regular user behavior and scattered points indicating atypical or anomalous activities. This separation illustrates that the learned representations are able to capture differences not only across datasets but also within each domain, distinguishing between baseline and anomalous patterns.
The right panel shows the enriched embeddings for a subset of 100 UNSW samples. The green points form a compact cluster associated with normal traffic, while other clusters capture different categories of malicious activity. The presence of scattered points indicates events that do not conform to the main groups, which may correspond to rare attack signatures. Compared to the original embeddings, this enriched representation yields a clearer separation of attack-related patterns, suggesting that the model captures structural differences in network behavior with greater precision.
Together, both panels indicate that the training process achieved a domain separation between network traffic and user behavior while also preserving the internal structure within each domain. This allows the representation to support the identification of normal, suspicious, and anomalous activities, which is essential for subsequent profiling and detection tasks.
Overall, the enrichment analysis demonstrates that adversarial training successfully integrates user behavior with network traffic features while maintaining the internal structure of each domain. The clear separation between behavioral and traffic embeddings, together with the refined clustering observed in the enriched dataset representations, confirms that cross-domain alignment provides a richer latent space for anomaly detection. Normal traffic becomes more compact, anomalous activities gain sharper boundaries, and rare events emerge as distinct outliers, suggesting that the added behavioral context improves the discriminative capacity of the feature space. These findings validate the necessity of the enrichment step; rather than replacing traffic parameters, behavioral embeddings act as a regularization mechanism that enhances the separability of anomalies, laying the foundation for more robust classification in the next stage.
4.3. Anomaly Classification
The anomaly classification stage evaluates the effectiveness of integrating behavioral embeddings into traffic-based models by comparing them with a traffic-only baseline. In this context, the choice of performance metrics is critical, as different indicators capture different aspects of intrusion detection quality. Accuracy measures the overall proportion of correctly classified samples but can be misleading when attack classes are imbalanced. Precision reflects the ability of the model to correctly identify attacks among all traffic labeled as malicious, minimizing false alarms in operational scenarios. Recall quantifies the ability to detect actual attacks, which is crucial to avoid undetected intrusions. The F1-score balances precision and recall, providing a single measure that accounts for both detection capability and false positives. Finally, the ROC-AUC evaluates the model’s discrimination power across thresholds, offering a robust view of performance under varying decision criteria. Together, these metrics provide a comprehensive evaluation framework that goes beyond raw accuracy, ensuring that improvements with behavioral enrichment are meaningful for real-world IDS deployment.
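As a concrete reference, these metrics can be computed with scikit-learn as sketched below; `y_true` and `y_score` are placeholder arrays standing in for the ground-truth labels (1 = attack) and the model's predicted attack probabilities.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1_000)                          # placeholder labels
y_score = np.clip(0.6 * y_true + rng.random(1_000) * 0.6, 0, 1)  # placeholder scores
y_pred = (y_score >= 0.5).astype(int)                            # default threshold

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),  # false-alarm control
    "recall": recall_score(y_true, y_pred),        # missed-attack control
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_score),     # threshold-independent
}
print(metrics)
```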
This section evaluates anomaly detection performance by comparing two experimental setups across both UNSW-NB15 and CIC-IDS2017 datasets: a baseline model trained on balanced datasets without enrichment and a model trained on enriched datasets that integrate behavioral profiles through adversarial autoencoder alignment. For both datasets, we sample 30,000 instances from each class (attack and normal traffic) to create balanced training sets of 60,000 samples in total. This configuration ensures fair comparison between baseline and enriched approaches while maintaining consistency across benchmarks.
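A minimal sketch of this balanced sampling step, assuming each dataset is loaded as a pandas DataFrame with a binary `label` column (the column name is an assumption):

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, per_class: int = 30_000,
                    label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Draw `per_class` rows from each class, then shuffle the union."""
    parts = [df[df[label_col] == c].sample(per_class, random_state=seed)
             for c in (0, 1)]
    return (pd.concat(parts)
              .sample(frac=1.0, random_state=seed)   # shuffle: 60,000 rows
              .reset_index(drop=True))
```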
The evaluation includes learning curve analysis, confusion matrices, and cross-validation metrics to provide a comprehensive assessment of classification performance. The confusion matrices highlight the distribution of false positives and false negatives in each setup, while cross-validation ensures that the results are stable across multiple folds and not tied to a single train–test split.
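To make the setup concrete, the following sketch shows a deep feedforward classifier of the kind evaluated here, along with the training history behind the learning curves. The layer widths, dropout rate, and batch size are illustrative assumptions rather than the paper's exact configuration; the placeholder features use the 96-dimensional enriched width.

```python
import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
X_train = rng.random((60_000, 96)).astype("float32")    # placeholder enriched features
y_train = rng.integers(0, 2, 60_000).astype("float32")  # placeholder labels

model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),        # attack probability
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

history = model.fit(X_train, y_train, validation_split=0.2,
                    epochs=50, batch_size=256, verbose=0)
# history.history["loss"] and history.history["val_loss"] yield the
# learning curves of the kind shown in Figures 4 and 6.
```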
For the enriched configurations, the integration of behavioral profiles introduces additional variability that avoids perfect memorization. This prevents the classifier from converging to trivial solutions and promotes the learning of more robust decision boundaries. It is important to note that this benefit is specific to the controlled experimental configuration presented here; no model can completely eliminate the risk of overfitting in all scenarios. Nevertheless, the enriched datasets help the classifier balance accuracy and generalization, achieving realistic performance that is more representative of conditions likely to be encountered in deployment.
Our contribution demonstrates the technical feasibility of cross-domain alignment using adversarial autoencoders. The enrichment augments the feature space from 42 to 96 dimensions and produces consistent patterns across two benchmark datasets. We position this as a proof-of-concept for cross-domain enrichment methodologies, showing that behavioral context can be systematically integrated with network traffic features to enhance model robustness.
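One plausible reading of the 42 to 96 dimension growth, sketched below purely for illustration, is that each traffic record is concatenated with a 54-dimensional behavioral embedding produced by the alignment. The profile count, the embedding width, and the random profile assignment are all assumptions; in the actual pipeline the adversarially aligned latent space would drive the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
traffic = rng.random((60_000, 42))   # placeholder traffic feature matrix
profiles = rng.random((5, 54))       # placeholder behavioral profile embeddings

# Hypothetical assignment of each record to a behavioral profile (random
# here); the aligned latent space would determine this in practice.
assignment = rng.integers(0, len(profiles), size=len(traffic))
enriched = np.hstack([traffic, profiles[assignment]])
print(enriched.shape)                # (60000, 96)
```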
4.3.1. UNSW-NB15 Results
Figure 4 illustrates the training dynamics for both experimental setups on UNSW-NB15. The model trained on enriched data (Figure 4a) demonstrates gradual loss reduction over 50 epochs, with training loss decreasing from 0.67 to 0.04 and validation loss stabilizing at 0.05. The convergence pattern shows a consistent decrease without abrupt changes, maintaining a small gap between training and validation losses throughout the process. The model trained on balanced original data (Figure 4b) exhibits rapid convergence within the first 10 epochs. Both training and validation losses drop from initial values of 0.6 and 0.47, respectively, to near-zero values by epoch 6. The losses remain at approximately 0.001 for the remaining training period, creating parallel trajectories with minimal separation.
The convergence behavior reveals fundamental differences in learning dynamics. The enriched model requires extended training periods to achieve convergence, suggesting that the augmented dataset presents a more complex optimization landscape. The sustained loss values indicate that the model continues to encounter variability in the data throughout training. The rapid convergence observed in the model without enriched data indicates that the optimization process quickly identifies patterns that allow near-perfect classification. The immediate drop to near-zero loss suggests that the model rapidly learns to exploit specific characteristics present in the training data. This behavior typically occurs when the dataset lacks sufficient complexity to challenge the model's learning capacity.
The maintained gap between training and validation losses in the model with enriched data provides evidence that the data augmentation process successfully introduces regularization effects. The model’s inability to achieve zero loss indicates that the enriched dataset contains sufficient variability to prevent complete memorization of training patterns. These training dynamics support the hypothesis that user behavior profiling enrichment creates learning conditions that require the model to develop more robust decision boundaries rather than relying on dataset-specific artifacts.
Figure 5 presents the confusion matrices for both experimental setups on UNSW-NB15. The model trained on enriched data (Figure 5a) produces 53 false positives (normal traffic classified as attacks) and 28 false negatives (attacks classified as normal traffic). This results in a false positive rate of 0.87% and a false negative rate of 0.47%. The error distribution demonstrates that the model maintains discrimination capability while exhibiting controlled misclassification patterns. In contrast, the model trained on original data (Figure 5b) achieves zero misclassifications across all categories. The absence of any false positives or false negatives indicates that the model achieves perfect separation between classes during validation.
The presence of misclassification errors in the enriched model suggests that the data augmentation process introduces variability that prevents the model from achieving perfect memorization of the training patterns. The balanced error distribution between false positives and false negatives indicates that the model does not exhibit bias toward either class. The zero-error performance of the model without enriched data, while appearing optimal, raises concerns about the ability of the model to generalize beyond the specific data distribution encountered during training. Perfect classification performance on validation data typically indicates that the model has learned to exploit specific patterns or artifacts present in the dataset rather than developing robust decision boundaries.
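The reported rates can be approximately reconstructed from the confusion-matrix counts; the per-class validation sizes below are assumptions (roughly 6,000 samples per class, not stated explicitly in the text), which is why the recomputed false positive rate differs slightly from the reported 0.87%.

```python
# Confusion-matrix counts from Figure 5a.
fp, fn = 53, 28
neg, pos = 6_000, 6_000          # assumed per-class validation sizes
tn, tp = neg - fp, pos - fn

fpr = fp / (fp + tn)             # ≈ 0.88% (paper reports 0.87%)
fnr = fn / (fn + tp)             # ≈ 0.47%, matching the reported rate
print(f"FPR = {fpr:.2%}, FNR = {fnr:.2%}")
```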
The model trained on the enriched UNSW-NB15 dataset achieved accuracy of 97.17%, precision of 96.95%, recall of 97.34%, F1-score of 97.14%, and ROC-AUC of 99.77%. The training process converged over 31 epochs, with a small gap between training and validation losses. The model trained solely on the balanced original data reached 100% across all measures, with rapid convergence and validation loss approaching zero.
The enriched model demonstrates stable and consistently high performance across all folds, as evidenced by the results in Table 6. Accuracy, precision, recall, F1-score, and ROC-AUC remain stable, with minimal variation, indicating that the model generalizes well to unseen data and that the high performance is not caused by overfitting.
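A sketch of the cross-validation protocol behind these fold-level results is shown below, with a scikit-learn MLP standing in for the Keras classifier to keep the example compact; the fold count, estimator settings, and placeholder data are assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.random((6_000, 96))          # placeholder enriched features
y = rng.integers(0, 2, 6_000)        # placeholder labels

# Stratified splits preserve the attack/normal balance in every fold.
clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=50)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=skf, scoring="accuracy")
print(f"accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```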
4.3.2. CIC-IDS2017 Results
Figure 6 illustrates the training dynamics for both experimental setups on CIC-IDS2017. The model trained on enriched data (Figure 6a) demonstrates a similar gradual convergence pattern, with training and validation losses decreasing progressively over the training period. Both curves converge towards low loss values while maintaining a small but consistent gap, indicating controlled learning without memorization. The model trained on balanced original CIC-IDS2017 data (Figure 6b) exhibits rapid convergence, with both training and validation losses dropping sharply in the initial epochs and stabilizing at near-zero values. This pattern mirrors the UNSW-NB15 baseline, suggesting consistent overfitting characteristics across datasets when behavioral enrichment is absent.
The consistency of convergence patterns across both datasets provides evidence that behavioral enrichment produces systematic regularization effects independent of the specific network traffic characteristics. The enriched models in both UNSW-NB15 and CIC-IDS2017 require extended training periods and maintain validation gaps, while both baselines converge rapidly to near-perfect training performance.
Figure 7 presents the confusion matrices for both experimental setups on CIC-IDS2017. The model trained on enriched data (Figure 7a) produces 4 false positives and 4 false negatives, resulting in a false positive rate of 0.02% and a false negative rate of 0.07%. This balanced error distribution, while lower in absolute numbers compared to UNSW-NB15, maintains the characteristic pattern of controlled misclassification that indicates robust generalization. In contrast, the model trained on original CIC-IDS2017 data (Figure 7b) achieves zero misclassifications across all categories, mirroring the perfect separation observed in the UNSW-NB15 baseline.
The model trained on the enriched CIC-IDS2017 dataset achieved accuracy of 99.93%, precision of 99.93%, recall of 99.93%, F1-score of 99.93%, and ROC-AUC of 99.99%. While these metrics are higher than those observed with UNSW-NB15 enrichment (97%), they remain below the perfect 100% baseline performance, maintaining the characteristic gap that indicates generalization capability. The model trained solely on the balanced original CIC-IDS2017 data reached 100% across all measures, consistent with the baseline behavior observed in UNSW-NB15.
The enriched CIC-IDS2017 model demonstrates remarkably stable performance across all folds, as shown in Table 7. The minimal variation across folds (all metrics at 99.93%) indicates consistent generalization behavior. The slightly higher absolute performance compared to UNSW-NB15 enrichment may reflect differences in traffic patterns or attack distributions between datasets, but the consistent presence of controlled error rates across both benchmarks confirms that behavioral enrichment provides a systematic mechanism for avoiding perfect memorization.
The observed behavior across both datasets highlights the value of improving the representational richness of the input space. The enriched datasets increase variability and complexity without requiring additional raw data, demonstrating that robust generalization can be achieved through data enrichment rather than architectural modifications. This effect is particularly relevant because it shows that, when provided with behaviorally enriched features, simpler models can reach levels of reliability that would typically require greater capacity or significantly larger datasets.
While the baseline models achieve perfect validation performance on both balanced datasets, such results are atypical in real-world deployment where data distributions shift, attack variants emerge, and computational constraints favor simpler models. Our enrichment framework demonstrates that behavioral context enables high performance (97%+ for UNSW-NB15, 99.9%+ for CIC-IDS2017) with enhanced robustness, as evidenced by (1) stable cross-validation across folds in both datasets, (2) balanced error distribution preventing bias toward false negatives or positives, and (3) training dynamics that consistently suggest learning generalizable patterns rather than dataset-specific artifacts.
The comparison across two benchmark datasets indicates that data enrichment with user behavior profiling acts as a systematic regularization mechanism, even with simple neural network architectures. By introducing variability through behavioral context derived from adversarial autoencoder alignment, the enriched datasets maintain the complexity of network traffic while preventing the model from memorizing training patterns. The consistency of this effect across UNSW-NB15 and CIC-IDS2017—despite their different traffic characteristics and attack distributions—validates the technical feasibility of cross-domain alignment for intrusion detection.
The performance difference between enriched models (97–99.9%) and baseline models (100%) should not be interpreted as the baseline being superior. Rather, the baseline's perfect validation scores under balanced conditions represent an upper bound achievable through dataset-specific optimization. The enriched models' slightly lower but highly stable performances across folds, combined with training dynamics showing sustained variability in both datasets, indicate learning patterns that may generalize better to distribution shifts encountered in deployment. This tradeoff between validation perfection and operational robustness aligns with established principles in machine learning regularization.
These results support the use of cross-domain behavioral enrichment with network traffic data to enhance robustness across architectural designs. The enrichment framework creates training scenarios that prepare models for deployment in network environments where computational efficiency and model interpretability are priorities. The consistency of the regularization effects across two distinct benchmark datasets demonstrates that adversarial autoencoder-based alignment can systematically integrate behavioral context with traffic features, producing models that are less brittle and more reliable for deployment in dynamic network environments.