Condensation of Data and Knowledge for Network Traffic Classification: Techniques, Applications, and Open Issues
Abstract
1. Introduction
- We provide a comprehensive review and classification of data and knowledge condensation techniques and their applications. These techniques are divided into four categories: coreset selection, data compression, knowledge distillation, and dataset distillation. Coreset techniques are categorized by data type, query set, and construction method, while compression techniques are classified by code type, data type, code quality, and coding schemes. Knowledge distillation methods are grouped by distillation schemes, algorithms, and applications, and dataset distillation is categorized based on distillation methods and applications.
- We emphasize data condensation enablers and recent schemes for network (malicious) traffic classification, highlighting the relationship between data condensation techniques and their application in network classification tasks, particularly in complex, resource-constrained environments.
- We examine the gap between advanced condensation approaches and current network (malicious) traffic classification tasks, identifying key challenges and open issues for future research in applying these techniques to traffic classification.
2. Related Surveys
3. Types of Data
3.1. Structured Data
3.2. Unstructured Data
3.3. Semi-Structured Data
3.4. Summary
4. Condensation of Data and Knowledge
4.1. Coreset Selection
4.1.1. Data Types
- Weighted subset of input: This is a small weighted subset of the original data, approximating the full dataset by applying loss functions, models, classifiers, and hypotheses on the coreset. It has been used to solve ordered clustering problems [41] and to design efficient algorithms for power mean problems in high-dimensional Euclidean space [42].
- Weighted subset of input space: Here, the input data in both the coreset and the original set are drawn from the same ground set or metric space X. For example, the coreset S is a subset of X, but not of the input set P. Rosman et al. [43] constructed such a coreset for k-segmentation in streaming data, and Tukan et al. [44] used near-convex functions to construct coresets.
- Sketch matrices: In this case, each data point in the coreset is a linear combination of the input data. Feldman et al. [45] used space-efficient sketches of -distances to approximate -regression in low-dimensional data streams, while Karnin and Liberty [46] improved streaming sketches for low discrepancy problems. A minimal numerical illustration is sketched after this list.
- Low-dimensional coresets: These coresets represent data in a low-dimensional space rather than using a small number of points. They are often used to reduce the size of sparse datasets [47].
- Generic data structures: These are hybrids of one or more of the above coreset types. Although handling them can be challenging, they may achieve very small coreset sizes for certain problems [43].
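To make the sketch-matrix idea above concrete, the following minimal Python sketch (not drawn from any of the cited works) compresses an n × d point set into a k × d matrix whose rows are random linear combinations of the input points, approximately preserving the Gram matrix; the sizes and the Gaussian sketching matrix are illustrative choices only.

```python
import numpy as np

def gaussian_sketch(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Compress an (n, d) point set into a (k, d) sketch.

    Each row of the sketch is a random linear combination of the input
    rows, so the Gram matrix X^T X is preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.standard_normal((k, n)) / np.sqrt(k)  # sketching matrix
    return S @ X

# Toy usage: the sketch approximates X^T X with far fewer rows.
X = np.random.default_rng(1).standard_normal((10_000, 5))
Xs = gaussian_sketch(X, k=200)
rel_err = np.linalg.norm(X.T @ X - Xs.T @ Xs) / np.linalg.norm(X.T @ X)
print(rel_err)  # relative error shrinks as k grows
```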
4.1.2. Query Set
- Strong coresets approximate every query in the given query set, providing error guarantees for all queries [48]. Based on the sensitivity framework introduced in [49] and the tight sensitivity bound of [50], it has been proven that a strong ε-coreset can be obtained through an intelligent reweighting scheme applied to a normalized weighted input set, by selecting a non-uniform random sample of the input.
- Weak coresets are associated with a set of queries but provide error guarantees only for some of them [48]. They can be derived using the Bernstein inequality [51]. Additionally, weak coresets can be obtained from strong coresets by applying non-uniform sampling and reweighting the samples, resulting in smaller coresets compared to those produced by the sensitivity framework [52].
- Sparse coresets provide error guarantees only for the optimal query and are not composable. They typically support specific computational models and do not provide information about the optimal solution of the original data [7]. Sparse coresets can be computed using convex optimization techniques, such as the Frank–Wolfe algorithm [53], or by using the weight vector of the (strong) coreset itself [47].
4.1.3. Constructions
- Uniform sampling divides the input data into subpopulations and samples from each subpopulation in proportion to its size. It is a common method for constructing a kernel and reduces the running time to sublinear in the input size. However, it does not guarantee a (1 + ε)-multiplicative error like other coresets and may overlook small yet important data points and distant clusters [57].
- Importance sampling constructs a coreset by sampling data from a distribution distinct from the original dataset’s distribution. This distribution is approximated by a weighted average of random draws from the original distribution [58]. By replacing a uniform sample with a non-uniform sample of the same size over the input space, it reduces the additive error of uniform sampling to a (1 + ε)-multiplicative guarantee; a minimal sensitivity-based construction is sketched after this list.
- Grid sampling discretizes the input space into cells and selects a representative from each cell, weighted by the number of input points in the cell. While grid sampling generates kernels with lower additive error, its time and space complexity grows exponentially with the number of cells. Grid-based approaches have been applied to construct coresets for kernel regression [59], dynamic data stream clustering [60], and empirical risk minimization problems [61].
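The difference between uniform and importance sampling can be illustrated with a small sensitivity-based construction. The sketch below targets the 1-mean cost, for which the classic sensitivity bound has a closed form; it is a simplified illustration under these assumptions, not a general-purpose coreset builder.

```python
import numpy as np

def one_mean_coreset(X, m, seed=0):
    """Importance-sampling coreset for cost(c) = sum_i ||x_i - c||^2.

    Sensitivities follow the classic 1-mean bound:
    s_i = 1/n + ||x_i - mu||^2 / sum_j ||x_j - mu||^2,
    and each sampled point gets weight 1 / (m * p_i) so that the
    weighted cost is an unbiased estimate for every query center c.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mu = X.mean(axis=0)
    sq = ((X - mu) ** 2).sum(axis=1)
    sens = 1.0 / n + sq / sq.sum()
    p = sens / sens.sum()
    idx = rng.choice(n, size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])
    return X[idx], w

def weighted_cost(C, w, c):
    return (w * ((C - c) ** 2).sum(axis=1)).sum()

X = np.random.default_rng(1).standard_normal((50_000, 3))
C, w = one_mean_coreset(X, m=500)
c = np.array([0.5, -0.2, 1.0])                 # an arbitrary query center
full_cost = ((X - c) ** 2).sum(axis=1).sum()
print(full_cost, weighted_cost(C, w, c))       # the two costs should be close
```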
4.2. Compression
4.2.1. Types of Codes
4.2.2. Types of Data
4.2.3. Data Quality
4.2.4. Coding Schema
- Huffman coding is an entropy-based algorithm that generates optimal prefix-free codes, commonly used in lossless data compression. It assigns variable-length codes to input characters based on their frequency, ensuring efficient compression by using shorter codes for more frequent characters. Huffman coding operates by constructing a binary tree, where each leaf node corresponds to a character, and the tree’s structure determines the codes. This method is particularly popular due to its simplicity, fast compression speed, and lack of patent restrictions. Huffman coding is employed in various compression schemes, including Deflate, JPEG, and MP3. For further details, see [67]. A minimal construction is sketched after this list.
- Arithmetic coding is an entropy-based technique used in lossless data compression. Unlike methods such as Huffman coding, which assign a separate codeword to each symbol, arithmetic coding encodes an entire message as a single number, an arbitrary-precision fraction q with 0 ≤ q < 1. This allows frequently occurring characters to be represented with fewer bits, leading to more efficient compression. Arithmetic coding offers greater compression efficiency, supports adaptive models, and is computationally efficient. It is widely used in adaptive text compression, non-adaptive coding, compression of black-and-white images, and coding of integers with arbitrary distributions [79,80].
- Dictionary-based encoding identifies repeated patterns in an input sequence and encodes them using indices to a dictionary. This method is particularly effective when the input contains frequent patterns, which can be categorized as common or uncommon. Common patterns are encoded with shorter codewords, while uncommon ones use longer codes. The dictionary can be either static or dynamic: a static dictionary is used when prior knowledge of the source is available, whereas a dynamic dictionary adapts to the data when such knowledge is absent. If a pattern is not found in the dictionary, it is encoded using a less efficient method. A well-known dictionary-based technique is the Lempel–Ziv algorithm, which replaces frequently occurring patterns with a single symbol and maintains a dictionary of these patterns. The dictionary’s size is usually fixed. Lempel–Ziv is widely used for lossless file compression, particularly for larger files, and is adaptable to various formats such as GIF, PNG, PDF, and TIFF. Popular versions include LZ77 [81], LZ78 [82], and others [83]. These versions are commonly used in compression utilities like Gzip and ZIP.
- Burrows–Wheeler Transform (BWT) is a lossless block-sorting compression technique (Burrows and Wheeler, 1994). Given an input string s, the BWT generates a permutation of s, denoted BWT(s), which allows for efficient compression while still enabling the original string to be retrieved. Compressing BWT(s) is more efficient than directly compressing the original string, because the transform tends to group identical characters together. Key techniques used in BWT compression include the move-to-front transform [84] and run-length coding [85,86]. Research has demonstrated that BWT achieves high compression ratios with relatively low time and space complexity. The Unix bzip2 utility employs BWT for compression. BWT is particularly useful in the biological sciences, where genomes often contain many repeats but few runs. For more details, see [87].
- Fractal compression is a lossy technique used to compress digital images, particularly effective for natural images or textures that exhibit self-similarity within the image. By identifying similar regions in the image and converting them into fractal codes, the image can be compressed efficiently [88]. The technique is based on the iterated function system (IFS), which uses the collage theorem to select transformations that optimize the compression result. Fractal image compression typically involves three steps: image segmentation, similarity search, and similarity coding. The similarity search is crucial, as having sufficient similar regions improves the compression quality but increases computational cost. Several fractal image compression techniques are discussed in [89]. This approach has been widely used in commercial applications for both image and video compression [90].
- Wavelet transform is a data compression technique that converts the input from the time–space domain to the time–frequency domain, achieving more efficient compression. Wavelets are functions defined over a finite interval, and the wavelet transform represents an arbitrary function as a linear combination of these wavelets or basis functions. These functions are derived from a prototype wavelet, known as the mother wavelet, through scaling and shifting [91]. Wavelet transform is widely used for image compression, with notable implementations including JPEG 2000, DjVu, and ECW for still images, and JPEG XS, CineForm, and the BBC’s Dirac for video. The discrete wavelet transform is also employed in audio compression [92]. The primary goal is to store image or audio data using as little space as possible. Wavelet transform compression can be either lossless or lossy. For more details, see [93].
- Scalar and vector quantization is a lossy compression technique where a scalar value or a vector is chosen from a finite set of possible values or vectors to represent a sample or an input vector of samples, respectively. By reducing the precision of the input data, quantization significantly lowers the computational complexity of compression. This method is commonly used for image compression. In the context of deep learning, large models consisting of weights, biases, and activations can often be quantized to eight-bit integers, making scalar and vector quantization a useful approach for model compression [94].
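As a concrete illustration of the entropy-based schemes above, the following minimal Python sketch builds a Huffman prefix code from symbol frequencies with a binary heap; it is a didactic example, not an implementation of any production codec such as Deflate.

```python
import heapq
from collections import Counter

def huffman_code(text: str) -> dict[str, str]:
    """Build an optimal prefix-free code: frequent symbols get shorter
    codewords by repeatedly merging the two least frequent subtrees."""
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate single-symbol input
        return {sym: "0" for sym in freq}
    # Heap entries: (frequency, tie_breaker, {symbol: codeword-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

message = "this is an example of a huffman tree"
code = huffman_code(message)
encoded = "".join(code[ch] for ch in message)
print(code[" "], code["e"], len(encoded), "bits")
```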
4.3. Knowledge Distillation
4.3.1. Distillation Schemes
- Offline distillation is the most common approach, where a teacher model is first pre-trained, and knowledge is transferred to train the student model. Due to advancements in deep learning, a wide range of pre-trained neural network models are available as teachers for various use cases. Offline distillation is well established and relatively easy to implement; a minimal offline distillation loss is sketched after this list.
- Online distillation addresses the limitation of large teacher models that may not fit into offline scenarios [96]. In online distillation, both the teacher and student models are updated simultaneously during an end-to-end training process [97]. This method is highly efficient and can be operationalized through parallel computing [98].
- Self-distillation involves using the same model as both the teacher and student, making it a special form of online distillation [99]. Unlike traditional two-stage distillation, which first trains the teacher model and then transfers knowledge to the student, self-distillation employs a one-step framework that directly trains the student model. This approach typically results in higher accuracy with less training time [100].
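A minimal sketch of the offline scheme is given below, assuming a pre-trained, frozen teacher and the classic temperature-scaled soft-target loss of Hinton et al. [106]; the model objects, optimizer, and data tensors (student, teacher, optimizer, x, y) are placeholders supplied by the surrounding training code.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Offline knowledge distillation objective:
    alpha * KL(student || softened teacher targets) + (1 - alpha) * CE."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * hard

def train_step(student, teacher, optimizer, x, y):
    """One offline distillation step: the teacher only provides targets."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    loss = kd_loss(s_logits, t_logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```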
4.3.2. Distillation Algorithms
- Adversarial distillation refers to distillation algorithms that use generative adversarial networks (GANs) to improve the knowledge transfer between teacher and student models. GANs are machine learning frameworks that generate new data with the same statistical characteristics as the training set [101]. Adversarial distillation can be categorized into three main approaches: (1) using an adversarial generator to create synthetic data, which are then added to the training set [102], (2) employing a discriminator to differentiate between samples generated by the teacher and student models [103], and (3) jointly optimizing both teacher and student models in an online framework [104].
- Multi-teacher distillation involves using multiple teacher models for knowledge transfer during the training of a student network [105]. The simplest approach to transferring knowledge from multiple teachers is to use the averaged response from all teachers as the supervision signal [106]; this strategy is sketched after this list. Additionally, both averaged logits and intermediate layer features can be utilized to enhance dissimilarity between different training samples, further improving the distillation process [107].
- Cross-modal distillation refers to distillation algorithms that transfer knowledge between different modalities, particularly when data or labels for certain modalities are unavailable during training or testing [108]. This approach is commonly applied to distill deep models for tasks such as image, video, and human action recognition.
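Building on the offline distillation sketch above, the simplest multi-teacher strategy, averaging the teachers' responses, can be sketched as follows; the teacher models are assumed to be frozen, and the helper name is illustrative.

```python
import torch

def multi_teacher_logits(teachers, x):
    """Average the logits of several frozen teachers to form a single
    supervision signal (the simplest multi-teacher strategy)."""
    with torch.no_grad():
        return torch.stack([t(x) for t in teachers], dim=0).mean(dim=0)

# Usage inside a training step, reusing kd_loss from the earlier sketch:
# t_logits = multi_teacher_logits([teacher_a, teacher_b, teacher_c], x)
# loss = kd_loss(student(x), t_logits, y)
```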
4.3.3. Distillation Applications
- Model distillation approaches have found widespread use in visual recognition tasks, such as image/video classification, segmentation, object detection, and estimation. In face recognition, model distillation enhances accuracy for low-resolution images and improves deployment efficiency. By leveraging various types of knowledge from complex data sources, model distillation helps address challenging scenarios and optimize performance across diverse applications.
- In natural language processing (NLP) applications, particularly in neural machine translation [109] and multilingual representation models [110], model distillation is employed to create lightweight, efficient, and effective models, as typical language models tend to be large and resource-intensive. For question-answering systems, distillation improves the efficiency and robustness of machine reading comprehension. By combining sequence knowledge, model distillation effectively transfers information from large networks to smaller, more efficient ones.
- Model distillation is often used to improve the performance of acoustic deep neural models in real-time speech recognition systems deployed on embedded platforms [111]. Model distillation is also used to train deep neural networks to identify acoustic scenes in audio segments [112] or to classify environmental sounds [113].
4.4. Dataset Distillation
4.4.1. Formulation and Overview
4.4.2. Methods
- Performance matching in dataset distillation, as proposed by Wang et al. [116], is a meta-learning-based approach that optimizes a synthetic dataset so that models trained on it perform as well as models trained on the original dataset. This method uses a bi-level algorithm: in the inner loop, the weights of a differentiable model are updated by gradient descent on the synthetic dataset; in the outer loop, the model trained in the inner loop is evaluated on the original dataset, and the resulting performance loss is backpropagated to update the synthetic dataset.
- Parameter matching in dataset distillation involves training a neural network on both the original and synthetic datasets, with the goal of ensuring consistency in the network’s parameters. This can be achieved through gradient matching in a single step [117]. However, since the synthetic dataset is updated over multiple steps while the parameters are matched in one step, errors may accumulate. To address this issue, a multi-step parameter matching approach has been proposed [118], where the models trained on both the original and synthetic datasets are optimized to minimize the distance between their parameters.
- Distribution matching aims to generate synthetic data that closely approximate the distribution of real data. This is achieved by minimizing the distance between the two distributions, typically using metrics such as maximum mean discrepancy [119]. However, directly estimating the real data distribution can be computationally expensive and imprecise, especially for high-dimensional data such as images. To overcome this, the maximum mean discrepancy can be approximated using embedding functions [120], as sketched after this list.
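A minimal sketch of distribution matching is shown below: learnable synthetic samples are updated so that their mean embedding under a randomly initialized encoder matches that of real batches, a simplified surrogate for maximum mean discrepancy in the spirit of [119,120]; the encoder, sample shapes, and single-step loop are illustrative assumptions, and per-class matching is omitted.

```python
import torch
import torch.nn as nn

def distribution_matching_step(syn_data, real_batch, embed, opt):
    """One update of the synthetic set: match the mean embeddings of
    real and synthetic samples under a (random) encoder."""
    real_feat = embed(real_batch).mean(dim=0).detach()   # real statistics, no grad
    syn_feat = embed(syn_data).mean(dim=0)
    loss = ((real_feat - syn_feat) ** 2).sum()           # empirical MMD surrogate
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy setup: 100 learnable synthetic "flow images" of shape (1, 28, 28) and a
# random linear encoder; real_batch stands in for a batch of real data.
syn_data = torch.randn(100, 1, 28, 28, requires_grad=True)
opt = torch.optim.SGD([syn_data], lr=1.0)
embed = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64))
real_batch = torch.randn(256, 1, 28, 28)
print(distribution_matching_step(syn_data, real_batch, embed, opt))
```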
4.4.3. Applications
- Continual learning refers to the ability of a model to sequentially tackle multiple tasks while retaining knowledge from prior tasks, even when old data are no longer accessible [121]. A primary challenge is catastrophic forgetting, where models tend to forget previously learned information when learning new tasks. To address this, it is essential to preserve prior knowledge in a limited memory buffer. Dataset distillation provides a solution by generating a condensed synthetic dataset that retains the knowledge of the original, larger dataset. Recent dataset distillation approaches [117,119,122] have been applied to scenarios involving continual learning, aiding in the retention of valuable information across tasks.
- Federated learning is a decentralized approach where multiple client devices collaborate with a central global server, allowing each client to train a local model on its data and contribute to a global model, without sharing the local data. Its primary goal is to improve the model performance while maintaining data privacy. However, frequent exchanges of models between clients and the global server can incur high costs. Dataset distillation offers a solution by enabling the exchange of distilled datasets, rather than models, thereby reducing the communication overhead while preserving data privacy [123,124,125,126].
- Privacy and security. Machine learning models are vulnerable to various privacy attacks, including membership inference, model reconstruction, property inference, and model extraction. Dataset distillation addresses these concerns by working with synthetic data, which preserves the privacy of the original dataset. This approach allows for machine learning tasks without exposing sensitive information. Notable works in this area are presented in [20,127,128].
- Others. Dataset distillation has applications in recommendation systems [129] and text classification [130], where models typically require large datasets. By significantly reducing the size of the training data, this method lowers the computational load and improves the time efficiency. Additionally, in medical systems, dataset distillation can generate synthetic data to replace the original data, ensuring the protection of patient privacy [131].
5. Condensation in Flow Classification
5.1. Metadata of Flows
5.2. Flow Models
5.3. Traffic Classification Application
- Elephant flow classification applications. Elephant flow classification plays a critical role in data center management. Network traffic is typically divided into elephant and mouse flows, based on their lifetime characteristics. Elephant flows, with long lifetimes, represent a small percentage (10–20%) of total flows but contribute significantly (80–90%) to the overall traffic volume. In contrast, mouse flows, with short lifetimes, make up the majority of flows but only a minor portion of the traffic volume. Classifying flows into these two categories helps inform network management strategies that optimize load balancing, improve link utilization, and mitigate congestion. Elephant flow classification is an essential tool in network traffic engineering, which aims to enhance operational performance, reduce congestion, and control costs. Our prior research has explored various techniques for categorizing elephant flows [132]; a toy threshold-based split is sketched after this list.
- Malicious flow detection. Malicious flows refer to harmful network traffic that can degrade network performance, disrupt normal operations, and compromise devices or servers. Malicious flow detection systems continuously monitor network traffic for signs of suspicious links, files, or connections, and assess whether these links, such as those from blacklisted URLs or command-and-control (C2) sites, constitute malicious activity. These systems leverage large volumes of security data gathered from millions of devices worldwide to identify both known and unknown threats [135]. Because modern Internet traffic is largely encrypted, with TCP/UDP port numbers no longer reliable indicators and packet payloads inaccessible, malicious flow detection often relies on machine learning techniques. These methods help train classifiers to identify malicious flows and classify unknown network traffic. For further details on machine learning in malicious flow detection, see [136].
- Continual (incremental) flow classification. Network traffic packets continuously flow through a network, with flows representing sequences of packets that constantly arrive and depart. To accurately classify each flow, a flow classifier must operate on incoming packets in a continual manner. Many traditional flow classification systems assume network traffic characteristics remain stable over time and space, leading to static flow models and classifiers. However, these static models are ill-suited for dynamic environments, where network traffic evolves and flow characteristics vary. Continual (or incremental) learning in flow classification allows for dynamic classifiers that adapt to the changing nature of network traffic. By incorporating continual learning, flow models can be updated to better handle the fluctuations in real-world network traffic, improving detection accuracy and time efficiency, especially for malicious flows [137]. Additionally, continual learning enables models to adapt by updating malicious flow information and improving detection over time [138].
- Federated flow classification. Federated flow classification combines federated learning with network flow classification to create a distributed classifier across multiple clients in a multi-level system [139]. This approach is widely used in mobile networks [4] and the IoT [5] for classifying network or malicious flows, leveraging resource-constrained devices while utilizing edge computing. Federated flow classification allows local traffic to be classified on individual devices, preserving traffic privacy. A global classifier, trained by a central server, improves the performance by integrating insights from multiple local classifiers, without the need to exchange traffic data. This method effectively addresses the challenges related to traffic drift, data privacy, and resource and time efficiency.
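As a toy illustration of the elephant/mouse split described in the first item of this list, the sketch below labels flow records by byte-count and lifetime thresholds; the record fields and threshold values are illustrative assumptions, and production systems typically tune thresholds per link or learn them from data.

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    five_tuple: tuple          # (src_ip, dst_ip, src_port, dst_port, proto)
    duration_s: float          # flow lifetime in seconds
    total_bytes: int           # bytes carried by the flow

def label_flow(flow: FlowRecord,
               byte_threshold: int = 10_000_000,
               duration_threshold_s: float = 10.0) -> str:
    """Label a flow as 'elephant' (long-lived, high-volume) or
    'mouse' (short-lived, low-volume). Thresholds are illustrative."""
    if flow.total_bytes >= byte_threshold and flow.duration_s >= duration_threshold_s:
        return "elephant"
    return "mouse"

flows = [
    FlowRecord(("10.0.0.1", "10.0.0.2", 443, 51000, "TCP"), 120.0, 2_500_000_000),
    FlowRecord(("10.0.0.3", "10.0.0.9", 53, 40001, "UDP"), 0.2, 600),
]
print([label_flow(f) for f in flows])   # ['elephant', 'mouse']
```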
5.4. Condensation in Network Traffic Classification
- Applying coreset techniques has facilitated incremental learning for mobile network traffic classification [137] and malicious flow detection across various network scenarios. For example, ref. [140] applied Bayesian coresets in Hilbert space to handle large, redundant datasets in network intrusion detection. This approach used a weighted data subset to maintain both efficiency and accuracy. Similarly, ref. [6] employed coreset techniques to extract relevant information from anomalous data, filtering out unnecessary samples, and developed a faster clustering model that minimized the response time and enhanced the performance of intrusion detection in vehicular ad hoc networks.
- Applying data compression in network flow classification aims to reduce redundancy, thereby minimizing memory and bandwidth usage required for the storage and transmission of traffic data [141]. For instance, ref. [142] utilized differential and Huffman coding to compress data for IoT nodes in wireless sensor networks, conserving energy, memory, and bandwidth. Research has also extended data compression to encrypted and compressed datasets. In [143], an approach named high entropy distinguisher (HEDGE) is proposed that trains models on compressed packets for real-time network traffic classification with comparable accuracy. Similarly, ref. [144] introduced a raw data and engineered features classification (RDEFC) approach, which learned patterns from both statistical tests and raw file fragments, enhancing classification accuracy on encrypted and compressed traffic in wireless networks. In federated learning for network traffic classification, compression techniques are employed to compress local and global models, reducing communication costs. Three compression algorithms are evaluated in [145] for federated learning in wireless traffic classification, as detailed in Table 6.
- Applying model distillation has been effective not only in deep learning for neural networks but also in constructing multimodal models for encrypted network traffic classification, achieving high accuracy with minimal memory and time cost [146]. It has also been used to address catastrophic forgetting in malicious traffic classification through incremental learning [138], and to implement in-network intelligence using lightweight student models, which significantly reduce the model size and training time on network devices [147]. Additionally, ref. [148] proposed lightweight deep learning models with up to 50% and 30% reductions in floating-point operations (FLOPs) and parameters, respectively, for traffic classification on IoT devices.
- Applying dataset distillation has primarily been explored in neural networks for image classification and has not yet been directly applied to network flow or malicious flow classification. However, since network traffic flows can be transformed into virtualized images [149,150], neural network-based dataset distillation could potentially be adapted to distill flow traffic datasets; a minimal flow-to-image conversion is sketched after this list. Ref. [151] proposed a Bayesian multilayer perceptron (BMLP) method, which reduces input data size by distilling important bytes for accurate network traffic classification with low memory and time cost, although BMLP differs significantly from the typical dataset distillation approaches discussed in this paper.
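A minimal sketch of such a flow-to-image conversion is shown below: the first bytes of a flow are zero-padded or truncated and reshaped into a fixed-size grayscale image; the image size and toy payload are illustrative assumptions, not the exact preprocessing of [149,150].

```python
import numpy as np

def flow_to_image(payload: bytes, side: int = 28) -> np.ndarray:
    """Map the first side*side bytes of a flow's payload to a
    (side, side) grayscale image in [0, 1]; shorter flows are
    zero-padded, longer ones truncated. Sizes are illustrative."""
    n = side * side
    buf = np.frombuffer(payload[:n], dtype=np.uint8)
    img = np.zeros(n, dtype=np.float32)
    img[: buf.size] = buf.astype(np.float32) / 255.0
    return img.reshape(side, side)

img = flow_to_image(b"\x16\x03\x01" + b"\x00" * 500)   # toy TLS-like payload
print(img.shape, img.max())
```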
5.5. Comparison of Condensation Methods in Network Traffic Classification
| Methods | Ref. | Type | Purpose | SR | TR | AR |
|---|---|---|---|---|---|---|
| Coreset | [140] | Weighted subset of input | Redundancy reduction | 98% | 10–50% | 10–30% |
| | [137] | Herding selection | Size reduction for IL | 44% | 90% | 5.4% |
| | [6] | Weighted subset of input | Size reduction in VANET | - | 79+% | 5+% |
| Data compression | [142] | Differential/Huffman coding | Memory/bandwidth saving | 85% | - | 0 |
| | [143] | ZIP/RAR/BZIP2/GZIP | Size reduction | 91% | - | 8% |
| | [144] | BZIP2/ZIP/RAR/GZIP | Size reduction | - | - | 10% |
| | [145] | PQ/QSGD/TopK | Bandwidth saving | 74+% | 74+% | 8% |
| Model distillation | [146] | Parameter matching | Multimodal learning | 82% | 56% | 7% |
| | [138] | EWC/GEM | Info. forgetting in IL | - | - | +80% |
| | [147] | Distribution matching | Model reduction | 80% | 99% | +3% |
| | [148] | Two-step distillation | Model compression | 30% | 50% | 0.37% |
| Dataset distillation | [151] | Byte importance distillation | Size reduction | 132 | 234 | 37% |
| | [149] | - | Traffic image conversion | - | - | - |
| | [150] | - | Traffic image conversion | - | - | - |
6. Challenges and Open Issues
- Dealing with class imbalance in network traffic: Class imbalance is a significant challenge in machine learning (ML), particularly in network traffic classification, where unequal class distributions can severely affect model performance. This issue arises when the majority classes dominate, leading to biased models that perform poorly on minority classes. In network traffic, this imbalance is especially pronounced due to the inherent characteristics of Internet traffic. For instance, malicious traffic is often rare compared to benign traffic. Additionally, in network flow analysis, roughly 90% of the flows are "mouse flows," which are short-lived and consume little bandwidth, while only about 10% are "elephant flows," which are long-lived and consume most of the bandwidth. To address class imbalance, two common solutions are random oversampling and synthetic minority oversampling; a minimal oversampling sketch is given after this list. These techniques aim to balance class distributions by augmenting the minority class with additional samples. However, network traffic classification poses unique challenges for such techniques due to the complexity of network data [152]. Neural networks, in particular, tend to over-classify majority classes due to their increased prior probability [153], which can significantly degrade the performance of models, especially teacher models, in classifying minority classes. This, in turn, exacerbates the difficulty of knowledge distillation, where the accuracy of student models on minority classes is further compromised. To mitigate this, incorporating additional relational information, such as soft labels or pairwise similarities between data points across different classes [154], or combining knowledge distillation with reinforcement learning techniques [155], may help improve the learning of minority classes. For dataset distillation, the challenge becomes even more pronounced, as generating a synthetic dataset that maintains accuracy for minority classes comparable to the original is difficult. However, this specific aspect has not yet been adequately addressed in the literature, and further research is needed to develop effective strategies for class imbalance in dataset distillation.
- Applying knowledge distillation on non-neural network-based models: Neural networks are large, resource-intensive models, and training them demands significant time and computational power. As such, transferring knowledge from neural networks to non-neural models—or between non-neural models—becomes crucial for extending knowledge distillation to network traffic classification tasks. This also helps improve the efficiency of such tasks, particularly in resource-constrained environments like the network edge. A deep understanding of the types of knowledge present in neural networks and how to effectively combine these different knowledge types is central to the success of knowledge distillation. Non-neural network-based models are commonly used in flow classification and other classification tasks, so developing methods for transferring knowledge to and from these models presents a challenge. One possible solution is to convert non-neural models into neural networks, thereby leveraging existing knowledge distillation techniques. Some researchers suggest using decision tree-based models for knowledge distillation, as decision trees can be derived from or mapped to neural networks [156,157]. However, this approach may face significant challenges in enhancing resource efficiency, given the inherent resource demands of neural networks.
- Using dataset distillation in non-image classification tasks: The application of dataset distillation in non-image classification tasks is an emerging area of research. While dataset distillation is well established in neural networks for image classification, its effectiveness in non-image tasks, such as network traffic classification, remains less explored. One potential approach is converting structural flow datasets into images, which has been proposed for tasks like malicious flow classification [150] and encryption flow classification [158]. This transformation allows dataset distillation to reduce resource consumption and improve the time efficiency in flow classification tasks. Furthermore, recent studies indicate that text-based network traffic flows can be represented as bit arrays, enabling high-accuracy malicious traffic classification [159,160]. This suggests that dataset distillation techniques, whether image-based or not, can be applied to non-image classification tasks as well. Additionally, the relationship between decision trees and neural networks has been explored, opening the possibility of using decision trees and decision tree-based algorithms for distilling datasets in non-image tasks. In cases where gradient descent methods are not applicable, generic search algorithms, such as grid search and genetic algorithms, can be used to optimize synthetic datasets for non-image classification models. This broadens the scope of dataset distillation, making it adaptable to various types of machine learning models beyond neural networks.
- Combining multiple data condensation techniques: Given the resource constraints, concept drifts, and catastrophic forgetting issues prevalent in current network traffic classification scenarios, it is essential to combine various data condensation techniques. Classic techniques like coreset selection and data compression can be integrated with advanced methods such as model and dataset distillation to enhance performance in resource-constrained environments. In knowledge distillation, it is crucial to compress the teacher model by eliminating redundancies before transferring knowledge to a student model. In dataset distillation, constructing a coreset from the original dataset plays a vital role in reducing dataset size and guiding the distillation of the synthetic dataset. Compression techniques can also be applied to real-world datasets to minimize redundancy and reduce training time in flow classification tasks. To address real-world time and resource limitations, a combination of multiple dataset condensation techniques is necessary. Given that network traffic is often encrypted and compressed for privacy and resource efficiency, investigating malicious flow models on encrypted and compressed network traffic has become a key area of research [143,144].
- Applying data condensation on the network edge: Network edge devices are typically characterized by resource constraints in terms of computation, storage, bandwidth, and power, making it challenging to deploy traffic classification applications effectively. Data condensation plays a critical role in enhancing the efficiency and feasibility of such applications on these devices. Given that these network devices are commonly used in environments like wireless sensor networks, mobile networks, and vehicular ad hoc networks, which have stringent performance requirements—such as low response times, high robustness, and minimal latency—deploying network traffic classification tasks becomes highly complex. Additionally, migrating from large models to smaller, more efficient ones without significant performance degradation, ensuring robust data privacy, maintaining low latency, and improving network interoperability all add layers of complexity to the deployment process. Techniques like data condensation on programmable network datapaths (e.g., P4) and offloading in SDN devices [161] can enhance network performance while enabling in-network intelligence, providing a pathway to address these deployment challenges.
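As referenced in the class-imbalance item above, the following minimal sketch implements plain random oversampling of minority flow classes; it is a didactic baseline, and synthetic minority oversampling or class-weighted losses may be preferable for real traffic datasets [152]. The feature matrix and label vector below are illustrative stand-ins for flow features and benign/malicious labels.

```python
import numpy as np

def random_oversample(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """Duplicate minority-class samples (with replacement) until every
    class has as many samples as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts_X, parts_y = [X], [y]
    for cls, cnt in zip(classes, counts):
        if cnt < target:
            idx = np.flatnonzero(y == cls)
            extra = rng.choice(idx, size=target - cnt, replace=True)
            parts_X.append(X[extra])
            parts_y.append(y[extra])
    Xb, yb = np.concatenate(parts_X), np.concatenate(parts_y)
    perm = rng.permutation(len(yb))
    return Xb[perm], yb[perm]

# Toy usage: roughly 95% benign flows (label 0) vs. 5% malicious flows (label 1).
X = np.random.default_rng(1).standard_normal((1000, 8))
y = (np.random.default_rng(2).random(1000) < 0.05).astype(int)
Xb, yb = random_oversample(X, y)
print(np.bincount(y), np.bincount(yb))
```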
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
ASCII | American Standard Code for Information Interchange |
EBCDIC | Extended Binary Coded Decimal Interchange Code |
GNN | Graph Neural Network |
IoT | Internet of Things |
VANET | Vehicular Ad Hoc Network |
SDN | Software-Defined Networking |
SQL | Structured Query Language |
TCP | Transmission Control Protocol |
UDP | User Datagram Protocol |
URL | Uniform Resource Locator |
XML | Extensible Markup Language |
References
- Pacheco, F.; Exposito, E.; Gineste, M.; Baudoin, C.; Aguilar, J. Towards the deployment of machine learning solutions in network traffic classification: A systematic survey. IEEE Commun. Surv. Tutor. 2018, 21, 1988–2014. [Google Scholar]
- Lu, J.; Liu, A.; Dong, F.; Gu, F.; Gama, J.; Zhang, G. Learning under concept drift: A review. IEEE Trans. Knowl. Data Eng. 2018, 31, 2346–2363. [Google Scholar]
- Filali, A.; Abouaomar, A.; Cherkaoui, S.; Kobbane, A.; Guizani, M. Multi-access edge computing: A survey. IEEE Access 2020, 8, 197017–197046. [Google Scholar] [CrossRef]
- Lim, W.Y.B.; Luong, N.C.; Hoang, D.T.; Jiao, Y.; Liang, Y.C.; Yang, Q.; Niyato, D.; Miao, C. Federated learning in mobile edge networks: A comprehensive survey. IEEE Commun. Surv. Tutor. 2020, 22, 2031–2063. [Google Scholar]
- Nguyen, D.C.; Ding, M.; Pathirana, P.N.; Seneviratne, A.; Li, J.; Poor, H.V. Federated learning for internet of things: A comprehensive survey. IEEE Commun. Surv. Tutor. 2021, 23, 1622–1658. [Google Scholar] [CrossRef]
- Bangui, H.; Ge, M.; Buhnova, B. A hybrid machine learning model for intrusion detection in VANET. Computing 2022, 104, 503–531. [Google Scholar] [CrossRef]
- Feldman, D. Core-sets: Updated survey. In Sampling Techniques for Supervised or Unsupervised Tasks; Springer: Cham, Switzerland, 2020; pp. 23–44. [Google Scholar]
- Jubran, I.; Maalouf, A.; Feldman, D. Overview of accurate coresets. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2021, 11, e1429. [Google Scholar]
- Huang, L.; Sudhir, K.; Vishnoi, N. Coresets for time series clustering. Adv. Neural Inf. Process. Syst. 2021, 34, 22849–22862. [Google Scholar]
- Jayasankar, U.; Thirumal, V.; Ponnurangam, D. A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications. J. King Saud-Univ.-Comput. Inf. Sci. 2021, 33, 119–140. [Google Scholar] [CrossRef]
- Lungisani, B.A.; Lebekwe, C.K.; Zungeru, A.M.; Yahya, A. Image compression techniques in wireless sensor networks: A survey and comparison. IEEE Access 2022, 10, 82511–82530. [Google Scholar] [CrossRef]
- Bidwe, R.V.; Mishra, S.; Patil, S.; Shaw, K.; Vora, D.R.; Kotecha, K.; Zope, B. Deep learning approaches for video compression: A bibliometric analysis. Big Data Cogn. Comput. 2022, 6, 44. [Google Scholar] [CrossRef]
- Chiarot, G.; Silvestri, C. Time series compression survey. ACM Comput. Surv. 2023, 55, 1–32. [Google Scholar] [CrossRef]
- Jia, W.; Sun, M.; Lian, J.; Hou, S. Feature dimensionality reduction: A review. Complex Intell. Syst. 2022, 8, 2663–2693. [Google Scholar] [CrossRef]
- Tang, Y.; Chen, D.; Li, X. Dimensionality reduction methods for brain imaging data (BID) analysis. ACM Comput. Surv. 2021, 54, 1–36. [Google Scholar] [CrossRef]
- Hu, C.; Li, X.; Liu, D.; Wu, H.; Chen, X.; Wang, J.; Liu, X. Teacher-Student Architecture for Knowledge Distillation: A Survey. arXiv 2023, arXiv:2308.04268. [Google Scholar]
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
- Tian, Y.; Pei, S.; Zhang, X.; Zhang, C.; Chawla, N.V. Knowledge Distillation on Graphs: A Survey. arXiv 2023, arXiv:2302.00219. [Google Scholar] [CrossRef]
- Alkhulaifi, A.; Alsahli, F.; Ahmad, I. Knowledge distillation in deep learning and its applications. PeerJ Comput. Sci. 2021, 7, e474. [Google Scholar] [CrossRef]
- Liu, Y.; Li, Z.; Backes, M.; Shen, Y.; Zhang, Y. Backdoor attacks against dataset distillation. arXiv 2023, arXiv:2301.01197. [Google Scholar]
- Luo, W. A comprehensive survey on knowledge distillation of diffusion models. arXiv 2023, arXiv:2304.04262. [Google Scholar]
- Li, Z.; Xu, P.; Chang, X.; Yang, L.; Zhang, Y.; Yao, L.; Chen, X. When object detection meets knowledge distillation: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10555–10579. [Google Scholar] [CrossRef]
- Meng, H.; Lin, Z.; Yang, F.; Xu, Y.; Cui, L. Knowledge distillation in medical data mining: A survey. In Proceedings of the 5th International Conference on Crowd Science and Engineering, Jinan, China, 16–18 October 2021; pp. 175–182. [Google Scholar]
- Yu, R.; Liu, S.; Wang, X. Dataset distillation: A comprehensive review. arXiv 2023, arXiv:2301.07014. [Google Scholar] [CrossRef] [PubMed]
- Lei, S.; Tao, D. A comprehensive survey to dataset distillation. arXiv 2023, arXiv:2301.05603. [Google Scholar] [CrossRef] [PubMed]
- Sachdeva, N.; McAuley, J. Data distillation: A survey. arXiv 2023, arXiv:2301.04272. [Google Scholar]
- Cafarella, M.J.; Halevy, A.; Madhavan, J. Structured data on the web. Commun. ACM 2011, 54, 72–79. [Google Scholar] [CrossRef]
- Doan, A.; Naughton, J.; Baid, A.; Chai, X.; Chen, F.; Chen, T.; Chu, E.; DeRose, P.; Gao, B.; Gokhale, C.; et al. The case for a structured approach to managing unstructured data. arXiv 2009, arXiv:0909.1783. [Google Scholar]
- Arasu, A.; Garcia-Molina, H. Extracting structured data from web pages. In Proceedings of the ACM SIGMOD’03, San Diego, CA, USA, 9–12 June 2003; pp. 337–348. [Google Scholar]
- Feldman, R.; Sanger, J. The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar]
- Van Dijck, J. The Culture of Connectivity: A Critical History of Social Media; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
- Feldman, S.; Hanover, J.; Burghard, C.; Schubmehl, D. Unlocking the power of unstructured data. IDC Health Insights 2012, 2012, 1–10. [Google Scholar]
- Eberendu, A.C. Unstructured Data: An overview of the data of Big Data. Int. J. Comput. Trends Technol. 2016, 38, 46–50. [Google Scholar] [CrossRef]
- Srivastava, D.; Velegrakis, Y. Intentional associations between data and metadata. In Proceedings of the SIGMOD’07, Beijing, China, 11–14 June 2007; pp. 401–412. [Google Scholar]
- Siadat, M.R.; Soltanian-Zadeh, H.; Fotouhi, F.; Eetemadi, A.; Elisevich, K. Data modeling for content-based support environment (C-BASE): Application on epilepsy data mining. In Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, Omaha, NE, USA, 28–31 October 2007; pp. 181–188. [Google Scholar]
- Amato, G.; Mainetto, G.; Savino, P. An approach to a content-based retrieval of multimedia data. In Multimedia Information Systems; Subrahmanian, V.S., Tripathi, S.K., Eds.; Springer: Boston, MA, USA, 1998; pp. 9–36. [Google Scholar]
- Li, W.; Lang, B. A tetrahedral data model for unstructured data management. Sci. China Inf. Sci. 2010, 53, 1497–1510. [Google Scholar] [CrossRef]
- Li, G.; Ooi, B.C.; Feng, J.; Wang, J.; Zhou, L. EASE: An effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In Proceedings of the ACM SIGMOD’08, Vancouver, BC, Canada, 9–12 June 2008; pp. 903–914. [Google Scholar]
- Asai, T.; Abe, K.; Kawasoe, S.; Sakamoto, H.; Arimura, H.; Arikawa, S. Efficient substructure discovery from large semi-structured data. IEICE Trans. Inf. Syst. 2004, 87, 2754–2763. [Google Scholar]
- Provost, F.; Kolluri, V. A survey of methods for scaling up inductive algorithms. Data Min. Knowl. Discov. 1999, 3, 131–169. [Google Scholar] [CrossRef]
- Braverman, V.; Jiang, S.H.C.; Krauthgamer, R.; Wu, X. Coresets for ordered weighted clustering. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 744–753. [Google Scholar]
- Cohen-Addad, V.; Saulpic, D.; Schwiegelshohn, C. Improved coresets and sublinear algorithms for power means in euclidean spaces. Adv. Neural Inf. Process. Syst. 2021, 34, 21085–21098. [Google Scholar]
- Rosman, G.; Volkov, M.; Feldman, D.; Fisher, J.W., III; Rus, D. Coresets for k-segmentation of streaming data. In Proceedings of the Conference of Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; p. 27. [Google Scholar]
- Tukan, M.; Maalouf, A.; Feldman, D. Coresets for near-convex functions. Adv. Neural Inf. Process. Syst. 2020, 33, 997–1009. [Google Scholar]
- Feldman, D.; Monemizadeh, M.; Sohler, C.; Woodruff, D.P. Coresets and sketches for high dimensional subspace approximation problems. In Proceedings of the 21st annual ACM-SIAM symposium on Discrete Algorithms, Austin, TX, USA, 17–19 January 2010; pp. 630–649. [Google Scholar]
- Karnin, Z.; Liberty, E. Discrepancy, coresets, and sketches in machine learning. In Proceedings of the 32nd Conference on Learning Theory, Phoenix, AZ, USA, 25–28 June 2019; pp. 1975–1993. [Google Scholar]
- Feldman, D.; Volkov, M.; Rus, D. Dimensionality reduction of massive sparse datasets using coresets. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; p. 29. [Google Scholar]
- Phillips, J.M. Coresets and sketches. In Handbook of Discrete and Computational Geometry; Toth, C.D., O’Rourke, J., Goodman, J.E., Eds.; Chapman and Hall/CRC: New York, NY, USA, 2017; pp. 1269–1288. [Google Scholar]
- Braverman, V.; Feldman, D.; Lang, H.; Statman, A.; Zhou, S. New frameworks for offline and streaming coreset constructions. arXiv 2016, arXiv:1612.00889. [Google Scholar]
- Tremblay, N.; Barthelmé, S.; Amblard, P.O. Determinantal Point Processes for Coresets. J. Mach. Learn. Res. 2019, 20, 1–70. [Google Scholar]
- Tropp, J.A. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 2015, 8, 1–230. [Google Scholar]
- Maalouf, A.; Jubran, I.; Feldman, D. Introduction to coresets: Approximated mean. arXiv 2021, arXiv:2111.03046. [Google Scholar]
- Clarkson, K.L. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Trans. Algorithms 2010, 6, 63. [Google Scholar] [CrossRef]
- Catlett, J. Megainduction: Machine Learning on Very Large Databases. Ph.D. Thesis, Basser Department of Computer Science, University of Sydney, Darlington, Australia, 1991. [Google Scholar]
- Lewis, D.D.; Catlett, J. Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the 11th International Conference in Machine Learning, New Brunswick, NJ, Canada, 10–13 July 1994; pp. 148–156. [Google Scholar]
- Roy, N.; McCallum, A. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, Vancouver, BC, Canada, 13–19 July 2001. [Google Scholar]
- Cohen, M.B.; Lee, Y.T.; Musco, C.; Musco, C.; Peng, R.; Sidford, A. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, New York, NJ, USA, 11–15 January 2015; pp. 181–190. [Google Scholar]
- Tokdar, S.T.; Kass, R.E. Importance sampling: A review. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 54–60. [Google Scholar] [CrossRef]
- Zheng, Y.; Phillips, J.M. Coresets for kernel regression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 645–654. [Google Scholar]
- Braverman, V.; Frahling, G.; Lang, H.; Sohler, C.; Yang, L.F. Clustering high dimensional dynamic data streams. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 July 2017; pp. 576–585. [Google Scholar]
- Chen, J.; Yang, Q.; Huang, R.; Ding, H. Coresets for Relational Data and the Applications. Adv. Neural Inf. Process. Syst. 2022, 35, 434–448. [Google Scholar]
- Campbell, T.; Broderick, T. Bayesian coreset construction via greedy iterative geodesic ascent. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 698–706. [Google Scholar]
- Ding, H.; Yu, H.; Wang, Z. Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction. arXiv 2019, arXiv:1901.08219. [Google Scholar]
- Ash, R.B. Information Theory; Courier Corporation: North Chelmsford, NC, USA, 2012. [Google Scholar]
- Sayood, K. Introduction to Data Compression; Morgan Kaufmann: Burlington, MA, USA, 2017. [Google Scholar]
- Lelewer, D.A.; Hirschberg, D.S. Data compression. ACM Comput. Surv. 1987, 19, 261–296. [Google Scholar]
- Moffat, A. Huffman coding. ACM Comput. Surv. (CSUR) 2019, 52, 1–35. [Google Scholar]
- Gasieniec, L.; Karpinski, M.; Plandowski, W.; Rytter, W. Efficient algorithms for Lempel-Ziv encoding. In Proceedings of the Algorithm Theory—SWAT’96, Reykjavík, Iceland, 3–5 July 1996; pp. 392–403. [Google Scholar]
- Rissanen, J.; Langdon, G.G. Arithmetic coding. IBM J. Res. Dev. 1979, 23, 149–162. [Google Scholar] [CrossRef]
- Bell, T.; Witten, I.H.; Cleary, J.G. Modeling for text compression. ACM Comput. Surv. (CSUR) 1989, 21, 557–591. [Google Scholar] [CrossRef]
- Jain, A.K. Image data compression: A review. Proc. IEEE 1981, 69, 349–389. [Google Scholar]
- Jayant, N.S.; Chen, E.Y. Audio compression: Technology and applications. AT&T Tech. J. 1995, 74, 23–34. [Google Scholar]
- Denil, M.; Shakibi, B.; Dinh, L.; Ranzato, M.A.; De Freitas, N. Predicting parameters in deep learning. Adv. Neural Inf. Process. Syst. 2013, 26. [Google Scholar]
- Gupta, M.; Agrawal, P. Compression of deep learning models for text: A survey. ACM Trans. Knowl. Discov. Data (TKDD) 2022, 16, 1–55. [Google Scholar]
- Kavitha, P. A survey on lossless and lossy data compression methods. Int. J. Comput. Sci. Eng. Technol. 2016, 7, 110–114. [Google Scholar]
- Cappello, F.; Di, S.; Li, S.; Liang, X.; Gok, A.M.; Tao, D.; Yoon, C.H.; Wu, X.C.; Alexeev, Y.; Chong, F.T. Use cases of lossy compression for floating-point data in scientific data sets. Int. J. High Perform. Comput. Appl. 2019, 33, 1201–1220. [Google Scholar] [CrossRef]
- Correa, J.D.A.; Pinto, A.S.R.; Montez, C. Lossy data compression for iot sensors: A review. Internet Things 2022, 19, 100516. [Google Scholar] [CrossRef]
- Cheng, Y.; Wang, D.; Zhou, P.; Zhang, T. A survey of model compression and acceleration for deep neural networks. arXiv 2017, arXiv:1710.09282. [Google Scholar]
- Witten, I.H.; Neal, R.M.; Cleary, J.G. Arithmetic coding for data compression. Commun. ACM 1987, 30, 520–540. [Google Scholar] [CrossRef]
- Moffat, A.; Neal, R.M.; Witten, I.H. Arithmetic coding revisited. ACM Trans. Inf. Syst. (TOIS) 1998, 16, 256–294. [Google Scholar] [CrossRef]
- Ziv, J.; Lempel, A. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 1977, 23, 337–343. [Google Scholar] [CrossRef]
- Ziv, J.; Lempel, A. Compression of individual sequences via variable-length coding. IEEE Trans. Inf. Theory 1978, 24, 530–536. [Google Scholar] [CrossRef]
- Azeem, S.; Khan, A.; Qamar, E.; Tariq, U.; Shabbir, J. A Survey: Different Loss-less Compression Techniques. Int. J. Technol. Res. 2016, 4, 1–4. [Google Scholar]
- Bentley, J.L.; Sleator, D.D.; Tarjan, R.E.; Wei, V.K. Locally Adaptive Data Compression Scheme. Commun. ACM 1986, 29, 320–330. [Google Scholar] [CrossRef]
- Golomb, S. Run-length encodings (corresp.). IEEE Trans. Inf. Theory 1966, 12, 399–401. [Google Scholar] [CrossRef]
- Manzini, G. An analysis of the Burrows—Wheeler transform. J. ACM (JACM) 2001, 48, 407–430. [Google Scholar] [CrossRef]
- Daykin, J.W.; Groult, R.; Guesnet, Y.; Lecroq, T.; Lefebvre, A.; Léonard, M.; Prieur-Gaston, E. A survey of string orderings and their application to the Burrows–Wheeler transform. Theor. Comput. Sci. 2018, 710, 52–65. [Google Scholar] [CrossRef]
- Fisher, Y. Fractal image compression. Fractals 1994, 2, 347–361. [Google Scholar] [CrossRef]
- Fathi, A.; Abduljabbar, N.S. Survey on Fractal image compression. J. Res. Sci. Eng. Technol. 2020, 8, 5–8. [Google Scholar] [CrossRef]
- Husain, A.; Nanda, M.N.; Chowdary, M.S.; Sajid, M. Fractals: An Eclectic Survey, Part II. Fractal Fract. 2022, 6, 379. [Google Scholar] [CrossRef]
- Bentley, P.M.; McDonnell, J.T.E. Wavelet transforms: An introduction. Electron. Commun. Eng. J. 1994, 6, 175–186. [Google Scholar] [CrossRef]
- Ramakrishnan, A.G.; Saha, S. ECG coding by wavelet-based linear prediction. IEEE Trans. Biomed. Eng. 1997, 44, 1253–1261. [Google Scholar] [CrossRef]
- Sifuzzaman, M.; Islam, M.R.; Ali, M.Z. Application of wavelet transform and its advantages compared to Fourier transform. J. Phys. Sci. 2009, 13, 121–134. [Google Scholar]
- Liang, T.; Glossner, J.; Wang, L.; Shi, S.; Zhang, X. Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing 2021, 461, 370–403. [Google Scholar] [CrossRef]
- Yang, C.; Yu, X.; An, Z.; Xu, Y. Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation. In Advancements in Knowledge Distillation: Towards New Horizons of Intelligent Systems; Springer International Publishing: Cham, Switzerland, 2023; pp. 1–32. [Google Scholar]
- Li, L.; Jin, Z. Shadow knowledge distillation: Bridging offline and online knowledge transfer. Adv. Neural Inf. Process. Syst. 2022, 35, 635–649. [Google Scholar]
- Mullapudi, R.T.; Chen, S.; Zhang, K.; Ramanan, D.; Fatahalian, K. Online model distillation for efficient video inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3573–3582. [Google Scholar]
- Anil, R.; Pereyra, G.; Passos, A.; Orm, I.R.; Dahl, G.E.; Hinton, G.E. Large scale distributed neural network training through online distillation. arXiv 2018, arXiv:1804.03235. [Google Scholar]
- Zhang, L.; Bao, C.; Ma, K. Self-distillation: Towards efficient and compact neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4388–4403. [Google Scholar]
- Zhang, L.; Song, J.; Gao, A.; Chen, J.; Bao, C.; Ma, K. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3713–3722. [Google Scholar]
- Zhu, J.; Yao, J.; Han, B.; Zhang, J.; Liu, T.; Niu, G.; Zhou, J.; Xu, J.; Yang, H. Reliable adversarial distillation with unreliable teachers. arXiv 2021, arXiv:2106.04928. [Google Scholar]
- Ye, J.; Ji, Y.; Wang, X.; Gao, X.; Song, M. Data-free knowledge amalgamation via group-stack dual-gan. In Proceedings of the IEEECVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12516–12525. [Google Scholar]
- Xu, Z.; Hsu, Y.C.; Huang, J. Training shallow and thin networks for acceleration via knowledge distillation with conditional adversarial networks. arXiv 2017, arXiv:1709.00513. [Google Scholar]
- Chung, I.; Park, S.; Kim, J.; Kwak, N. Feature-map-level online adversarial knowledge distillation. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 2006–2015. [Google Scholar]
- Liu, Y.; Zhang, W.; Wang, J. Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing 2020, 415, 106–113. [Google Scholar]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
- You, S.; Xu, C.; Xu, C.; Tao, D. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1285–1294. [Google Scholar]
- Gupta, S.; Hoffman, J.; Malik, J. Cross modal distillation for supervision transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2827–2836. [Google Scholar]
- Hahn, S.; Choi, H. Self-knowledge distillation in natural language processing. arXiv 2019, arXiv:1908.01851. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Thakker, M.; Eskimez, S.E.; Yoshioka, T.; Wang, H. Fast real-time personalized speech enhancement: End-to-end enhancement network (E3Net) and knowledge distillation. arXiv 2022, arXiv:2204.00771. [Google Scholar]
- Jung, J.W.; Heo, H.S.; Shim, H.J.; Yu, H.J. Knowledge distillation in acoustic scene classification. IEEE Access 2020, 8, 166870–166879. [Google Scholar]
- Tripathi, A.M.; Pandey, O.J. Divide and distill: New outlooks on knowledge distillation for environmental sound classification. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1100–1113. [Google Scholar] [CrossRef]
- Chen, G.; Chen, J.; Feng, F.; Zhou, S.; He, X. Unbiased Knowledge Distillation for Recommendation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; pp. 976–984. [Google Scholar]
- Gil, Y.; Chai, Y.; Gorodissky, O.; Berant, J. White-to-black: Efficient distillation of black-box adversarial attacks. arXiv 2019, arXiv:1904.02405. [Google Scholar]
- Wang, T.; Zhu, J.Y.; Torralba, A.; Efros, A.A. Dataset distillation. arXiv 2018, arXiv:1811.10959. [Google Scholar]
- Zhao, B.; Mopuri, K.R.; Bilen, H. Dataset condensation with gradient matching. arXiv 2020, arXiv:2006.05929. [Google Scholar]
- Cazenavette, G.; Wang, T.; Torralba, A.; Efros, A.A.; Zhu, J.-Y. Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4750–4759. [Google Scholar]
- Zhao, B.; Bilen, H. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 6514–6523. [Google Scholar]
- Wang, K.; Zhao, B.; Peng, X.; Zhu, Z.; Yang, S.; Wang, S.; Huang, G.; Bilen, H.; Wang, X.; You, Y. Cafe: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 12196–12205. [Google Scholar]
- Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef]
- Rosasco, A.; Carta, A.; Cossu, A.; Lomonaco, V.; Bacciu, D. Distilled replay: Overcoming forgetting through synthetic samples. In Proceedings of the International Workshop on Continual Semi-Supervised Learning, Virtual Event, 19–20 August 2021; pp. 104–117. [Google Scholar]
- Goetz, J.; Tewari, A. Federated learning via synthetic data. arXiv 2020, arXiv:2008.04489. [Google Scholar]
- Zhou, Y.; Pu, G.; Ma, X.; Li, X.; Wu, D. Distilled one-shot federated learning. arXiv 2020, arXiv:2009.07999. [Google Scholar]
- Xiong, Y.; Wang, R.; Cheng, M.; Yu, F.; Hsieh, C.J. Feddm: Iterative distribution matching for communication-efficient federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2023; pp. 16323–16332. [Google Scholar]
- Song, R.; Liu, D.; Chen, D.Z.; Festag, A.; Trinitis, C.; Schulz, M.; Knoll, A. Federated learning via decentralized dataset distillation in resource-constrained edge environments. In Proceedings of the 2023 International Joint Conference on Neural Networks, Gold Coast, Australia, 18–23 June 2023; pp. 1–10. [Google Scholar]
- Sucholutsky, I.; Schonlau, M. Secdd: Efficient and secure method for remotely training neural networks. arXiv 2020, arXiv:2009.09155. [Google Scholar]
- Chen, D.; Kerkouche, R.; Fritz, M. Private set generation with discriminative information. Adv. Neural Inf. Process. Syst. 2022, 35, 14678–14690. [Google Scholar]
- Sachdeva, N.; Dhaliwal, M.; Wu, C.J.; McAuley, J. Infinite recommendation networks: A data-centric approach. Adv. Neural Inf. Process. Syst. 2022, 35, 31292–31305. [Google Scholar]
- Li, Y.; Li, W. Data distillation for text classification. arXiv 2021, arXiv:2104.08448. [Google Scholar]
- Li, G.; Togo, R.; Ogawa, T.; Haseyama, M. Compressed gastric image generation based on soft-label dataset distillation for medical data sharing. Comput. Methods Programs Biomed. 2022, 227, 107189. [Google Scholar] [CrossRef]
- Liao, L.X.; Chao, H.C.; Chen, M.Y. Intelligently modeling, detecting, and scheduling elephant flows in software defined energy cloud: A survey. J. Parallel Distrib. Comput. 2020, 146, 64–78. [Google Scholar] [CrossRef]
- Rezaei, S.; Liu, X. Deep learning for encrypted traffic classification: An overview. IEEE Commun. Mag. 2019, 57, 76–81. [Google Scholar]
- Tahaei, H.; Afifi, F.; Asemi, A.; Zaki, F.; Anuar, N.B. The rise of traffic classification in IoT networks: A survey. J. Netw. Comput. Appl. 2020, 154, 102538. [Google Scholar] [CrossRef]
- Gandotra, E.; Bansal, D.; Sofat, S. Malware analysis and classification: A survey. J. Inf. Secur. 2014, 2014. [Google Scholar] [CrossRef]
- Abusitta, A.; Li, M.Q.; Fung, B.C. Malware classification and composition analysis: A survey of recent developments. J. Inf. Secur. Appl. 2021, 59, 102828. [Google Scholar] [CrossRef]
- Bovenzi, G.; Yang, L.; Finamore, A.; Aceto, G.; Ciuonzo, D.; Pescape, A.; Rossi, D. A first look at class incremental learning in deep learning mobile traffic classification. arXiv 2021, arXiv:2107.04464. [Google Scholar]
- Amalapuram, S.K.; Tadwai, A.; Vinta, R.; Channappayya, S.S.; Tamma, B.R. Continual learning for anomaly based network intrusion detection. In Proceedings of the 2022 14th International Conference on COMmunication Systems & NETworkS, Bangalore, India, 4–8 January 2022; pp. 497–505. [Google Scholar]
- Wahab, O.A.; Mourad, A.; Otrok, H.; Taleb, T. Federated machine learning: Survey, multi-level classification, desirable criteria and future directions in communication and networking systems. IEEE Commun. Surv. Tutor. 2021, 23, 1342–1397. [Google Scholar] [CrossRef]
- Zennaro, F.M. Analyzing and storing network intrusion detection data using bayesian coresets: A preliminary study in offline and streaming settings. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Turin, Italy, 18–22 September 2019; pp. 208–222. [Google Scholar]
- Callado, A.; Kamienski, C.; Szabó, G.; Gero, B.P.; Kelner, J.; Fernandes, S.; Sadok, D. A survey on internet traffic identification. IEEE Commun. Surv. Tutor. 2009, 11, 37–52. [Google Scholar] [CrossRef]
- Al-Qurabat, A.K.M.; Mohammed, Z.A.; Hussein, Z.J. Data traffic management based on compression and MDL techniques for smart agriculture in IoT. Wirel. Pers. Commun. 2021, 120, 2227–2258. [Google Scholar] [CrossRef]
- Casino, F.; Choo, K.K.R.; Patsakis, C. HEDGE: Efficient traffic classification of encrypted and compressed packets. IEEE Trans. Inf. Forensics Secur. 2019, 14, 2916–2926. [Google Scholar] [CrossRef]
- Saleh, M.M.; AlSlaiman, M.; Salman, M.I.; Wang, B. Combining raw data and engineered features for optimizing encrypted and compressed internet of things traffic classification. Comput. Secur. 2023, 130, 103287. [Google Scholar] [CrossRef]
- Zhang, X.; Zhu, X.; Wang, J.; Yan, H.; Chen, H.; Bao, W. Federated learning with adaptive communication compression under dynamic bandwidth and unreliable networks. Inf. Sci. 2020, 540, 242–262. [Google Scholar] [CrossRef]
- Aceto, G.; Ciuonzo, D.; Montieri, A.; Pescapé, A. DISTILLER: Encrypted traffic classification via multimodal multitask deep learning. J. Netw. Comput. Appl. 2021, 183, 102985. [Google Scholar] [CrossRef]
- Xie, G.; Li, Q.; Dong, Y.; Duan, G.; Jiang, Y.; Duan, J. Mousika: Enable General In-Network Intelligence in Programmable Switches by Knowledge Distillation. In Proceedings of the IEEE INFOCOM 2022—IEEE Conference on Computer Communications, London, UK, 2–5 May 2022; pp. 1938–1947. [Google Scholar]
- Lu, M.; Zhou, B.; Bu, Z. Two-Stage Distillation-Aware Compressed Models for Traffic Classification. IEEE Internet Things J. 2023, 10, 14152–14166. [Google Scholar] [CrossRef]
- Kim, S.S.; Reddy, A.N. Modeling network traffic as images. In Proceedings of the IEEE International Conference on Communications, Seoul, Republic of Korea, 16–24 May 2005; pp. 168–172. [Google Scholar]
- Taheri, S.; Salem, M.; Yuan, J.S. Leveraging image representation of network traffic data and transfer learning in botnet detection. Big Data Cogn. Comput. 2018, 2, 37. [Google Scholar] [CrossRef]
- Li, F.; Ye, F. Simplifying data traffic classification with byte importance distillation. In Proceedings of the ICC 2021-IEEE International Conference on Communications, Montreal, QC, Canada, 14–23 June 2021; pp. 1–6. [Google Scholar]
- Gomez, S.E.; Hernandez-Callejo, L.; Martinez, B.C.; Sanchez-Esguevillas, A.J. Exploratory study on class imbalance and solutions for network traffic classification. Neurocomputing 2019, 343, 100–119. [Google Scholar] [CrossRef]
- Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27. [Google Scholar] [CrossRef]
- Sarfraz, F.; Arani, E.; Zonooz, B. Knowledge distillation beyond model compression. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 6136–6143. [Google Scholar]
- Fan, S.; Zhang, X.; Song, Z. Reinforced knowledge distillation: Multi-class imbalanced classifier based on policy gradient reinforcement learning. Neurocomputing 2021, 463, 422–436. [Google Scholar] [CrossRef]
- Boz, O. Extracting decision trees from trained neural networks. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, AB, Canada, 23–26 July 2002; pp. 456–461. [Google Scholar]
- Sethi, I.K. Entropy nets: From decision trees to neural networks. Proc. IEEE 1990, 78, 1605–1613. [Google Scholar]
- He, Y.; Li, W. Image-based encrypted traffic classification with convolution neural networks. In Proceedings of the 2020 IEEE Fifth International Conference on Data Science in Cyberspace, Hong Kong, China, 27–30 July 2020; pp. 271–278. [Google Scholar]
- Wang, W.; Zhu, M.; Zeng, X.; Ye, X.; Sheng, Y. Malware traffic classification using convolutional neural network for representation learning. In Proceedings of the 2017 International Conference on Information Networking, Da Nang, Vietnam, 11–13 January 2017; pp. 712–717. [Google Scholar]
- Yuan, X.; Li, C.; Li, X. DeepDefense: Identifying DDoS attack via deep learning. In Proceedings of the 2017 IEEE International Conference on Smart Computing, Hong Kong, China, 29–31 May 2017; pp. 1–8. [Google Scholar]
- Islam, A.; Debnath, A.; Ghose, M.; Chakraborty, S. A survey on task offloading in multi-access edge computing. J. Syst. Archit. 2021, 118, 102225. [Google Scholar]
| Domain | Survey | Purpose and Contents |
|---|---|---|
| Coreset selection | [7] | Methods, application, open issues |
| | [8] | Methods targeting accurate coresets |
| | [9] | Build coresets for time series |
| Data compression | [10] | Approaches and open issues |
| | [11] | Compress images for wireless sensor networks |
| | [12] | Compress videos |
| | [13] | Compress time series |
| | [14] | Two-dimensional compression algorithms |
| | [15] | Three-dimensional compression algorithms |
| Knowledge distillation | [16] | Architectures, algorithms, applications |
| | [17] | Knowledge/architectures/algorithms/applications |
| | [18] | What/who/how to distill in graphs |
| | [19] | In deep learning |
| | [20] | Graph-based distillation approaches |
| | [21] | Distill diffusion models |
| | [22] | For computer vision |
| | [23] | For the medical field |
| Dataset distillation | [24] | Applications |
| | [25] | Algorithms and performance |
| | [26] | Algorithms |
| Type | Subtype | Description |
|---|---|---|
| Data type | Weighted subset of input | Set of weighted input points |
| | Weighted subset of input space | Set of data drawn from the same ground set as the input |
| | Sketch matrices | Each point is a linear combination of the inputs |
| | Low-dimensional coresets | Data represented in a low-dimensional space |
| | Generic data structures | Hybrid of one or more of the above types |
| Query set | Strong | Approximates every query in the given query set |
| | Weak | Provides error guarantees for some, but not all, queries |
| | Sparse | Provides error guarantees for the optimal query |
| Construction | Uniform sampling | Group the inputs and sample from each group in proportion to its size (see the sketch after this table) |
| | Significance sampling | Sample from a distribution approximated by a weighted average of random draws from the input distribution |
| | Grid sampling | Split the input space into cells and take a sample from each cell |
| | Greedy construction | Repeatedly pick the next best point according to a specific criterion |
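To make the construction categories above concrete, here is a minimal Python sketch of the uniform (grouped) sampling construction: sample from each group in proportion to its size, then reweight the samples so that weighted statistics computed on the coreset approximate those of the full input. The function name, the two-group toy data, and the weighting choice are our own illustrative assumptions, not an implementation from any cited work.

```python
import numpy as np

def stratified_uniform_coreset(points, groups, budget, rng=None):
    """Hypothetical helper: weighted coreset via uniform sampling per group.

    points : (n, d) array of inputs
    groups : (n,) array of group labels (e.g., rough cluster assignments)
    budget : total number of coreset points to draw
    Each sampled point receives weight |group| / |sample from group| so that
    weighted sums over the coreset approximate per-group totals.
    """
    rng = rng or np.random.default_rng(0)
    n = len(points)
    core_idx, weights = [], []
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        # sample size proportional to the group's share of the data
        m_g = max(1, int(round(budget * len(idx) / n)))
        chosen = rng.choice(idx, size=min(m_g, len(idx)), replace=False)
        core_idx.extend(chosen)
        weights.extend([len(idx) / len(chosen)] * len(chosen))
    return points[np.array(core_idx)], np.array(weights)

# toy check: the weighted coreset mean approximates the full-data mean
X = np.random.default_rng(1).normal(size=(1000, 2))
g = (X[:, 0] > 0).astype(int)
C, w = stratified_uniform_coreset(X, g, budget=50)
print(np.average(C, axis=0, weights=w), X.mean(axis=0))
```

Significance, grid, and greedy constructions follow the same template but change how the points are selected and weighted.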
| Category | Subcategory | Description |
|---|---|---|
| Types of codes | Block–block | Both input and compressed data have fixed length |
| | Block–variable | Input has fixed length; compressed data has variable length |
| | Variable–block | Input has variable length; compressed data has fixed length |
| | Variable–variable | Both input and compressed data have variable length |
| Types of data | Text | Reduce the body of text and remove redundancy |
| | Image | Reduce image size for transmission and storage |
| | Audio | Compress music and speech; decoding intensive |
| | Video | Reduce redundancy in both spatial and temporal dimensions |
| | Model | Compress the parameters of large deep learning models |
| Code quality | Lossy | Information loss after compression |
| | Lossless | No information loss after compression |
| Coding scheme | Huffman coding | Determine minimum-cost prefix-free codes (see the sketch after this table) |
| | Arithmetic coding | Store frequently used characters in fewer bits and vice versa |
| | Dictionary-based coding | Find patterns in the input and code them against a dictionary |
| | Burrows–Wheeler transform | Permute the input so that it compresses more easily |
| | Fractal compression | Use fractals to compress digital images |
| | Wavelet transform | Use wavelets to transform time-space input into time-frequency codes |
| | Quantization | Reduce the precision of the input data type to lower computation cost |
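As a small worked example of the Huffman coding scheme listed above, the following sketch builds minimum-cost prefix-free codes by repeatedly merging the two least frequent subtrees. It is an unoptimized, illustrative implementation using only the Python standard library; the function name and the toy strings are our own.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a prefix-free code (symbol -> bit string) with minimum
    expected length for the symbol frequencies observed in `text`."""
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # heap entries: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        # prepend 0/1 to every code in the two merged subtrees
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_code("network traffic classification")
encoded = "".join(codes[ch] for ch in "traffic")
print(len(encoded), "bits vs", 8 * len("traffic"), "bits uncompressed")
```

Arithmetic and dictionary-based coders replace the per-symbol codebook with interval arithmetic or pattern dictionaries, but the goal of removing statistical redundancy is the same.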
| Category | Subcategory | Description |
|---|---|---|
| Knowledge | Response-based | Knowledge extracted from the output layer (see the loss sketch after this table) |
| | Feature-based | Knowledge extracted from intermediate layers |
| | Relation-based | Relationships between layers of models |
| Schemes | Offline | A pre-trained teacher model guides the student model |
| | Online | Teacher and student models are updated jointly online |
| | Self | The same network serves as both teacher and student |
| Algorithms | Adversarial | Use adversarially generated samples in the training set |
| | Multi-teacher | Transfer knowledge from multiple teachers |
| | Cross-modal | Transfer knowledge across multiple modalities |
| Applications | Visual recognition | Simplify models; improve accuracy and efficiency |
| | Natural language processing | Reduce model size and improve time efficiency |
| | Speech recognition | Real-time systems on embedded platforms |
| | Others | Recommendation systems and protecting models from attacks |
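For the response-based, offline scheme summarized above, the loss commonly used in practice combines cross-entropy on the hard labels with a KL-divergence term between temperature-softened teacher and student outputs, in the spirit of Hinton et al. The NumPy sketch below is our own minimal rendering of that idea; the function names, the temperature and weighting values, and the toy logits are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hard-label cross-entropy plus temperature-softened KL divergence
    between teacher and student predictions (response-based, offline KD)."""
    p_s = softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # soft-target term, scaled by T^2 to keep gradient magnitudes comparable
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=1).mean()
    # hard-label cross-entropy at temperature 1
    p_hard = softmax(student_logits, 1.0)
    ce = -np.log(p_hard[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * ce + (1.0 - alpha) * (T ** 2) * kl

rng = np.random.default_rng(0)
student, teacher = rng.normal(size=(8, 5)), rng.normal(size=(8, 5))
labels = rng.integers(0, 5, size=8)
print(distillation_loss(student, teacher, labels))
```

Feature- and relation-based variants keep the same overall structure but compare intermediate representations or inter-layer relations instead of output distributions.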
| Category | Subcategory | Description |
|---|---|---|
| Methods | Performance matching | Tune the synthetic set so that models trained on the synthetic and original sets, respectively, achieve matching performance |
| | Parameter matching | Tune the synthetic set so that models trained on the synthetic and original sets, respectively, converge to matching parameters |
| | Distribution matching | Tune the synthetic set so that the distributions of the synthetic and original sets match (see the sketch after this table) |
| Applications | Continual learning | Generate a small synthetic set that captures the knowledge of previous training sets and add it to the current training set |
| | Federated learning | Exchange synthetic sets distilled from each client's data to reduce communication cost while preserving data privacy |
| | Privacy and security | Use a synthetic set distilled from the original set to protect privacy and resist attacks |
| | Others | Use synthetic sets in recommendation, text classification, and medical systems to shrink the training set and protect data privacy |
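The distribution-matching method in the table above can be illustrated with a deliberately simplified sketch: rather than matching embeddings of randomly initialized networks, as distribution-matching methods typically do, we match mean random-ReLU features and update the synthetic points by gradient descent with an analytic gradient. All names, dimensions, step sizes, and toy data below are our own assumptions, not a published implementation.

```python
import numpy as np

def distill_by_distribution_matching(X_real, m=10, d_feat=64, steps=500, lr=0.5, seed=0):
    """Optimize m synthetic points so that their mean random-ReLU embedding
    matches the real data's mean embedding (a toy stand-in for matching
    feature distributions in dataset distillation)."""
    rng = np.random.default_rng(seed)
    n, d = X_real.shape
    W = rng.normal(size=(d_feat, d)) / np.sqrt(d)            # random embedding
    mu_real = np.maximum(W @ X_real.T, 0.0).mean(axis=1)     # target mean features
    S = X_real[rng.choice(n, size=m, replace=False)].copy()  # init from real samples
    for _ in range(steps):
        Z = W @ S.T                                          # (d_feat, m) pre-activations
        diff = np.maximum(Z, 0.0).mean(axis=1) - mu_real
        # analytic gradient of ||mu_syn - mu_real||^2 w.r.t. each synthetic point
        grad = ((Z > 0) * diff[:, None]).T @ W * (2.0 / m)   # (m, d)
        S -= lr * grad
    return S

X = np.random.default_rng(1).normal(loc=3.0, size=(2000, 8))
S = distill_by_distribution_matching(X)
print(S.mean(axis=0).round(2), X.mean(axis=0).round(2))      # roughly similar statistics
```

Performance and parameter matching follow the same outer loop but back-propagate through model training runs on the synthetic set rather than through a fixed embedding.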
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).