Article

SPinDP: A High-Speed Distributed Processing Platform for Sampling and Filtering Data Streams

Department of Computer Science and Engineering, Kangwon National University, Chuncheon 24341, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(24), 12998; https://doi.org/10.3390/app132412998
Submission received: 18 October 2023 / Revised: 25 November 2023 / Accepted: 4 December 2023 / Published: 5 December 2023
(This article belongs to the Special Issue Advances in Distributed and Parallel Big Data Processing)

Abstract

Recently, there has been explosive growth of streaming data in fields such as IoT, network attack detection, medical data monitoring, and financial trend analysis. These domains require precise and rapid analysis, which in turn requires minimizing noise in continuously generated raw data. In this paper, we propose SPinDP (Stream Purifier in Distributed Platform), an open-source-based high-speed platform that supports real-time stream purification. SPinDP consists of four major components: Data Stream Processing Engine, Purification Library, Plan Manager, and Shared Storage, and operates on open-source systems including Apache Storm and Apache Kafka. In these components, stream processing throughput and latency are critical performance metrics, and SPinDP significantly enhances distributed processing performance by utilizing ultra-high-speed RDMA (Remote Direct Memory Access) networking. For the performance evaluation, we use a distributed cluster environment consisting of nine nodes, and we show that SPinDP's stream processing throughput is more than 28 times higher than that of the existing Ethernet environment. SPinDP also reduces processing latency by a factor of more than 2473 on average. These results indicate that the proposed SPinDP is an excellent integrated platform that can efficiently purify high-speed, large-scale streams through RDMA-based distributed processing.

1. Introduction

This paper introduces an efficient method of applying data purification, a representative technique for data quality management, in a stream environment. A data stream refers to a large amount of data that is generated continuously; most of the data generated online take this form. With the recent proliferation of smart devices and IoT services, data streams are growing rapidly [1,2,3,4]. Unlike conventional bulk data, stream data become vast as they accumulate, making them even more valuable in the fields of big data and artificial intelligence. In these fields, obtaining high-quality data in various forms, including data streams, is crucial for determining service quality, so the need for related technologies, such as data quality management and data distribution platforms, has also increased. We therefore present an efficient method of applying data purification to stream data in a distributed processing environment.
Data purification is one of the representative preprocessing steps that apply missing value correction, aggregation, duplicate removal, sampling, filtering, and other techniques to data to improve quality and processing performance. Among them, missing value correction, aggregation, and duplicate removal correspond to data cleaning in structured environments such as RDBMS or NoSQL. In this paper, we focus on designing a purification platform that emphasizes sampling and filtering algorithms that can be continuously applied to data streams, rather than static and simple value corrections. Although many research studies have been conducted on these purification algorithms, there are still several problems. First, research on processing systems that can purify large-scale, high-speed, and complex data streams is very limited (Problem 1). Existing stream processing engines mainly focus on fast processing of general streams, but purification systems that focus on handling dynamic streams for actual real-time services have not yet been proposed. Second, there are no specialized systems that provide stream sampling or filtering, so users need to implement and apply the required algorithms themselves for data purification (Problem 2). Moreover, existing purification algorithms are usually based on statistical functions that require data accumulation, making them unsuitable for stream environments with rapid data changes. Third, because users need to implement the necessary algorithms for data purification, it is very difficult for non-experts to apply purification through these processes (Problem 3). Even if applicable algorithms are provided, determining which sampling or filtering to apply to data with different characteristics requires even greater expertise.
To address these problems, in this paper we present various sampling and filtering algorithms that can be applied to data streams and propose a new purification platform, called SPinDP (Stream Purifier in Distributed Platform), that can easily utilize these algorithms in a distributed environment. Figure 1 shows the conceptual structure of SPinDP. As shown in the figure, SPinDP consists of four major components: Data Stream Processing Engine, Purification Library, Plan Manager, and Shared Storage. We also utilize open-source solutions for the actual implementation of SPinDP in a distributed environment including Apache Storm [5,6], the distributed stream processing framework, and Apache Kafka [7,8], the distributed queuing system. In order to develop the functionalities of SPinDP, we first define the solutions and requirements for existing problems and then design the overall structure and essential functions. Table 1 summarizes the problems and their corresponding solutions, which are implemented by the functions of the components and integrated into the final platform of SPinDP.
To clearly connect the solutions to each problem, we now describe each component and its role in more detail. The first component, Data Stream Processing Engine, performs the distributed processing of data streams through purification algorithms and coordinates the entire system through the Plan Manager and Shared Storage. To solve Problem 1, Data Stream Processing Engine aims to provide purification algorithms that support large-scale, high-speed data stream processing. In this paper, we design this component based on RDMA-Storm [9], which is accelerated by InfiniBand RDMA (Remote Direct Memory Access) [9], and Apache Kafka [7,8] for seamless stream I/O and transmission management. The second component, Purification Library, is implemented in Storm's topology structure, and SPinDP provides a total of eight sampling and five filtering algorithms. In particular, three of these algorithms are improved for suitability in data streams through acceleration and intelligence, and the other ten algorithms have also been optimized for distributed stream processing environments using techniques such as shared memory. With the stream-specific purification algorithms provided by SPinDP, users can easily utilize more diverse purification methods, which solves Problems 2 and 3. The third component, Plan Manager, is a Web-based client that enables efficient use of SPinDP. The fourth component, Shared Storage, performs joint data storage and management among Plan Manager, Data Stream Processing Engine, and Purification Library. Based on these two components, SPinDP provides essential functions for platform users and administrators, such as defining data stream input and output methods, defining data types and schemas, creating/modifying/executing/deleting purification plans, and monitoring execution plans. Plan Manager and Shared Storage improve the accessibility and utilization of stream purification algorithms, which addresses Problem 3. They can also be seen as an indirect solution to Problem 2, as Plan Manager makes it easy for users to drag and drop a suitable combination of purification algorithms for their own data.
SPinDP can be applied in various fields, with representative use cases as follows:
  • Smart city technology: Utilized for refining smart city data collected from various sensors and IoT devices, providing precise insights into traffic, energy, the environment, and more.
  • Cyber security: Purifying stream data is important for quickly detecting anomalies in network traffic and protecting against attacks.
  • Medical monitoring: Stream purification is needed to monitor patients' vital signs in real-time and remove noise to help make accurate medical decisions.
  • Neuroscience research: Removing noise from EEG or fMRI data, which measure brain signals, allows for a more accurate analysis of brain activity.
  • Environmental surveillance: Environmental data from sensor networks are refined and utilized to derive accurate information about air quality, water quality, and more.
  • Financial trading analysis: In the stock market, sampling and filtering are used to remove noise from trading data and identify accurate trading signals.
  • Manufacturing and quality control: Refining and analyzing sensor data from the manufacturing process helps monitor quality accurately and improve production line efficiency.
As these examples show, stream purification is essential for real-time analytics. In contrast to large-scale batch analysis, even a small amount of noise in real-time stream analysis can lead to critical errors in the results. It is also very inefficient to use large amounts of raw data directly for real-time analysis. Therefore, adopting SPinDP for sampling or filtering to reduce unnecessary data, and then performing tasks such as anomaly detection and trend analysis on the refined data, proves to be effective.
The contributions of this paper can be summarized as follows. Firstly, we propose a new distributed processing platform, SPinDP, to support data stream purification, which cannot be easily achieved with existing technologies. Secondly, we design and implement a high-speed distributed cluster capable of processing data streams without latency by exploiting recent InfiniBand with RDMA technologies. Thirdly, we present new sampling and filtering algorithms suitable for the distributed processing of data streams. Fourthly, we evaluate the functionality and efficiency of SPinDP through experiments, analyze the results, and demonstrate the effectiveness of the platform.
The rest of the paper is organized as follows. Section 2 describes the research background and related work of this paper. In Section 3, we propose the overall architecture and key components of the proposed platform, SPinDP. Section 4 confirms the functionality and efficiency of SPinDP through various experiments. Finally, in Section 5, we summarize and conclude the paper.

2. Background and Related Work

In this section, we discuss the background technologies and related studies of the proposed platform. First, we outline the operational structure and characteristics of Apache Storm and Apache Kafka as stream processing frameworks, which are the bases for implementing SPinDP. Next, we describe InfiniBand, which is used for high-speed internal communication in the proposed system. Finally, we introduce the refinement algorithms that are the core of SPinDP and the characteristics of each algorithm.

2.1. Distributed Processing Framework

In this paper, we utilize two open-source solutions, Apache Storm [5,6] and Apache Kafka [7,8], to construct the proposed stream purification platform. Apache Storm is a distributed parallel processing framework for handling continuously generated stream data. The Storm cluster consists of a single master called Nimbus and multiple workers called Supervisors. This cluster generally manages and shares state information between Nimbus and Supervisors through the open-source distributed coordinator Apache Zookeeper [10,11].
To perform actual stream processing in the distributed Storm cluster structure, we need to implement algorithms following the programming model shown in Figure 2. The figure includes three logical concepts used in implementing tasks in Storm: Spout, Bolt, and Topology. Firstly, a Spout is a component that connects to the original source that generates the data stream, converts the input stream into tuples, which are the data processing units of Storm, and passes them to the next stage. Secondly, a Bolt is a component that receives tuples from a Spout or a previous Bolt and performs actual processing according to the implemented logic. In Storm, the number of parallel Bolts executing the same logic can be set during the configuration. Thirdly, a Topology, which is composed of multiple Spouts and Bolts, defines the entire operation process of Storm and includes all processes from the input to the output of the stream.
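To make this programming model concrete, the following is a minimal, hypothetical Topology written against Storm's Java API. The spout, bolt, names, and threshold logic are illustrative only (they are not part of SPinDP); the sketch shows how Spouts, Bolts, parallelism, and groupings are wired together.

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical Spout: emits one random number per tuple.
class RandomSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    public void open(Map<String, Object> conf, TopologyContext ctx, SpoutOutputCollector c) { this.collector = c; }
    public void nextTuple() { collector.emit(new Values(Math.random())); }
    public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("value")); }
}

// Hypothetical Bolt: forwards only values above a threshold (a trivial filter).
class ThresholdBolt extends BaseBasicBolt {
    public void execute(Tuple t, BasicOutputCollector c) {
        double v = t.getDoubleByField("value");
        if (v > 0.5) c.emit(new Values(v));
    }
    public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("value")); }
}

public class ExampleTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("source", new RandomSpout(), 2);   // 2 parallel Spout tasks
        builder.setBolt("filter", new ThresholdBolt(), 4)   // 4 parallel Bolt tasks
               .shuffleGrouping("source");                  // tuples routed randomly
        try (LocalCluster cluster = new LocalCluster()) {   // in-process test cluster
            cluster.submitTopology("example", new Config(), builder.createTopology());
            Thread.sleep(10_000);
        }
    }
}
```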
The second open-source framework, Apache Kafka [7,8], is a distributed message queuing system designed to reliably send and receive stream data in a large-scale stream environment. Apache Kafka consists of producers, consumers, and brokers, and like Storm, it uses Apache Zookeeper for cluster management. Producers and consumers are client applications where producers play the role of publishing messages to the Kafka cluster, and consumers play the role of subscribing to the desired data from the Kafka cluster. Brokers manage the messages sent by producers through topic channels.
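As a brief illustration of this producer and consumer model, the sketch below publishes a single message to a broker and reads it back using Kafka's Java client. The broker address, topic name, and group id are assumed values for the example.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaRoundTrip {
    public static void main(String[] args) {
        // Producer: publish one message to the (hypothetical) topic channel "raw-stream".
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("raw-stream", "sensor-1", "42.0"));
        }

        // Consumer: subscribe to the same topic and print what arrives.
        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "spindp-demo");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("auto.offset.reset", "earliest");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("raw-stream"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> r : records)
                System.out.println(r.key() + " -> " + r.value());
        }
    }
}
```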

2.2. InfiniBand

InfiniBand [12,13] is a high-performance networking technology that guarantees high bandwidth and low transmission latency compared to Ethernet. It is a standard switch fabric designed for interconnecting nodes in high-spec computing clusters such as HPC (High-Performance Computing) systems. The representative feature of InfiniBand, RDMA, is a technique for transmitting data directly from the memory of the sending server to the memory of the receiving server without passing through the operating system. RDMA communication has several advantages: (1) zero-copy, in which messages are transmitted directly to each server's memory without copying between network layers; (2) kernel bypass, which enables data transmission without kernel intervention; and (3) no CPU involvement, meaning that memory-based message transmission proceeds without consuming the CPU resources of remote systems [9]. As a result, the performance overheads of CPU-based operations such as buffer copying and context switching that occur during data transmission can be significantly reduced, leading to an overall improvement in data processing performance.
InfiniBand also provides an IPoIB (IP over InfiniBand) layer that allows devices to be accessed via IP addresses. IPoIB enables better accessibility overall as it allows the use of InfiniBand with IP, but its overall performance is lower than that of RDMA due to the use of the existing TCP/IP network stack. In this paper, we propose an efficient method for improving the performance of Apache Storm using InfiniBand and RDMA as explained so far, and compare the processing efficiency in Ethernet- and IPoIB-based environments to evaluate the performance of the proposed SPinDP.

2.3. Sampling and Filtering Algorithms

Sampling is a statistical technique that extracts some of the data representing the population. As shown in Figure 3, sampling algorithms can be divided into batch-specific sampling and data-stream-specific sampling [14]. First, batch-specific sampling includes systematic sampling, stratified sampling, and cluster sampling [15]. Second, data-stream-specific sampling includes reservoir sampling, priority sampling, hash sampling, KSample, and binary Bernoulli sampling [16,17,18]. A brief explanation of these representative sampling methods is provided as follows:
  • Systematic sampling randomly selects the first entry and then selects every k-th entry thereafter.
  • Stratified sampling divides the population into nonoverlapping layers, and then extracts samples from each layer.
  • Cluster sampling divides the population into several subclusters, and then selects some subclusters as samples.
  • Reservoir sampling selects the first k items of the data stream as samples, and probabilistically replaces previous samples using the subsequent input stream (a minimal sketch follows this list).
  • Priority sampling works similarly to reservoir sampling, but priorities are assigned according to the frequency of data occurrence; that is, more frequently occurring data are more likely to be selected.
  • Hash sampling applies a hash function to a specific field of data records, and selects a record as a sample if its hash value is greater than (or smaller than) the specified value.
  • KSample [19] is a random sampling method that dynamically increases the sample size to keep the sampling rate constant for the input data stream.
  • Binary Bernoulli sampling [20] extracts samples at the same probability when data streams come from multiple sources.
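As an illustration of the stream-specific algorithms above, the following is a minimal sketch of reservoir sampling (Vitter's Algorithm R [16]). It is a generic, single-node implementation for clarity, not SPinDP's distributed code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling (Algorithm R): maintains a uniform random sample of size k
// over a stream of unknown length, replacing earlier samples probabilistically.
public class ReservoirSampler<T> {
    private final int k;
    private final List<T> reservoir;
    private final Random rnd = new Random();
    private long seen = 0;   // number of stream items observed so far

    public ReservoirSampler(int k) {
        this.k = k;
        this.reservoir = new ArrayList<>(k);
    }

    public void offer(T item) {
        seen++;
        if (reservoir.size() < k) {
            reservoir.add(item);                       // fill with the first k items
        } else {
            long j = (long) (rnd.nextDouble() * seen); // uniform index in [0, seen)
            if (j < k) reservoir.set((int) j, item);   // replace with probability k/seen
        }
    }

    public List<T> sample() { return List.copyOf(reservoir); }
}
```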
Filtering is a technique for extracting only data satisfying a given condition (or predicate) to improve data quality or analysis speed [21]. Representative examples include query filtering [22] to extract only data satisfying a given query, Bloom filtering [15] to probabilistically determine whether data belong to a set, and Kalman filtering [23,24] to estimate the original data from data containing errors, such as sensor data. The applications of these filtering techniques to data streams can be divided into two major categories. First, the standing queries method evaluates a predefined query condition for the input data, and stores or passes the data when the condition is satisfied. Second, the ad hoc method stores input data for a predetermined time or size and evaluates the query on the stored data.
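To illustrate the standing queries method, the sketch below keeps a fixed predicate and evaluates every incoming record against it, forwarding only matching records. The record type, threshold, and downstream sink are illustrative assumptions.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

// A standing query: a fixed, pre-registered condition evaluated on each arrival.
public class StandingQueryFilter<T> {
    private final Predicate<T> condition;
    private final Consumer<T> downstream;

    public StandingQueryFilter(Predicate<T> condition, Consumer<T> downstream) {
        this.condition = condition;
        this.downstream = downstream;
    }

    public void onRecord(T record) {
        if (condition.test(record)) downstream.accept(record);  // pass matches only
    }

    public static void main(String[] args) {
        // Example: keep prices at or above a threshold, drop the rest.
        StandingQueryFilter<Double> filter =
            new StandingQueryFilter<>(price -> price >= 290_000.0, System.out::println);
        for (double price : new double[]{285_000, 291_500, 310_200, 288_900})
            filter.onRecord(price);
    }
}
```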

3. SPinDP: Stream Purifier in Distributed Platform

In this section, we propose SPinDP (Full source codes of SPinDP are available at https://github.com/dke-knu/i2am (accessed on 1 December 2023)), a distributed stream purification platform. Section 3.1 describes the overall architecture of SPinDP. Section 3.2 proposes an open-source-based Data Stream Processing Engine. Section 3.3 explains the composition of Purification Library based on sampling and filtering algorithms, and Section 3.4 presents the implementation results of Plan Manager responsible for SPinDP’s UX/UI. Finally, Section 3.5 describes Shared Storage used in SPinDP.

3.1. Overall Architecture of SPinDP

This section describes the overall structure of SPinDP and the inter-relationships among its internal components. Figure 4 shows the overall architecture of SPinDP. As shown in the figure, SPinDP can be divided into four major components: Data Stream Processing Engine, Purification Library, Plan Manager, and Shared Storage. In SPinDP, Apache Kafka is used as the pipeline for the data stream, and the input–processing–output stages are executed continuously. All the information required for each stage is input through a Web-based user interface, and this input information is transmitted to each component through Plan Manager. Metadata that need to be shared in addition to the actual data stream enter Shared Storage. The Data Stream Processing Engine obtains input/output information and purification algorithm parameters through Plan Manager. It loads the algorithm in RDMA-Storm from Purification Library and then executes it, outputting the results in the specified format.
Based on the overall architecture shown in Figure 4, each component developed in a modular way can be connected to form a unified platform through a user interface. This design allows users to selectively adopt each component according to their needs or easily add new features to the platform. In this paper, we focus on designing the entire platform for data stream purification. However, if we rebuild an algorithm library for other purposes instead of the main system’s Purification Library, the rebuilt platform can be utilized for different purposes. In addition, additional features can be easily added and integrated through Plan Manager and Shared Storage.

3.2. Data Stream Processing Engine

The first component, Data Stream Processing Engine (Data Stream Processing Engine is available at https://github.com/dke-knu/i2am/tree/master/rdma-based-storm (accessed on 1 December 2023)), is a distributed framework based on open-source software that is responsible for the actual execution of various algorithms for purifying input streams. The Data Stream Processing Engine works in three stages: stream input, purification, and output. In SPinDP, Apache Kafka is utilized for stream input and output, while Apache Storm is used for stream purification. Most existing stream processing engines execute all three stages in a single program unit, which is highly inefficient as it requires stopping the running program, manually modifying the code, and restarting it when changes are made to input or output information. Additionally, if there is no intermediate storage, streams that are input during the service interruption may be lost. Therefore, in SPinDP, Apache Kafka manages stream input and output, allowing Apache Storm to focus more on the stream purification itself. To integrate Kafka and Storm, the KafkaSpout class provided by Storm is used as input, and the final output Bolt is configured to store results in Kafka. In particular, by registering the Kafka producer keys with Storm’s Spouts, new sources can be seamlessly integrated with the producer even if the input source changes, thus providing uninterrupted service.
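A minimal sketch of this Kafka and Storm integration is shown below, using the KafkaSpout and KafkaBolt classes from Storm's storm-kafka-client module. The broker address and topic names are assumptions, and the purification Bolt that would sit between input and output is omitted for brevity.

```java
import java.util.Properties;
import org.apache.storm.kafka.bolt.KafkaBolt;
import org.apache.storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
import org.apache.storm.kafka.bolt.selector.DefaultTopicSelector;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaStormPipeline {
    public static void main(String[] args) {
        // Input: KafkaSpout subscribes to the (hypothetical) topic "raw-stream".
        KafkaSpoutConfig<String, String> spoutConf =
            KafkaSpoutConfig.builder("localhost:9092", "raw-stream").build();

        // Output: KafkaBolt publishes the tuple fields "key"/"value" to "purified-stream".
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaBolt<String, String> sink = new KafkaBolt<String, String>()
            .withProducerProperties(props)
            .withTopicSelector(new DefaultTopicSelector("purified-stream"))
            .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper<>("key", "value"));

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-in", new KafkaSpout<>(spoutConf), 2);
        // A purification Bolt (e.g., a sampler or filter) would sit between these stages;
        // here the spout feeds the sink directly for brevity.
        builder.setBolt("kafka-out", sink, 2).shuffleGrouping("kafka-in");
        // builder.createTopology() is then submitted via StormSubmitter on a real cluster.
    }
}
```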
Apache Storm, which is the basis of Data Stream Processing Engine, is already a well-made framework designed for stream environments. However, its existing structure has limitations in handling fast-paced big data streams generated from multiple input sources. Particularly, as the format of data streams has expanded from simple text to multimedia and their generation speed has become increasingly fast, there is a need for an advanced stream processing platform that can operate robustly in such changing environments. Based on this observation, in this paper we propose a redesign of Apache Storm by applying two performance enhancement approaches, infrastructure-based and process-grouping-based, to create a stream processing engine that can be utilized for the proposed SPinDP.
Infrastructure-based performance enhancement: In the existing Storm, the Netty library is used for asynchronous input/output communication between nodes via TCP/IP. The threads responsible for sending and receiving data between Storm workers repeat the process of being activated by data transmission based on Netty and then waiting after the completion of data transmission. Context switching of threads that occur during this execution–waiting process is carried out through the CPU, and, accordingly, the CPU overhead increases as the amount of communication between workers increases. Additionally, during the process of transmitting data based on TCP/IP, frequent buffer copying within the operating system consumes a significant amount of CPU resources.
To address this issue, SPinDP introduces RDMA-Storm [9], which applies InfiniBand-based RDMA technology to network communication in Apache Storm, improving the performance of the proposed Data Stream Processing Engine. RDMA-Storm is a new framework that combines Apache Storm with an efficient method of directly accessing memory on other nodes without burdening the CPU by transmitting data in InfiniBand environments. By using RDMA, the network adapter can read and write data directly from memory, thereby eliminating the buffer copying process required by TCP/IP. This solution resolves the issue of CPU overhead and can also improve the overall communication performance of the entire cluster.
Figure 5 represents the final class structure of RDMA-Storm. As shown in the figure, RDMA-Storm modifies and extends the legacy classes to use both Netty- and RDMA-based communications. Based on the newly improved class structure, TCP/IP (Netty) or RDMA communication can be set in the actual Storm configuration, and data transmission between workers is performed through each class according to the corresponding setting. The SPinDP proposed in this paper also utilizes this structure of RDMA-Storm in its Purification Library.
Process-grouping-based performance enhancement: Apache Storm generally uses grouping techniques to determine which node to send data to during stream processing. Storm provides various grouping methods, and our SPinDP introduces locality-aware dynamic grouping [25] to address the issues of shuffle grouping and static grouping.
Shuffle grouping randomly determines the object to receive and send data, but it does not consider the physical distance between nodes or the network conditions. Therefore, network latency can be long even when distributing the same amount of data. To alleviate this issue, Storm supports local-shuffle grouping, which minimizes the latency by processing data on the same node when the object to receive the data is on the same node during shuffle grouping. However, local-shuffle grouping only considers locality, which causes the problem of object transmission imbalance. For example, if the number of Spouts (sending objects) per node is different, the Bolts (receiving objects) on nodes with no sending objects become idle as they have no data to process. On the other hand, if the number of receiving objects is imbalanced among nodes with the same number of sending objects, the nodes with fewer objects may incur severe overload.
To address this grouping issue, the proposed SPinDP utilizes locality-aware dynamic grouping [25]. This method calculates the distance between each pair of nodes through ping operations, sets weights based on the calculated distances, and determines the nodes to receive data based on those weights. The weight $Close(N_i, N_j)$ is calculated using Equation (1), where $n$ is the number of nodes, $N_i$ is each node, $Task(N_i)$ is the number of tasks (objects) on node $N_i$, and $Ping(N_i, N_j)$ is the ping time from node $N_i$ to $N_j$. In the end, the more objects there are on a node and the shorter the ping response time, the greater the weight assigned to that node [25].

$$Close(N_i, N_j) = \frac{Task(N_j)}{Ping(N_i, N_j) \times TotalPing(N_i)}, \quad \text{where } TotalPing(N_i) = \sum_{j=1}^{n} \frac{Task(N_j)}{Ping(N_i, N_j)} \qquad (1)$$
However, if grouping is based only on weights, it is difficult to solve the load balancing problem caused by the sender–receiver object imbalance. This is because selecting a receiving node based only on weight can lead to traffic congestion in certain nodes, causing another problem. Therefore, in the final node selection, we use a random number along with the calculated weights to probabilistically select other nodes appropriately, distributing traffic to prevent idle or overloaded states in specific nodes. As a result, with the locality-aware dynamic grouping supported by SPinDP, more streams can be sent to nearby nodes to minimize the processing latency, and stable stream processing can be provided by distributing traffic probabilistically to prevent overload in some objects.
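The sketch below shows one way to compute the weights of Equation (1) and perform the probabilistic node selection described above. It is a simplified illustration of the idea, not the implementation of [25]; the task counts and ping times are assumed to be measured elsewhere.

```java
import java.util.Random;

// tasks[j] = Task(N_j); ping[j] = Ping(N_i, N_j) measured from the current node i.
public class LocalityAwareGrouping {
    private final Random rnd = new Random();

    // Close(N_i, N_j) for every candidate node j; the weights sum to 1 by construction.
    double[] closeness(double[] tasks, double[] ping) {
        int n = tasks.length;
        double totalPing = 0;   // TotalPing(N_i) = sum_j Task(N_j) / Ping(N_i, N_j)
        for (int j = 0; j < n; j++) totalPing += tasks[j] / ping[j];
        double[] w = new double[n];
        for (int j = 0; j < n; j++) w[j] = tasks[j] / (ping[j] * totalPing);
        return w;
    }

    // Weighted random choice: nearby, task-rich nodes are chosen more often,
    // but every node keeps a nonzero chance, which spreads the load.
    int selectNode(double[] weights) {
        double r = rnd.nextDouble(), acc = 0;
        for (int j = 0; j < weights.length; j++) {
            acc += weights[j];
            if (r < acc) return j;
        }
        return weights.length - 1;
    }
}
```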
The newly implemented Data Stream Processing Engine, which has been improved through infrastructure-based and process-grouping-based performance enhancement, is utilized as a core technology of the proposed platform along with Apache Kafka. The Purification Library, Plan Manager, and Shared Storage, which will be explained later, also operate around this Data Stream Processing Engine, and their performance will be evaluated in Section 4.

3.3. Purification Library

The second component, Purification Library (Purification Library is available at https://github.com/dke-knu/i2am/tree/master/i2am-core (accessed on 1 December 2023)), is composed of sampling and filtering algorithms specialized for data streams. To design Purification Library, we first analyzed existing algorithms that are applicable to data stream environments and then developed new sampling and filtering algorithms specifically for data streams. In this section, we provide a detailed description of Purification Library, which is built from these enhanced algorithms.
The Purification Library consists of a set of Apache Storm Topologies. Each Topology is composed of sampling and filtering algorithms optimized for data streams, and each algorithm can be executed individually or combined with other algorithms through the unified interface of the user authoring tool. SPinDP provides eight sampling algorithms, i.e., hash [14], systematic [15], reservoir [16], priority [17], KSample [19], UC KSample [26], binary Bernoulli sampling [20], and distributed binary Bernoulli sampling [27], and five filtering algorithms, including query filtering [21], Bloom filtering [22], Kalman filtering [23], and noise recommendation Kalman filtering [24]. Each algorithm is developed through the Topology-based design of Spouts and Bolts, which is optimized for stream environments and adapted to operate on Apache Storm and Kafka.
In particular, UC KSample [26], distributed binary Bernoulli sampling [27], and noise recommendation Kalman filtering [24] are newly developed algorithms aimed at improving purification effectiveness and algorithm usability. Firstly, UC KSample aims to improve the reliability and accuracy of the sampling results. Secondly, distributed binary Bernoulli sampling aims to improve processing efficiency to support large-scale data streams. Finally, noise recommendation Kalman filtering aims to improve algorithm usability, making filtering accessible to non-experts and improving its utility. In SPinDP, these algorithms are implemented in the Apache Storm environment to provide a more advanced stream purification technology.

3.4. Plan Manager

The third component, Plan Manager (Plan Manager is available at https://github.com/dke-knu/i2am/tree/master/i2am-core/i2am-plan-manager (accessed on 1 December 2023)), and the fourth component, Shared Storage, both serve functions for metadata sharing and execution management among the components. SPinDP aims for a modular structure that separates each function. However, this structure can make each core component seem like a separate technology, so we need to provide an integral functionality that organically connects them to operate as a single platform. Plan Manager and Shared Storage play an important role in integrating the modularized functions into a unified platform. By integrating these components, SPinDP can prevent issues that frequently arise in existing purification methods, where frequent algorithm modifications, framework-specific settings, and code redistribution result in input stream loss.
The Plan Manager [28] is a UX/UI-based technology introduced to manage data transmission and integration between the main system, expansion system, and integrated interface. This technology supports the management of sampling and filtering processes between Data Stream Processing Engine and Purification Library through plans that define every stage of stream processing, as well as by monitoring the algorithm processing status. In SPinDP, a plan is divided into three stages: source (input), Topology (algorithm), and destination (output), with one source and one destination each, and multiple Topologies per single source.
Figure 6 shows the overall structure of Plan Manager [28]. As shown in the figure, it consists of a command server for RPC communication and a handler layer for plan management. When a user performs operations such as creating, deleting, executing, or stopping a plan through the Web-based integrated interface, the information is stored in JSON format and transmitted to the command server of Plan Manager. The command server distinguishes the received JSON file by command type and calls the specified handler. Following this, the called handler performs the operation according to the input command. At this time, the input command and the various information sent to each handler are stored in Shared Storage through the handler's DB Adapter for monitoring, purification algorithm processing, and so on.
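For illustration, a plan-creation command sent to the command server might take a shape like the following JSON. The field names and values here are hypothetical, chosen only to show the source, Topology, and destination structure of a plan; they are not the actual SPinDP message schema.

```json
{
  "command": "createPlan",
  "plan": {
    "name": "finance-sampling",
    "source": { "type": "kafka", "topic": "raw-stream" },
    "topologies": [
      { "algorithm": "reservoir-sampling", "params": { "sampleSize": 1000 } }
    ],
    "destination": { "type": "kafka", "topic": "purified-stream" }
  }
}
```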

3.5. Shared Storage

The fourth component, Shared Storage, consists of a relational database and an in-memory database. Various information generated within the SPinDP platform must be shareable among the components as needed. Therefore, in this paper, we classify the information stored in each database based on the importance, processing, and transmission speed of the information.
In SPinDP, MariaDB [29] is used as a relational database, and Redis [30] is used as an in-memory database. MariaDB is used to store and manage metadata such as user information, source, plan, destination names, and creation dates for the operation of the integrated interface. The information stored in MariaDB is used to operate and manage Web-based user interfaces in conjunction with Plan Manager. Redis, an in-memory database, is mainly used to store the variables or intermediate operation results of sampling and filtering algorithms implemented in Apache Storm. Since SPinDP is based on stream processing, the processing results also change very quickly. Therefore, if the necessary variables or data shared during algorithm processing are managed in disk-based databases, there is a high possibility of additional latencies in reading and writing; therefore, we adopt an in-memory database to manage such intermediate results.
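As a simple illustration of this usage, the sketch below shares a sampler's intermediate state through Redis using the Jedis client. The key names, values, and endpoint are assumptions for the example; the point is that atomic, in-memory operations let parallel Bolts read and update shared state with minimal latency.

```java
import redis.clients.jedis.Jedis;

// Sketch: sharing a sampler's intermediate state (items seen, current sampling
// rate) through Redis so that parallel Bolts observe consistent values.
public class SharedSamplerState {
    public static void main(String[] args) {
        try (Jedis redis = new Jedis("localhost", 6379)) {    // assumed Redis endpoint
            long seen = redis.incr("sampler:finance:seen");   // atomic counter across Bolts
            redis.set("sampler:finance:rate", "0.01");        // shared sampling rate
            String rate = redis.get("sampler:finance:rate");
            System.out.println("seen=" + seen + ", rate=" + rate);
        }
    }
}
```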

4. Experimental Evaluation

4.1. Experimental Methods and Environment

In this section, we conduct an experimental evaluation to verify the functionality and efficiency of the implemented SPinDP platform. SPinDP is designed primarily for a distributed cluster environment, as shown in Figure 7. For the experimental data, we used virtual stream data consisting of randomly generated numbers (integers and real numbers) and text (1 KB size array) that were collected and generated internally for SPinDP platform operations.
The hardware and software environments used in our experiments are shown in Table 2. As shown in the table, a distributed cluster consisting of one master node and eight worker nodes is the core hardware environment of SPinDP. The distributed cluster consists of a main system based on Apache Storm and Apache Kafka, and a Web server for Plan Manager. We also built a separate deep learning environment for training the models used by the intelligent algorithms in Purification Library, and the inference models trained on that server run in the same distributed environment as the main system.
To evaluate the functionality of SPinDP, we verify whether the sampling and filtering algorithms supported by Purification Library effectively refine real data. The criteria for measuring the purification results of sampling and filtering differ greatly depending on the application or environment in which the purified data will be used. Thus, in this paper, we evaluate the proper operation of sampling in terms of whether it reflects the pattern of the original data, and filtering in terms of whether it purifies the data according to the conditions. Each experiment utilizes Plan Manager, and Figure 8 provides an example screen of the Web interface used in the experiment. For sampling evaluation, we use reservoir, priority, hash, systematic, KSample, UC KSample, and binary Bernoulli sampling, and represent the original and sampled streams graphically to see whether each algorithm captures the patterns in the original. As for filtering, since each algorithm has different conditions and objectives, it is more challenging to determine the effectiveness of the purification; therefore, we evaluate query filtering, which allows the purification results to be intuitively verified. The original data for each experiment consist of Apple finance data (2010–2018) and cryptocurrency market data (July 2017). The Apple finance data consist of daily closing prices over a period of nine years, comprising approximately 3280 tuples. For the cryptocurrency market data, we utilized six sets of 1440 tuples each, representing the per-minute prices of Bitcoin measured throughout a day.
To evaluate the efficiency of SPinDP, we consider three Storm configurations: Ethernet-based, IPoIB-based, and RDMA-based. In the experiments, we rapidly generate input data consisting of random integers, real numbers, and text combined into 1 KB-sized virtual streams without any inter-arrival delay to verify the distributed processing performance of SPinDP. For each configuration, we run the purification algorithms and compare the stream processing throughput and latency. To obtain the maximum performance of SPinDP, we apply RDMA-based Storm and locality-aware dynamic grouping together. As performance comparison targets, the existing Ethernet-based and IPoIB-based Storms use the same hardware environment and shuffle grouping. In each environment, experiments are performed for the purification algorithms KSample, systematic sampling, query filtering, and Bloom filtering, and each algorithm is implemented in a Storm Topology form like that shown in Figure 2. The stream processing throughput and latency are measured for five minutes.
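For reference, throughput can be measured with a terminal Bolt of the following form. This is a simplified sketch rather than the actual measurement code used in our experiments; latency could be measured analogously by attaching an emit timestamp to each tuple at the Spout and computing the difference on arrival.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

// Measurement Bolt: counts processed tuples and reports throughput once per second.
public class ThroughputBolt extends BaseRichBolt {
    private final AtomicLong count = new AtomicLong();
    private transient OutputCollector collector;

    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector c) {
        this.collector = c;
        Thread reporter = new Thread(() -> {
            while (true) {
                try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
                System.out.println("tuples/s: " + count.getAndSet(0));  // reset each second
            }
        });
        reporter.setDaemon(true);
        reporter.start();
    }

    public void execute(Tuple input) {
        count.incrementAndGet();   // one tuple processed
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer d) { }  // terminal Bolt, no output
}
```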

4.2. Evaluation of Purification Functionality

The functionality evaluation of SPinDP is divided into sampling and filtering. First, Figure 9 shows the results of the original data and the samples extracted by each sampling algorithm. In the figure, the top left is the original Apple finance data, and the rest are the results of the samples extracted by each sampling algorithm. Through these graphs, we can intuitively observe that the trends and shapes of the original and the samples are very similar. This means that the sampling algorithms supported by SPinDP work correctly in purifying the real data stream.
The evaluation of filtering functionality is verified through query filtering, which refines values below 290,000 KRW from the cryptocurrency market stream. Figure 10 shows the query filtering results on a day with distinctive patterns from the daily 1-min price data. In the figure, the blue line is the original stream, and the orange line is the filtered stream. Based on the experimental results, we can see that query filtering accurately refines the values below 290,000 KRW; in other words, values in the source data below the 290,000 KRW threshold are removed by query filtering. This result means that query filtering, a simple but precise method to obtain only the data users want, works correctly in the stream environment built by SPinDP.
Figure 11 shows the data processing throughput (i.e., tuple processing throughput) measured during the execution of the purification algorithms. As shown in the graph, for every purification algorithm the throughput increases in the order of Ethernet, IPoIB, and RDMA. In particular, considering that the vertical axis of the graph is logarithmic, we can observe a significant performance difference between the network environments. In the throughput comparison experiment, we confirmed that the performance of RDMA improved by an average of 12.6 times compared to IPoIB and up to 28.0 times compared to Ethernet. In the case of Bloom filtering, the throughput of RDMA improved by up to 27.6 times compared to IPoIB and 66.5 times compared to Ethernet. This is because the Bloom filtering Topology is more complex than those of the other purification algorithms, leading to an increase in the number of tuple transmissions between tasks. As the number of tuple transmissions between tasks increases, the external network usage also increases, ultimately resulting in different overall throughput depending on the network environment.
The results of the latency experiments, shown in Figure 12, also demonstrate that RDMA is faster than both IPoIB and Ethernet. As shown in the figure, the latency of RDMA is on average 2819.4 times lower than that of IPoIB and 2473.7 times lower than that of Ethernet. This is a much larger difference than the improvement in tuple processing throughput, and there are two major reasons for this. First, InfiniBand provides roughly 60 times the bandwidth of Ethernet, and in the RDMA-Storm environment, locality-aware dynamic grouping is adopted to minimize communication with other nodes and to maximize processing efficiency. Moreover, even when communication with other nodes is required, RDMA can read and process tuples directly from memory without going through the operating system, which likely contributed to the reduction in latency. Second, bottlenecks are inevitable in Topology structures, but with RDMA and locality-aware dynamic grouping, such bottlenecks can be addressed more efficiently. In addition, since the input stream arrives continuously without delay, processing latency can accumulate in the event of a bottleneck, resulting in the performance differences observed in this experiment. In the latency experiment results, all algorithms except Bloom filtering show that IPoIB is slower than Ethernet. This is likely because, in a high-bandwidth InfiniBand environment, additional latency can occur due to CPU load and buffer copying when data are transmitted through the operating system, unlike with RDMA.
SPinDP's performance evaluation results require some care in interpretation. First, the data we used for the performance evaluation are 1 KB tuples, and the results may vary depending on the data size; experimental results related to this aspect can be found in [9]. Second, the input rate of the original data may also affect the experimental results. We used a constant, maximum-rate stream as input to verify the peak performance of RDMA; if the interval between stream tuples is larger than one second, the results may differ. Third, depending on the hardware environment and the algorithms applied to SPinDP, issues such as stream explosion and bottlenecks may occur. We did not observe any bottlenecks in our experiments due to the high efficiency of SPinDP, but it is important to note that an increased number of high-speed stream sources may lead to such problems.

5. Conclusions

In this paper, we proposed an open-source-based high-speed distributed purification platform, SPinDP, which supported efficient sampling and filtering in constantly generated stream environments. To design SPinDP, we first identified three problems with existing purification technologies and derived major components to address these problems. We then subdivided the roles of each component and designed their functions over Apache Storm and Apache Kafka. SPinDP featured RDMA-Storm and locality-aware dynamic grouping and consisted of four major components: Data Stream Processing Engine, Purification Library, Plan Manager, and Shared Storage. In particular, we aimed to increase the platform’s usability with novel purification algorithms for both expert and non-expert users. Finally, we confirmed the functionality and efficiency of the proposed SPinDP through experiments on a real distributed cluster. The results demonstrated that SPinDP correctly performed purification functions for sampling and filtering and significantly improved the performance compared to not only conventional Ethernet but also recent IPoIB. Based on these results, we believe that SPinDP is a new attempt to integrate and expand existing sampling and filtering algorithms and provides stable and high-level purification of large-scale streams. As future work, we plan to enhance SPinDP to not only support real-time streaming but also to handle large-scale bulk data processing. Additionally, we will explore the utilization of the improved SPinDP within the Ray framework instead of the Storm framework for more effective handling of large-sized stream data.

Author Contributions

Conceptualization, formal analysis, software, and writing—original draft preparation and editing, M.-S.G.; conceptualization, formal analysis, writing—review and editing, and validation, Y.-S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2021-0-00859, Development of A Distributed Graph DBMS for Intelligent Processing of Big Graphs) and a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. NRF-2022R1A2C1003067).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors thank Siwoon Son, Seokwoo Yang, Wonhyeong Cho, Hajin Kim, Sebin Park, and Youngkuk Kim, for their technical research and development contributions to this project.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  1. Bahri, M.; Bifet, A.; Gama, J.; Gomes, H.M.; Maniu, S. Data Stream Analysis: Foundations, Major Tasks and Tools. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2021, 11, e1405. [Google Scholar] [CrossRef]
  2. Cardellini, V.; Presti, F.L.; Nardelli, M.; Russo, G.R. Runtime Adaptation of Data Stream Processing Systems: The State of the Art. ACM Comput. Surv. 2022, 54, 1–36. [Google Scholar] [CrossRef]
  3. Abbasi, A.; Javed, A.R.; Chakraborty, C.; Nebhen, J.; Zehra, W.; Jalil, Z. ElStream: An Ensemble Learning Approach for Concept Drift Detection in Dynamic Social Big Data Stream Learning. IEEE Access 2021, 9, 66408–66419. [Google Scholar] [CrossRef]
  4. Herodotou, H.; Odysseos, L.; Chen, Y.; Lu, J. Automatic Performance Tuning for Distributed Data Stream Processing Systems. In Proceedings of the 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022; pp. 3194–3197. [Google Scholar]
  5. Apache Storm. Available online: http://storm.apache.org/ (accessed on 1 December 2023).
  6. Toshniwal, A.; Taneja, S.; Shukla, A.; Ramasamy, K.; Patel, J.M.; Kulkarni, S.; Jackson, J.; Gade, K.; Fu, M.; Donham, J.; et al. Storm@Twitter. In Proceedings of the International Conference on Management of Data, ACM SIGMOD, Snowbird, UT, USA, 22–27 June 2014; pp. 147–156. [Google Scholar]
  7. Apache Kafka. Available online: http://kafka.apache.org/ (accessed on 1 December 2023).
  8. Kreps, J.; Narkhede, N.; Jun, R. Kafka: A Distributed Messaging System for Log Processing. In Proceedings of the NetDB, Athens, Greece, 12 June 2011; pp. 1–7. [Google Scholar]
  9. Yang, S.; Son, S.; Choi, M.-J.; Moon, Y.-S. Performance Improvement of Apache Storm using InfiniBand RDMA. J. Supercomput. 2019, 75, 6804–6830. [Google Scholar] [CrossRef]
  10. Apache Zookeeper. Available online: http://zookeeper.apache.org/ (accessed on 1 December 2023).
  11. Ekpe, O.; Kwabena, P.M. Availability of Jobtracker Machine in Hadoop/Mapreduce Zookeeper Coordinated Clusters. Adv. Comput. 2012, 3, 19–30. [Google Scholar]
  12. MacArthur, P.; Liu, Q.; Russell, R.D.; Mizero, F.; Veeraraghavan, M.; Dennis, J.M. An Integrated Tutorial on InfiniBand, Verbs, and MPI. IEEE Commun. Surv. Tutorials 2017, 19, 2894–2926. [Google Scholar] [CrossRef]
  13. Shpigelman, Y.; Shainer, G.; Graham, R.; Qin, Y.; Cisneros-Stoianowski, G.; Stunkel, C. NVIDIA’s Quantum InfiniBand Network Congestion Control Technology and Its Impact on Application Performance. In Proceedings of the High Performance Computing: 37th International Conference, Hamburg, Germany, 29 May–2 June 2022; pp. 26–43. [Google Scholar]
  14. Haas, P.J. Data-Stream Sampling: Basic Techniques and Results. In Data Stream Management: Processing High-Speed Data Streams; Springer: Berlin/Heidelberg, Germany, 2016; pp. 13–44. [Google Scholar]
  15. Cochran, W.G. Sampling Techniques, 3rd ed.; Wiley: Hoboken, NJ, USA, 1977. [Google Scholar]
  16. Vitter, J.S. Random Sampling with a Reservoir. ACM Trans. Math. Softw. 1985, 11, 37–57. [Google Scholar] [CrossRef]
  17. Sibai, R.E.; Chabchoub, Y.; Demerjian, J.; Kazi-Aoul, Z.; Barbar, K. Sampling Algorithms in Data Stream Environments. In Proceedings of the International Conference on Digital Economy (ICDEc), Carthage, Tunisia, 28–30 April 2016; pp. 29–36. [Google Scholar]
  18. Cohen, E. Stream Sampling for Frequency Cap Statistics. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 159–168. [Google Scholar]
  19. Kepe, T.R.; de Almeida, E.C.; Cerqueus, T. KSample: Dynamic Sampling Over Unbounded Data Streams. J. Inf. Data Manag. 2015, 6, 32–47. [Google Scholar]
  20. Cormode, G.; Muthukrishnan, S.; Yi, K.; Zhang, Q. Optimal Sampling from Distributed Streams. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database Systems, Indianapolis, IN, USA, 6–11 June 2010; pp. 77–86. [Google Scholar]
  21. Cheng, R.; Kao, R.B.; Kwan, A.; Prabhakar, S.; Tu, Y. Filtering Data Streams for Entity-based Continuous Queries. IEEE Trans. Knowl. Data Eng. 2010, 22, 234–248. [Google Scholar] [CrossRef]
  22. Shin, J.; Eom, S.; Lee, K.H. Q-ASSF: Query-adaptive Semantic Stream Filtering. In Proceedings of the 9th International Conference on Semantic Computing, Anaheim, CA, USA, 7–9 February 2015; pp. 101–108. [Google Scholar]
  23. Olfati-Saber, R. Distributed Kalman Filtering for Sensor Networks. In Proceedings of the IEEE Conference on Decision and Control, New Orleans, LA, USA, 12–14 December 2007; pp. 5492–5498. [Google Scholar]
  24. Park, S.; Gil, M.-S.; Im, H.; Moon, Y.-S. Measurement Noise Recommendation for Efficient Kalman Filtering over A Large Amount of Sensor Data. Sensors 2019, 19, 1168. [Google Scholar] [CrossRef] [PubMed]
  25. Son, S.; Moon, Y.-S. Locality/Fairness-Aware Job Scheduling in Distributed Stream Processing Engines. Electronics 2020, 9, 1857. [Google Scholar] [CrossRef]
  26. Kim, H.; Gil, M.-S.; Moon, Y.-S.; Choi, M.-J. Variable Size Sampling to Support High Uniformity Confidence in Sensor Data Streams. Int. J. Distrib. Sens. Netw. 2018, 14, 1550147718773999. [Google Scholar] [CrossRef]
  27. Cho, W.; Gil, M.-S.; Choi, M.-J.; Moon, Y.-S. Storm-based Distributed Sampling System for Multi-source Stream Environment. Int. J. Distrib. Sens. Netw. 2018, 14, 1550147718812698. [Google Scholar] [CrossRef]
  28. Kim, Y.; Son, S.; Moon, Y.-S. SPMgr: Dynamic Workflow Manager for Sampling and Filtering Data Streams over Apache Storm. Int. J. Distrib. Sens. Netw. 2019, 15, 1550147719862206. [Google Scholar] [CrossRef]
  29. MariaDB. Available online: https://mariadb.org/ (accessed on 1 December 2023).
  30. Redis. Available online: http://redis.io/ (accessed on 1 December 2023).
Figure 1. Conceptual structure with four major components in SPinDP.
Figure 2. An example structure of Apache Storm's programming model.
Figure 3. Taxonomy of sampling algorithms by domain.
Figure 4. Overall architecture of SPinDP.
Figure 5. Class structure of RDMA-Storm.
Figure 6. Overall architecture of Plan Manager.
Figure 7. Cluster diagram of SPinDP.
Figure 8. An example screen of Plan Manager.
Figure 9. Apple finance data sampling results for each algorithm.
Figure 10. Query filtering results of cryptocurrency market stream.
Figure 11. Data stream throughput comparison results.
Figure 12. Data stream processing latency comparison result.
Table 1. The problems and their corresponding solutions with SPinDP components.

| Related Problems | Solutions |
| --- | --- |
| Problem 1 | (Data Stream Processing Engine) Design of a new high-speed distributed parallel processing engine based on ultra-fast networking technology |
| Problems 2 and 3 | (Purification Library) Development of stream sampling and filtering algorithms for supporting multiple and complex data streams |
| Problem 3 (partially Problem 2) | (Plan Manager & Shared Storage) Support for Web-based user interface tools for both expert and non-expert users |
Table 2. Hardware and software specifications for core components.

| Component | Software | Hardware |
| --- | --- | --- |
| Data Stream Processing Engine / Purification Library | OS: CentOS 7; Stream processing: Apache Storm 2.0.0; Queueing: Apache Kafka 2.1.1; Coordinator: Apache Zookeeper 3.4.10; RDMA library: JXIO 1.0.3 | Master (1 node): Intel E5-2630V3 2.4 GHz 8-core, 32 GB RAM, 256 GB SSD, 1 TB HDD. Worker (8 nodes): Intel E5-2620V3 2.4 GHz 6-core, 32 GB RAM, 256 GB SSD, 1 TB HDD. Ethernet: ipTIME T16000 (1 Gbps LAN). InfiniBand: Mellanox SwitchX-3 MSX6012F-1BFS Managed FDR 56 Gbps |
| Shared Storage | RDBMS: MariaDB 10.0; In-memory DB: Redis 5.0 | |
| Plan Manager | Server: Apache Tomcat 9.0; Client: Java 8.0; Languages: JSP, HTML, CSS, JavaScript; Libraries: Ajax, jQuery, jsPlumb, QueryBuilder | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
