Article

Macro SOStream: An Evolving Algorithm to Self Organizing Density-Based Clustering with Micro and Macroclusters

by Andressa Stéfany Oliveira 1,*, Rute Souza de Abreu 2,* and Luiz Affonso Guedes 2,*

1 Computer and Software Engineering Department, Polytechnique Montreal, Montreal, QC H3T 1J4, Canada
2 Department of Computer Engineering and Automation (DCA), Campus Universitário, Federal University of Rio Grande do Norte (UFRN), Natal 59077-080, RN, Brazil
* Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(14), 7161; https://doi.org/10.3390/app12147161
Submission received: 19 June 2022 / Revised: 1 July 2022 / Accepted: 4 July 2022 / Published: 15 July 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract: This paper proposes a new evolving algorithm named Macro SOStream, with entirely online learning, based on self-organizing density for data stream clustering. Macro SOStream builds on the SOStream algorithm but incorporates macroclusters composed of microclusters. While microclusters have spherical shapes, macroclusters can assume arbitrary shapes. Moreover, Macro SOStream includes a macrocluster merge functionality specially designed to improve its performance under data drift. To validate our proposal, Macro SOStream's performance is compared to that of the SOStream and DenStream algorithms on four synthetic datasets using the ARI performance metric. Furthermore, we carry out an exhaustive analysis of the influence of adequate hyperparameter setup on these algorithms' performance. As a result, Macro SOStream presents good performance, mainly in the context of data drift and for demands of non-spherical clusters.

1. Introduction

Nowadays, many real-world problems are solved using machine learning (ML) techniques, such as detecting fraudulent use of credit cards, detecting and diagnosing faults in industrial processes, ensuring data interoperability in IoT environments, and suggesting new products to customers [1,2,3,4]. According to the adopted learning approach, ML models can be classified as supervised, unsupervised, semi-supervised, or reinforcement learning models.
Data clustering is a very relevant application area of unsupervised learning: model training requires no previous knowledge about the output data, only the input data [5]. More specifically, clustering, or cluster analysis, is the set of data mining techniques that aims to group data automatically according to their degree of similarity. The similarity criterion is part of the problem definition and depends on the algorithm. Each group of data resulting from this process is called a cluster.
Traditionally, data clustering algorithms use offline learning strategies, which need prior knowledge of the data and must store all the data necessary for training the model. When there is a significant change in the data pattern, the model must be retrained (recalibrated), requiring a new dataset for the new training [6]. However, for problems such as TCP/IP traffic, e-commerce, and industrial plant monitoring, which generate data streams continuously, purely offline learning strategies become inadequate because the statistical distributions of these data can change significantly over time. Moreover, training the model using techniques traditionally designed to work offline is not viable, as such data grow indefinitely, and storing them would require much memory [6,7].
On the other hand, some data clustering algorithms use online learning strategies and can dynamically adapt to changes in the environment (data drift), since they keep their learning mechanisms active throughout their use. These algorithms do not need the entire dataset available a priori because processing takes place sample by sample in a serial manner [6]. For this reason, online grouping methods use characteristics such as the volume, density, shape, or temporal order of the data to form the groups, as it is impracticable to store all data [7]. Thus, algorithms and systems that work online are more viable in environments with limited memory, a single pass over the data, and real-time response requirements [7].
Moreover, algorithms need to have the ability to dynamically evolve the internal structure of the model because, in such a scenario, data are generally not stationary. That is, there may be a change in the data pattern or data deviation caused, for example, by variation in the quality of the raw material in oil refining, and the lighting of image processing [8,9].
In a data clustering context, algorithms that can adapt their cluster parameters (such as center position and density) to data drift are considered adaptive algorithms. Algorithms that can adapt not only to data drift but also to the emergence of new clusters and the elimination of inactive clusters (data evolution) are considered evolving algorithms [8,9]. Thus, evolving algorithms can develop and update themselves in unfamiliar environments and have no problems with non-stationary data [8,9,10]. Some evolving clustering algorithms proposed in the literature are CluStream [11], DenStream [12], C-DenStream [13], SDStream [14], SOStream [15], Online Elliptical Clustering (OEC) [16], and Auto-Cloud [17].
In this paper, we propose a new evolving clustering algorithm named Macro SOStream. Macro SOStream is based on SOStream [15] and uses entirely online learning based on self-organizing density for data stream clustering. Therefore, our contributions are:
  • The creation of macroclusters with arbitrary shapes. The difference between SOStream and Macro SOStream is the creation of macroclusters from microclusters, where this process takes place entirely online. Unlike microclusters, macroclusters can assume arbitrary shapes.
  • A merging mechanism for macroclusters to improve their performance and robustness to noisy data.
The primary motivation of the proposal is to improve the SOStream algorithm. Thus, the proposal extends the microcluster technique already used in SOStream to the creation of macroclusters. The significant difference between SOStream and other density-based evolving algorithms is the technique used to perform clustering on the data stream: SOStream is based on a self-organizing map (SOM), or Kohonen network.
Through the research carried out in the literature, we pre-selected DenStream for comparison with the proposed algorithm. DenStream uses the concept of density, and it is relevant in the field because it served as the basis for many other evolving algorithms; it is also the algorithm most frequently used in comparisons, as in [3,13,18,19,20,21,22]. Furthermore, in addition to DenStream, Macro SOStream is also compared with its base algorithm, SOStream. This comparison is performed by analyzing the algorithms' performance using the Adjusted Rand Index (ARI) metric and their execution time.
The rest of the paper is organized as follows. Section 2 presents a background and related work to evolving algorithms. Section 3 explains in detail the algorithm Macro SOStream. Section 4 presents the experimental setup and an empirical performance evaluation, hyperparameter sensitivity analysis of the Macro SOStream algorithm using the ARI metric, and the algorithms’ execution time. Finally, the paper concludes in Section 5.

2. Background and Related Work

The development and use of evolving algorithms have been the subject of interest and study in recent years. These algorithms can develop and update themselves in unfamiliar environments and detect deviations in the input data over time. Thus, it is possible to solve problems in which changes in the data stream occur, e.g., detecting changes in the user profile on social networks [7,8,9,10,17].
A particularity of evolving systems is concept evolution; they simultaneously learn the structure and the parameters from the data stream. Here, a data stream is a continuous sequence of samples whose order matters; it has infinite size, i.e., a sample can be represented by x(t), where time t varies from 0 to infinity [23]. Due to their ability to adapt, evolving systems can be considered intelligent adaptive systems, whose structural components may be artificial neurons, production rules, fuzzy rules, data clusters, or subtrees [8,9].
It is significant to highlight that data stream clustering algorithms with online learning must deal adequately with both the shift and the concept drift problems arising from data with non-stationary statistics. Concept drift means an unpredictable change over time in the statistical properties of the input or target variables [24]. These changes can be in the distribution of the input data (input space drift), in the relationship between the inputs and the model (joint drift), or in the distribution of the respective target class (target concept drift) [8,9,25].
Figure 1 shows some types of concept drift, where the circles represent samples obtained over time. The changes can occur incrementally, with many phases in the migration of the concept; intensely (sudden or abrupt drift, shift drift, concept evolution); or gradually (local drift), activating the system's automatic evolution mechanism, known as passive drift handling [9,25,26]. There is also the case where earlier concepts reappear after some time, as shown in Figure 1d.
Still another form of concept deviation is cyclic drift, where changes in data deviations occur cyclically [9]. These changes in the dynamics of the data stream can be caused by seasonal effects, wear, contamination, a change of catalyst in the chemical industry, or operational changes, among other phenomena [8,9,26].
Another feature of evolving algorithms and methods is operation in single-data-pass mode. The calculations are performed recursively: the sample is discarded after processing, considering that only the latest data are available. The algorithm performs the clustering from the current sample and from statistical aggregations or model parameters, such as centroids and prototypes, obtained from previous samples [8].
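For illustration, the recursive style of computation can be sketched in a few lines of Python. This sketch is ours, not code from any of the cited algorithms; it maintains a centroid as a running mean so that each sample can be discarded immediately after processing:

```python
import numpy as np

def update_centroid(centroid: np.ndarray, n: int, sample: np.ndarray):
    """Recursive mean update: only the running centroid and the sample
    count are stored; the sample itself can be discarded afterwards."""
    n += 1
    centroid = centroid + (sample - centroid) / n
    return centroid, n

# Process a stream sample by sample, keeping no history.
centroid, n = np.zeros(2), 0
for sample in np.random.rand(1000, 2):
    centroid, n = update_centroid(centroid, n, sample)
```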
Some examples of usability of evolving systems in real applications are the use of TEDA (Typicality and Eccentricity Data Analytics) for the detection of failures in industrial processes [27], and spam detection in e-mail [28]. Another example is the optimization of routing in the MANET (Mobile Ad-hoc Network) and, consequently, reducing energy consumption and increasing the network’s useful life and stability using ICA (Imperialist Competitive Algorithm) [29]. Finally, in [30], a semi-supervised algorithm was used to classify data from eleven real-world datasets, some of which are electricity, chemical sensors, e-mails, and TCP type limit records.
Although the previous approaches are recent, evolving data stream clustering algorithms have been present in the literature for some time. Some examples include: CluStream [11], which is hierarchy-based and combines an offline part using K-Means with an online part that maintains the microclusters; DenStream [12], based on density, which has an initial offline phase for creating microclusters, an online learning phase for updating and creating new microclusters, and a final offline step that generates the macroclusters at the end of the execution; C-DenStream [13], an extension of DenStream that includes domain information for clusters and changes the offline phase, replacing DBSCAN with C-DBSCAN; and SDStream [14], based on radius and density for creating microclusters in the online training phase, which also has an offline training phase that generates the final macroclusters.
There are also HDDStream [31] and PreDeConStream [18], which are density-based, cluster high-throughput data, and have online and offline training phases. More recently, OEC [16] performs the data grouping using correlation; the SwarmStream algorithm [3] is inspired by self-organizing phenomena in complex systems; and Auto-Cloud [17] is based on the concept of Typicality and Eccentricity Data Analysis (TEDA) [32].
Another evolving algorithm is SOStream [15], with entirely online learning, based on self-organizing density (Self-Organizing Maps, SOM) for data stream clustering. It also uses the concept of microclusters. The algorithm comprises seven sub-algorithms, which describe the creation, determination of close neighbors, updating, search for intersection, fusion, and forgetting of microclusters; the cluster-forgetting sub-algorithm is optional.
Many recent and interesting proposals in the literature address the data stream clustering problem using the microcluster and macrocluster concepts. Among them, we can cite HCMstream [19], VHEC [33], BEstream [20], ACSC [34], DyClee [21], and DGStream [22]. HCMstream, Hyper-cylindrical micro-clustering for streaming data [19], proposes to use hyper-cylindrical forms to merge the spherical microclusters. The macroclusters are composed of connected microclusters and have arbitrary shapes. The proposal is interesting, and the results indicate its efficiency in determining arbitrarily shaped clusters.
The VHEC, Versatile Hyper-Elliptic Clustering [33], and BEstream, Batch Capturing with Elliptic Function for One-Pass Data Stream Clustering [20], create hyper-elliptic microclusters with arbitrary rotation. Both algorithms determine a new similarity distance measure to the data stream clustering, and the BEstream can be used in high-dimensional space. The ACSC or Ant Colony Stream Clustering algorithm [34] works with windows of the data stream and in two steps. First, the algorithm identifies clusters, and then these clusters are summarized in statistics and stored offline.
DyClee, dynamic clustering for tracking evolving environments [21], is a dynamic clustering algorithm that uses a two-stage, distance- and density-based clustering approach to determine the macroclusters. First, distance-based clustering obtains the microclusters; then, density-based clustering is used to eliminate outlier clusters (clusters with a small amount of data). The microcluster density is classified as low, medium, or high and is used to create the arbitrarily shaped macroclusters through a density-based approach. Thus, the macroclusters are composed of connected microclusters. Moreover, data samples are associated with a microcluster using the Manhattan distance as the similarity metric.
Finally, the DGStream, Density and Grid-based data Stream clustering [22], is an online-offline grid and density-based stream clustering algorithm. In its online phase, it uses a grid-based approach to represent the microcluster. While in its offline phase, DGStream employs the DBSCAN algorithm to determine the macroclusters, similar to the DenStream algorithm.
Table 1 summarizes the main features of these algorithms. Considering how the concepts of microclusters and macroclusters are adopted, our proposal, called Macro SOStream, can be considered similar to the HCMstream, BEstream, and DyClee algorithms: these algorithms have only online learning (except for BEstream), a merge function, and arbitrary macrocluster shapes. Macro SOStream differs in using spherical microclusters, which makes it simpler. Furthermore, Macro SOStream uses the Self-Organizing Map (SOM) concept to create and manage clusters. The SOM, or Kohonen network, maps high-dimensional data to a low-dimensional representation while preserving the characteristics of the data [35]. SOM learning is based on competition among neurons, where the winning neuron influences its neighbors. Thus, in Macro SOStream, the winning microcluster influences its neighboring microclusters.

3. Macro SOStream Algorithm

This section presents the Macro SOStream algorithm, an evolving algorithm for clustering data streams. To explain how the proposed algorithm works, Figure 2 shows the entire workflow performed by Macro SOStream. Moreover, Figure 3 illustrates part of the algorithm processing, where the small spheres are microclusters and dashed lines represent the macroclusters. This algorithm is based on SOStream (Self Organizing density-based clustering over data Stream) [15], which has only microclusters. In contrast, Macro SOStream creates arbitrary macroclusters online, as illustrated in Figure 3a, where the macroclusters are clusters of spherical microclusters.
Macro SOStream has the same three input hyperparameters as SOStream: the scale factor α, the minimum number of neighbors η, and the fusion limit γ. In addition, there is the new hyperparameter p, the decision threshold for whether or not v(t) belongs to a macrocluster. Here, the notation η is used to represent the minimum number of neighbors, whereas SOStream uses MinPts.
After the first step of Figure 2, the hyper-parameterization, the algorithm receives the d-dimensional input vector v(t) at time t, where t = 1, 2, 3, …, n (step 2). Initially, a macrocluster Macro_j with j = 1 is created, which will receive the first microclusters. The point v(t) will form the first η microclusters, since η represents the minimum number of microclusters in a macrocluster. Consequently, the algorithm checks the number of microclusters in Macro_j, as shown in step 3. Thus, in step 4, a microcluster Micro(t) is created at each instant of time t, where each microcluster carries the number of points assigned to it n_i, the radius r_i, and the centroid C_i.
Consequently, each microcluster is represented by the tuple Micro_i = (n_i, r_i, C_i), which corresponds to (1, 0, v(t)) when a microcluster starts. This tuple representation follows the same pattern present in SOStream. The macroclusters are represented by Macro_j = (n_j, C_j, micros_j), where n_j is the number of microclusters, C_j is the centroid of the macrocluster, and micros_j is the list of microclusters.
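As a concrete illustration of these tuples, the structures might be encoded as simple Python data classes. This is a sketch of our own; the class and field names are not taken from the authors' implementation:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Micro:
    n: int            # number of points assigned to the microcluster
    r: float          # radius
    C: np.ndarray     # centroid

    @classmethod
    def from_sample(cls, v: np.ndarray) -> "Micro":
        # A new microcluster starts as the tuple (1, 0, v(t)).
        return cls(n=1, r=0.0, C=v.copy())

@dataclass
class Macro:
    micros: list = field(default_factory=list)   # list of microclusters

    @property
    def n(self) -> int:
        return len(self.micros)                  # number of microclusters

    @property
    def C(self) -> np.ndarray:
        # Macrocluster centroid: arithmetic mean of the micro centroids.
        return np.mean([m.C for m in self.micros], axis=0)
```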
For every v(t) that arrives, the Euclidean distance between v(t) and all the centroids of the microclusters in Macro_j is calculated. The winner microcluster Micro_win is the one with the shortest distance to v(t) (step 5).
In step 6, the algorithm checks whether v(t) falls within the radius of the winner microcluster. If so, the winner microcluster has its centroid updated using the traditional arithmetic mean (step 7). The radius value is set to the η-th smallest distance between the winner and all its neighbors.
The algorithm also updates the centroids of the winner's neighboring microclusters (step 8), bringing them closer together through the following equation:

C_i(t+1) = C_i(t) + αβ(C_win(t) − C_i(t)).  (1)

Here, β is defined by:

β = exp(−d(C_i, C_win)² / (2 r_win²)).  (2)

Moreover, α is a hyperparameter, and C_win and r_win are the centroid and radius of the winner microcluster. This technique, already present in SOStream, is inspired by the weight update of the winning neuron in a Self-Organizing Map (SOM), or Kohonen network.
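Assuming the Micro class sketched above, Equations (1) and (2) translate into a few lines of Python. This is an illustrative sketch rather than the authors' code; for simplicity it visits every other microcluster of the macrocluster, whereas the algorithm restricts the update to the winner's neighborhood:

```python
import numpy as np

def update_neighborhood(micros, win_idx, alpha):
    """SOM-style update of Eqs. (1)-(2): pull microclusters toward the
    winner's centroid, weighted by their distance to the winner."""
    win = micros[win_idx]
    if win.r == 0:                                  # no neighborhood width yet
        return
    for i, m in enumerate(micros):
        if i == win_idx:
            continue
        d = np.linalg.norm(m.C - win.C)             # Euclidean distance
        beta = np.exp(-d**2 / (2 * win.r**2))       # Eq. (2)
        m.C = m.C + alpha * beta * (win.C - m.C)    # Eq. (1)
```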
For the case where v(t) does not belong to the winning microcluster, the distance between v(t) and the centroid of the winning microcluster determines whether the sample belongs to Macro_j (step 10). The following condition defines the distance threshold:

d(v(t), C_win) > p · r_win.  (3)

Here, p is a hyperparameter, and d(v(t), C_win) is the Euclidean distance between v(t) and the centroid of the winning microcluster C_win. If Equation (3) holds, v(t) does not belong to the analyzed Macro_j (step 11). Macro SOStream then verifies whether v(t) belongs to another existing macrocluster and, if not, creates a new one. Accordingly, the algorithm checks whether there is another macrocluster (Macro_{j+1}) to analyze, as shown in step 12. If there is no Macro_{j+1}, it creates the macrocluster Macro_{j+1} (step 14). Otherwise, it analyzes Macro_{j+1} in the same way as it did Macro_j (step 13). If Equation (3) is not satisfied, Macro_j receives the new microcluster containing v(t) (step 16).
When it creates or updates a microcluster, the algorithm updates the number (n_j), the centroid (C_j), and the list of microclusters (micros_j) of the macrocluster. A simple arithmetic mean is then used to find the new centroid of the macrocluster.
In step 9, after assigning v(t) to a microcluster, the algorithm checks for overlap using the following condition:

d(Micro_win, Micro_i) − (r_win + r_i) < 0,  (4)

where Micro_win and r_win are the winning microcluster and its radius, Micro_i ranges over the other microclusters of Macro_j, and r_i is their radius.
In the next step of the algorithm, if Equation (4) is true, the following condition is verified:

d(Micro_win, Micro_i) < γ,  (5)

where the hyperparameter γ is the merge threshold.
If Equation (5) is satisfied, the algorithm merges the overlapping microclusters according to the following equations:

r_merged = d(C_a, C_b) + max(r_a, r_b),  (6)

n_merged = n_a + n_b,  (7)

C_merged = (C_a · n_a + C_b · n_b) / (n_a + n_b).  (8)
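Using the Micro class from the earlier sketch, the merge step of Equations (6)-(8) can be written as follows (again, an illustration under our assumed data structures, not the authors' code):

```python
import numpy as np

def merge_micros(a: "Micro", b: "Micro") -> "Micro":
    """Merge two overlapping microclusters following Eqs. (6)-(8)."""
    d = np.linalg.norm(a.C - b.C)
    r = d + max(a.r, b.r)                         # Eq. (6): enclosing radius
    n = a.n + b.n                                 # Eq. (7): point counts add up
    C = (a.C * a.n + b.C * b.n) / (a.n + b.n)     # Eq. (8): count-weighted mean
    return Micro(n=n, r=r, C=C)
```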
Then, Macro SOStream checks whether there is more than one macrocluster (step 17). If not, sample v(t+1) arrives for analysis, and the flow continues, going back to step 3. If so, the algorithm checks whether the existing macroclusters overlap (step 18).
For the macrocluster overlap analysis in step 18, the algorithm uses the microcluster that received v(t) and seeks the nearest microcluster belonging to the other macroclusters. It then checks whether the microclusters overlap using Equations (4) and (5).
Figure 4 illustrates three situations involving microclusters and macroclusters. The small spheres are microclusters, and dashed lines represent the macroclusters. In Figure 4a, the macroclusters are far from each other. In Figure 4b, suppose that v(t) belongs to one of the large microclusters; Macro SOStream then perceives that it is close to another macrocluster. If Equations (4) and (5) are satisfied, the macroclusters merge, resulting in Figure 4c.
Macro SOStream uses Equations (7) and (8) to merge the clusters and joins the macroclusters' lists of microclusters. Then, the algorithm receives v(t+1), and the whole cycle is repeated.
As Macro SOStream is based on SOStream, some of the steps in Figure 2 also belong to SOStream: steps 1, 2, 3, 4, 5, 6, 7, 8, 9, 16, and 19. The remaining steps are present only in Macro SOStream, as they are related to the creation and merging of macroclusters. In addition, the complexity of Macro SOStream is O(m n k log k), where k is the number of samples, n is the number of microclusters, and m is the number of macroclusters.

4. Experimental Results and Discussions

This paper’s experimental configurations are explained below, indicating the used datasets and adopted evaluation metrics. The results obtained with Macro SOStream will also be presented with SOStream and DenStream, and the algorithms’ average execution time, for comparison purposes.

4.1. Experimental Setup

Four synthetic datasets (https://github.com/AndressaStefany/datastream/tree/v1.0, accessed on 18 June 2022) were used to validate the Macro SOStream algorithm. Two of them come from Scikit-Learn [36], a well-known Python package for machine learning. The authors generated the third dataset, and the fourth dataset was proposed in [37]. In addition, the Adjusted Rand Index (ARI) metric was used to assess the performance of the clustering algorithms.

4.1.1. Datasets

All the synthetic datasets are 2-dimensional, as illustrated in Figure 5, Figure 6, Figure 7 and Figure 8. The first data stream is derived from the Circles data of Scikit-Learn, modified as shown in Figure 5b. Initially, 5000 samples were generated from the Circles dataset with the scale factor set to 0.5 and the noise level to 0.05. Then, only the samples with y values less than 0.5 (outer ring) and 0.2 (inner ring) are kept. Figure 5a shows the order in which the samples arrive at the algorithm; this data stream is named Horseshoe. The final dataset has 3247 samples (Table 2), with values between (1.11, 0.12) and (1.13, 2.07).
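The generation step described above can be sketched with Scikit-Learn's make_circles, whose factor and noise parameters match the stated scale factor and noise level. The filtering below follows the description in the text, but the exact filtering and the reordering into the stream of Figure 5a may differ from the authors' procedure:

```python
from sklearn.datasets import make_circles

# 5000 samples; label 0 is the outer circle, label 1 the inner circle.
X, y = make_circles(n_samples=5000, factor=0.5, noise=0.05)
outer = X[(y == 0) & (X[:, 1] < 0.5)]   # outer ring, y-coordinate below 0.5
inner = X[(y == 1) & (X[:, 1] < 0.2)]   # inner ring, y-coordinate below 0.2
```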
The second dataset, named T, is illustrated in Figure 6b. The stream starts with 100 samples from the branch numbered one and then 100 samples from the branch numbered two, taking 100 samples of each class alternately. It has 3744 samples belonging to a single cluster in its final configuration, as shown in Table 2.
The third dataset is the Moon dataset from Scikit-Learn [36]. In total, 5000 samples were generated, as described in Table 2. The noise level is set to 0.05, and the scale factor to 0.5. Furthermore, for the data to form a data stream, the samples' order has been modified: the algorithm first receives 100 samples from the place numbered one in Figure 7a, then 100 samples from place two, and so on.
Finally, the Three Spirals dataset [37] initially had 312 samples, but it was modified to be denser by averaging existing samples, resulting in 621 samples. Each spiral in Figure 8b represents a class, for a total of three classes. Figure 8a shows the order of the data stream: initially, the algorithm receives the class 1 samples, from the outside toward the interior of the spiral; this is repeated for classes 2 and 3, respectively.

4.1.2. Evaluation Metric

The Adjusted Rand Index (ARI) metric is the Rand Index (RI) adjusted for chance. The ARI measures the similarity between the actual labels and the labels obtained by the algorithm while ignoring permutations of the cluster labels. That is, renaming labels 0 and 1 to 2 and 3 yields the same score; a score of 1.0 represents a perfect labeling, and the lower the value (0.0 or negative), the more incorrect the labeling is [38,39].
The following equation defines the RI metric:

RI = (a + b) / C(n_samples, 2),  (9)

where a is the number of pairs of elements placed in the same set in both the actual and the obtained labels, b is the number of pairs placed in different sets in both labelings, and C(n_samples, 2) is the total number of possible pairs in the dataset, disregarding order.
Then, the ARI metric is defined as:

ARI = (RI − E[RI]) / (max(RI) − E[RI]),  (10)

where E[RI] is the expected RI, and max(RI) is the maximum RI, which is usually replaced by an upper bound [38].
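In practice, the ARI can be computed directly with Scikit-Learn's adjusted_rand_score, which implements Equation (10) and, as stated above, ignores label permutations:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1]
pred_labels = [2, 2, 3, 3]   # same partition under a label permutation
print(adjusted_rand_score(true_labels, pred_labels))   # prints 1.0
```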

4.2. Results

In each experiment’s execution with the different datasets, there was an exhaustive search for each of the analyzed algorithms’ hyperparameters. The final choice for these hyperparameters was made, taking into account the ARI metric score’s return. All results presented in this section were obtained using the Google collaboratory platform [40]. The algorithms were implemented in Python https://github.com/AndressaStefany/evolving_systems/tree/v1.0, accessed on 18 June 2022.

4.3. Macro SOStream

Initially, the macrocluster creation function was validated. For this, we used the Horseshoe dataset with α = 0.001, η = 2, γ = 0.01, and p = 1.5. Figure 9 illustrates how clustering occurs. The first macrocluster is created with the data from the outer ring of the dataset, as shown in Figure 9a. Although the samples exhibit data drift behavior over time, the algorithm identifies that they belong to the same macrocluster. Only when the inner-ring samples arrive at the algorithm is a new macrocluster created, as shown in Figure 9b. The final clustering result for the Horseshoe dataset is illustrated in Figure 9c.
In the second experiment, the macrocluster merge function was validated. For that, we used the dataset named T with α = 0.001, η = 2, γ = 0.0001, and p = 2.5. Figure 10 illustrates the clustering behavior: Figure 10a shows that the algorithm initially detects two macroclusters. Over time, it realizes that there is just one macrocluster, as shown in Figure 10b. Figure 10c illustrates the microclusters identified within the macrocluster.
Another dataset is the Moon, with α = 0.001, η = 2, γ = 0.01, and p = 2.5. As seen in Figure 11a, the algorithm initially detected four macroclusters because of the samples' order. After some macroclusters are merged, only two remain, as illustrated in Figure 11b. Figure 11c shows the microclusters.
The best hyper-parameterization of Macro SOStream for clustering the Three Spirals dataset is α = 0.65, η = 3, γ = 0.0001, and p = 2.0. Figure 12a shows the clustering of the first 261 samples, with two macroclusters identified. Figure 12b illustrates the final clustering result, and Figure 12c presents the microclusters found. Thus, Macro SOStream clustered correctly, even under the concept drift of the Three Spirals dataset.
We exhaustively searched for the optimal hyperparameters and used the Parallel Categories Diagram from the Plotly package to evaluate the performance. Figure 13a shows the variation of the Macro SOStream hyperparameters and the ARI metric for the Horseshoe dataset. The graph columns represent the hyperparameters; the lines are the connections between the hyperparameter values, and their colors indicate the ARI value obtained for each configuration. The effect of η, the minimum number of clusters, is visible: η equal to 2 performs better. Moreover, p equal to 1.5 may have been decisive for the best ARI value.
In the clustering of the T dataset, as illustrated in Figure 13b, most hyperparameter configurations result in ARI = 1.0. This excellent performance may be because the T dataset has just one cluster, obtained by applying the merge function.
The Macro SOStream hyperparameters’ variation with the Moon dataset shows a behavior similar to the Horseshoe dataset variation. Figure 13c shows that the η and p values are decisive in obtaining a good ARI metric value. When η is equal to 2, and the p values are between 2 and 3, the performance is better. However, the hyperparameter intervals with the great ARI metrics are lower with the Horseshoe dataset than with the Moon dataset variation.
Regarding the Three Spirals dataset, the η and p hyperparameters remain decisive for a high ARI value, as illustrated in Figure 13d. The difference between Figure 13d and the other parallel-coordinates figures lies in the α hyperparameter: its values are larger because of the magnitude of the sample values in the Three Spirals dataset.

4.4. SOStream

The SOStream algorithm clustered the Horseshoe, T, Moon, and Three Spirals datasets as shown in Figure 14. In the Horseshoe dataset, four microclusters were found with α = 0.015, MinPts = 4, and γ = 0.1. In the T dataset, SOStream identified six microclusters with α = 0.001, MinPts = 5, and γ = 0.1, while in the Moon dataset, five microclusters were found with α = 0.001, MinPts = 5, and γ = 0.1 (Table 4). Finally, the algorithm found 226 microclusters in the Three Spirals dataset with α = 1, MinPts = 2, and γ = 0.3. As with Macro SOStream, an exhaustive search for the best parameters was performed; however, across the datasets, the maximum ARI values remained low, around 0.38 (and 0.0 for the T dataset), as illustrated in Figure 15.
The poor performance of SOStream occurs because the algorithm works only with spherical microclusters. Since the datasets have arbitrarily shaped clusters, SOStream cannot find the clusters correctly. Moreover, to better compare the methods, the same hyperparameter values used in Macro SOStream were also applied; nevertheless, a significant discrepancy in the ARI values was observed.

4.5. DenStream

Figure 16 shows the clustering of the DenStream algorithm on the Horseshoe, T, Moon, and Three Spirals datasets. In the Horseshoe dataset, two macroclusters were found with ϵ = 0.01, λ = 0.001, β = 0.35, and μ = 6. In the T dataset, DenStream identified one macrocluster with ϵ = 0.01, λ = 0.001, β = 0.35, and μ = 3. In the Moon dataset, two macroclusters were found with ϵ = 0.01, λ = 0.001, β = 0.35, and μ = 3. The algorithm obtained the maximum ARI in these three cases. However, DenStream could not find the clusters in the Three Spirals dataset correctly: the best settings were ϵ = 2.5, λ = 0.01, β = 0.35, and μ = 3, which resulted in 21 macroclusters and ARI = 0.20. As for the other algorithms, an exhaustive search for the best parameters was performed, as illustrated in Figure 17.
The performance of the algorithm is shown in Figure 17, where Epsilon (ϵ) and Lambda (λ) are the hyperparameters most critical to the best ARI value. In general, the intervals with Epsilon equal to 0.01 and Lambda between 0.001 and 0.01 produced excellent results, except in Figure 17d. Despite the exhaustive search for the best hyperparameters, it was impossible to find high ARI values for the Three Spirals dataset, as shown in Figure 17d. Moreover, a major limitation of DenStream is its offline learning: the final macroclusters are only found after all samples have arrived at the algorithm.

4.6. Comparison between Algorithms

Table 3 presents the best ARI values with their respective hyperparameters; note that, in some cases, there is more than one excellent hyperparameter set.
We can see that SOStream had the worst results, while Macro SOStream and DenStream reached the maximum ARI value. Furthermore, the values of the Macro SOStream and DenStream hyperparameters are similar across the datasets. Although DenStream attains the maximum ARI over a wide range of hyperparameters in three of the four datasets, it performs offline learning at two moments, where it uses DBSCAN. In contrast, Macro SOStream learns online at all times of execution and still manages to reach the maximum ARI value.
Table 4 shows the number of microclusters and macroclusters identified by each algorithm with the settings of Table 3. It confirms that Macro SOStream and DenStream correctly matched the number of clusters present in most datasets; DenStream only failed when clustering the Three Spirals dataset.
Table 5 presents the execution times of the evolving algorithms. For each dataset, 100 simulations of each algorithm were performed with the hyperparameters of Table 3. From the simulations, the total execution time of each algorithm is reported in seconds, and, in the same way, the time spent per sample is reported in milliseconds.
An Intel Xeon processor (family 6, one physical core, two threads per core, operating at 2 GHz) was used, together with 13 GB of RAM and the Ubuntu 18.04.5 LTS operating system. The running time was measured using the process_time() function of the Python time module, which returns the sum of the system and user CPU time of the current process.
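The measurement scheme can be reproduced with a small helper around process_time(). This is a sketch of our own; the placeholder clustering step stands in for an actual evolving clusterer:

```python
import time
import numpy as np

def timed_run(cluster_fn, stream):
    """Total CPU time (user + system) and mean time per sample."""
    start = time.process_time()
    for sample in stream:
        cluster_fn(sample)                    # one clustering step per sample
    total_s = time.process_time() - start
    return total_s, 1000 * total_s / len(stream)

# Usage with a trivial placeholder step standing in for a clusterer:
stream = np.random.rand(5000, 2)
total_s, per_sample_ms = timed_run(lambda v: np.linalg.norm(v), stream)
print(f"{total_s:.2f} s total, {per_sample_ms:.3f} ms per sample")
```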
In the clustering of the Horseshoe dataset, the SOStream algorithm obtained the best total execution time, spending 12.17 s. However, SOStream produced a low ARI value of 0.358, as shown in Table 3. In contrast, Macro SOStream and DenStream obtained the maximum ARI value, and between them, Macro SOStream achieved the shorter processing time, 20.90 s.
SOStream also had the shortest execution time on the T dataset, 16.01 s. However, as shown in Table 3 and Table 5, SOStream had the poorest clustering performance. The second least costly algorithm was Macro SOStream, which analyzed the data in 22.31 s, followed by DenStream with 41.79 s.
Regarding the Moon dataset, SOStream once again stood out with 28.79 s of processing time, and the algorithm with the second-best time was Macro SOStream, with 41.84 s. An important detail is that SOStream works only with microclusters, which may be why it runs faster than the others, which also use the concept of macroclusters.
Unlike in the other cases, the algorithm with the lowest processing time on the Three Spirals dataset was DenStream, with 0.36 s; SOStream and Macro SOStream required 0.84 and 7.22 s, respectively. One difference between the Three Spirals dataset and the others is the number of samples, which is far smaller. Thus, it is interesting to note in Table 5 that SOStream and Macro SOStream achieved lower per-sample processing times on the datasets with larger numbers of samples.
In addition to the algorithms’ total execution times, we also obtained the average time that the algorithm took to analyze each sample of the datasets, as shown in the Sample column in Table 5. In the sample column, it is noticed that the values of DenStream are the lowest, 4.15 , 4.13 , 4.65 , and 0.15 milliseconds for the sets Horseshoe, T, Moon, and Three Spirals, respectively. However, this execution time for the DenStream algorithm is only of the online learning phase, as it is impossible to measure each sample’s time in the offline phase. It is also observed that, despite this low measurement per sample, DenStream has the highest processing values of the total time of the datasets. Thus, the predominant factor of DenStream being the most computationally expensive algorithm may be related to the processing time in the algorithm’s offline phases. As shown in Table 5, the execution times for this offline phase were 19.12 , 27.09 , 41.21 , and 0.27 s for Horseshoe, T, Moon, and Three Spirals bases, respectively.
Regarding the per-sample analysis time of Macro SOStream and SOStream, Macro SOStream's execution time was expected to be higher than SOStream's because of the added macrocluster creation function. Another point to note is that, although the average time per sample of Macro SOStream is about 50% greater than that of DenStream, Macro SOStream's total execution time is about 35% lower than DenStream's, except for the Three Spirals dataset. These results show that, for three of the four datasets, Macro SOStream stood out in terms of execution time.

5. Conclusions

In this work, we proposed Macro SOStream, an evolving algorithm for dynamic data stream clustering with online learning. Its clustering performance was compared to that of the SOStream and DenStream algorithms on well-known synthetic datasets. The results obtained with Macro SOStream were promising: it grouped the data stream properly, even in data drift and data evolution scenarios.
Macro SOStream is based on the SOStream evolving algorithm but adopts the concept of the macrocluster, which allows the creation of clusters with arbitrary shapes. Due to this modification, Macro SOStream can obtain better performance than SOStream. The macrocluster-merging functionality proved very relevant to Macro SOStream's good performance, which explains why Macro SOStream achieved ARI = 1.0 on the four synthetic datasets, while SOStream obtained values around ARI ≈ 0.38 for the Horseshoe, Moon, and Three Spirals datasets and ARI = 0.0 for the T dataset.
The hyperparameter sensitivity analysis indicated that Macro SOStream performs similarly to DenStream when both use their best hyperparameter configurations. DenStream tolerates a wider hyperparameter range than Macro SOStream but differs in its learning mode: it has two moments of offline learning during clustering, which increases the total execution time and makes it more costly than Macro SOStream. In contrast, Macro SOStream learns fully online throughout its execution and achieves a better execution time than DenStream.
Furthermore, a limitation of this work is the use of only synthetic datasets to validate Macro SOStream. For future work, we plan to apply the Macro SOStream algorithm to real-world datasets to assess its quality and performance in those cases. Another point is the execution time of the proposed algorithm, which is longer than that of SOStream, since it adds one more step to that algorithm; DenStream, in turn, combines offline and online execution time.

Author Contributions

Conceptualization, A.S.O.; methodology, A.S.O.; software, A.S.O. and R.S.d.A.; validation, A.S.O.; investigation, A.S.O.; writing—original draft preparation, A.S.O.; writing—review and editing, R.S.d.A. and L.A.G.; supervision, L.A.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) [Finance Code 001].

Data Availability Statement

The data used can be found at https://github.com/AndressaStefany/datastream/tree/v1.0, accessed on 18 June 2022.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Alpaydin, E. Introduction to Machine Learning, 2nd ed.; The MIT Press: London, UK, 2010; Volume 1107, pp. 105–128. [Google Scholar]
  2. Pimentel, M.A.; Clifton, D.A.; Clifton, L.; Tarassenko, L. A review of novelty detection. Signal Process. 2014, 99, 215–249. [Google Scholar] [CrossRef]
  3. Crnkić, A.; Ivanović, I.; Jaćimović, V.; Mijajlović, N. Swarms on the 3-sphere for online clustering of multivariate time series and data streams. Future Gener. Comput. Syst. 2020, 112, 11–17. [Google Scholar] [CrossRef]
  4. Nawaratne, R.; Alahakoon, D.; De Silva, D.; Chhetri, P.; Chilamkurti, N. Self-evolving intelligent algorithms for facilitating data interoperability in IoT environments. Future Gener. Comput. Syst. 2018, 86, 421–432. [Google Scholar] [CrossRef]
  5. Muller, A.C.; Guido, S. Introduction to Machine Learning with Python, 1st ed.; O’Reilly: Sebastopol, CA, USA, 2016; pp. 121–130. [Google Scholar]
  6. Angelov, P. Autonomous Learning Systems; John Wiley & Sons, Ltd.: Chichester, UK, 2013. [Google Scholar] [CrossRef]
  7. Kokate, U.; Deshpande, A.; Mahalle, P.; Patil, P. Data Stream Clustering Techniques, Applications, and Models: Comparative Analysis and Discussion. Big Data Cogn. Comput. 2018, 2, 32. [Google Scholar] [CrossRef] [Green Version]
  8. Angelov, P.P.; Gu, X. Empirical Approach to Machine Learning; Studies in Computational Intelligence; Springer International Publishing: Cham, Switzerland, 2019; Volume 800, pp. 2981–2993. [Google Scholar] [CrossRef]
  9. Škrjanc, I.; Iglesias, J.A.; Sanchis, A.; Leite, D.; Lughofer, E.; Gomide, F. Evolving fuzzy and neuro-fuzzy approaches in clustering, regression, identification, and classification: A Survey. Inf. Sci. 2019, 490, 344–368. [Google Scholar] [CrossRef]
  10. Leite, D.; Škrjanc, I.; Gomide, F. An overview on evolving systems and learning from stream data. Evol. Syst. 2020, 11, 181–198. [Google Scholar] [CrossRef]
  11. Aggarwal, C.C.; Yu, P.S.; Han, J.; Wang, J. A Framework for Clustering Evolving Data Streams. In Proceedings of the 2003 VLDB Conference, Berlin, Germany, 9–12 September 2003; Elsevier: Amsterdam, The Netherlands, 2003; pp. 81–92. [Google Scholar] [CrossRef]
  12. Cao, F.; Estert, M.; Qian, W.; Zhou, A. Density-Based Clustering over an Evolving Data Stream with Noise. In Proceedings of the 2006 SIAM International Conference on Data Mining, Bethesda, MD, USA, 20–22 April 2006; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2006; Volume 2006, pp. 328–339. [Google Scholar] [CrossRef] [Green Version]
  13. Ruiz, C.; Menasalvas, E.; Spiliopoulou, M. C-DenStream: Using Domain Knowledge on a Data Stream. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2009; Volume 5808 LNAI, pp. 287–301. [Google Scholar] [CrossRef]
  14. Ren, J.; Ma, R. Density-Based Data Streams Clustering over Sliding Windows. In Proceedings of the 2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery, Tianjin, China, 14–16 August 2009; Volume 5, pp. 248–252. [Google Scholar] [CrossRef]
  15. Isaksson, C.; Dunham, M.H.; Hahsler, M. SOStream: Self organizing density-based clustering over data stream. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2012; Volume 7376 LNAI, pp. 264–278. [Google Scholar] [CrossRef]
  16. Moshtaghi, M.; Leckie, C.; Bezdek, J.C. Online Clustering of Multivariate Time-series. In Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, FL, USA, 5–7 May 2016; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2016; pp. 360–368. [Google Scholar] [CrossRef] [Green Version]
  17. Bezerra, C.G.; Costa, B.S.J.; Guedes, L.A.; Angelov, P.P. An evolving approach to data streams clustering based on typicality and eccentricity data analytics. Inf. Sci. 2020, 518, 13–28. [Google Scholar] [CrossRef]
  18. Hassani, M.; Spaus, P.; Gaber, M.M.; Seidl, T. Density-based projected clustering of data streams. In Proceedings of the International Conference on Scalable Uncertainty Management, Marburg, Germany, 17–19 September 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 311–324. [Google Scholar] [CrossRef] [Green Version]
  19. Laohakiat, S.; Phimoltares, S.; Lursinsap, C. Hyper-cylindrical micro-clustering for streaming data with unscheduled data removals. Knowl.-Based Syst. 2016, 99, 183–200. [Google Scholar] [CrossRef]
  20. Wattanakitrungroj, N.; Maneeroj, S.; Lursinsap, C. BEstream: Batch Capturing with Elliptic Function for One-Pass Data Stream Clustering. Data Knowl. Eng. 2018, 117, 53–70. [Google Scholar] [CrossRef]
  21. Barbosa Roa, N.; Travé-Massuyès, L.; Grisales-Palacio, V.H. DyClee: Dynamic clustering for tracking evolving environments. Pattern Recognit. 2019, 94, 162–186. [Google Scholar] [CrossRef] [Green Version]
  22. Ahmed, R.; Dalkılıç, G.; Erten, Y. DGStream: High quality and efficiency stream clustering algorithm. Expert Syst. Appl. 2020, 141, 112947. [Google Scholar] [CrossRef]
  23. Gruhl, C.; Sick, B.; Tomforde, S. Novelty detection in continuously changing environments. Future Gener. Comput. Syst. 2021, 114, 138–154. [Google Scholar] [CrossRef]
  24. Lughofer, E.; Angelov, P. Handling drifts and shifts in on-line data streams with evolving fuzzy systems. Appl. Soft Comput. 2011, 11, 2057–2068. [Google Scholar] [CrossRef] [Green Version]
  25. Khamassi, I.; Sayed-Mouchaweh, M.; Hammami, M.; Ghédira, K. Discussion and review on evolving data streams and concept drift adapting. Evol. Syst. 2018, 9, 1–23. [Google Scholar] [CrossRef]
  26. Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. 2014, 46, 1–37. [Google Scholar] [CrossRef]
  27. Bezerra, C.G.; Costa, B.S.J.; Guedes, L.A.; Angelov, P.P. An evolving approach to unsupervised and Real-Time fault detection in industrial processes. Expert Syst. Appl. 2016, 63, 134–144. [Google Scholar] [CrossRef] [Green Version]
  28. Soares, E.; Garcia, C.; Poucas, R.; Camargo, H.; Leite, D. Evolving Fuzzy Set-based and Cloud-based Unsupervised Classifiers for Spam Detection. IEEE Lat. Am. Trans. 2019, 17, 1449–1457. [Google Scholar] [CrossRef]
  29. Sharifi, S.A.; Babamir, S.M. The clustering algorithm for efficient energy management in mobile ad-hoc networks. Comput. Netw. 2020, 166, 106983. [Google Scholar] [CrossRef]
  30. Ud Din, S.; Shao, J.; Kumar, J.; Ali, W.; Liu, J.; Ye, Y. Online reliable semi-supervised learning on evolving data streams. Inf. Sci. 2020, 525, 153–171. [Google Scholar] [CrossRef]
  31. Ntoutsi, I.; Zimek, A.; Palpanas, T.; Kröger, P.; Kriegel, H.P. Density-based Projected Clustering over High Dimensional Data Streams. In Proceedings of the 2012 SIAM International Conference on Data Mining, Anaheim, CA, USA, 26–28 April 2012; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2012; pp. 987–998. [Google Scholar] [CrossRef] [Green Version]
  32. Angelov, P. Anomaly detection based on eccentricity analysis. In Proceedings of the 2014 IEEE Symposium on Evolving and Autonomous Learning Systems (EALS), Orlando, FL, USA, 9–12 December 2014; pp. 1–8. [Google Scholar] [CrossRef]
  33. Wattanakitrungroj, N.; Maneeroj, S.; Lursinsap, C. Versatile Hyper-Elliptic Clustering Approach for Streaming Data Based on One-Pass-Thrown-Away Learning. J. Classif. 2017, 34, 108–147. [Google Scholar] [CrossRef]
  34. Fahy, C.; Yang, S.; Gongora, M. Ant Colony Stream Clustering: A Fast Density Clustering Algorithm for Dynamic Data Streams. IEEE Trans. Cybern. 2019, 49, 2215–2228. [Google Scholar] [CrossRef] [PubMed]
  35. Asan, U.; Soyer, A.; Serdarasan, S. Computational Intelligence Systems in Industrial Engineering; Atlantis Computational Intelligence Systems; Atlantis Press: Paris, France, 2012; Volume 6, pp. 469–479. [Google Scholar] [CrossRef]
  36. Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; et al. API design for machine learning software: Experiences from the scikit-learn project. In Proceedings of the ECML PKDD Workshop: Languages for Data Mining and Machine Learning, Prague, Czech Republic, 23–27 September 2013; pp. 108–122. [Google Scholar]
  37. Chang, H.; Yeung, D.Y. Robust path-based spectral clustering. Pattern Recognit. 2008, 41, 191–203. [Google Scholar] [CrossRef]
  38. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  39. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218. [Google Scholar] [CrossRef]
  40. Bisong, E. Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Apress: Berkeley, CA, USA, 2019; pp. 59–64. [Google Scholar] [CrossRef]
Figure 1. Four types of patterns of changes over time.
Figure 2. Diagram of the Macro SOStream algorithm, where the representation (XX) indicates the steps cited in the text.
Figure 3. Clustering process in Macro SOStream. (a) Macrocluster with arbitrary form. (b) Macroclusters with one and many microclusters. (c) Evolution of macroclusters over time.
Figure 4. Merging process in Macro SOStream. (a) Two arbitrary-shaped macroclusters. (b) The two macroclusters approached with the arrival of new samples. (c) A macrocluster resulting from merging the two old macroclusters represented in (b).
Figure 5. The Horseshoe synthetic dataset. (a) The first 1870 samples from the Horseshoe dataset. (b) The Horseshoe dataset.
Figure 6. The T synthetic dataset. (a) The first 1800 samples from the T dataset. (b) The T dataset.
Figure 7. The Moon synthetic dataset. (a) The first 3500 samples from the Moon dataset. (b) The Moon dataset.
Figure 8. The Three Spirals synthetic dataset. (a) The first 261 samples from the Three Spirals dataset. (b) The Three Spirals dataset.
Figure 9. Validation of the Macro SOStream create function with the Horseshoe dataset. (a) Clustering of the first 800 samples. (b) Clustering of the first 1950 samples. (c) Final result, where different colors represent different macroclusters.
Figure 10. Validation of the Macro SOStream merge function with the T dataset. (a) Clustering of the first 1800 samples forming two distinct macroclusters, represented by different colors. (b) Final result. (c) Representation of the microclusters.
Figure 11. Clustering of the Moon dataset with Macro SOStream. (a) Clustering of the first 3500 samples forming four distinct macroclusters, represented by different colors. (b) Final result with two macroclusters. (c) Representation of the microclusters.
Figure 12. Clustering of the Three Spirals dataset with Macro SOStream. (a) Clustering of the first 261 samples forming two distinct macroclusters, represented by different colors. (b) Final result with three macroclusters. (c) Representation of the microclusters.
Figure 13. Variation of Macro SOStream hyperparameters and ARI metric values. (a) Clustering with the Horseshoe dataset. (b) Clustering with the T dataset. (c) Clustering with the Moon dataset. (d) Clustering with the Three Spirals dataset.
Figure 14. Clustering using the SOStream algorithm. (a) Clustering with the Horseshoe dataset. (b) Clustering with the T dataset. (c) Clustering with the Moon dataset. (d) Clustering with the Three Spirals dataset.
Figure 15. Variation of SOStream hyperparameters and ARI metric values. (a) Clustering with the Horseshoe dataset. (b) Clustering with the T dataset. (c) Clustering with the Moon dataset. (d) Clustering with the Three Spirals dataset.
Figure 16. Clustering using the DenStream algorithm. (a) Clustering with the Horseshoe dataset. (b) Clustering with the T dataset. (c) Clustering with the Moon dataset. (d) Clustering with the Three Spirals dataset.
Figure 17. Variation of DenStream hyperparameters and ARI metric values. (a) Clustering with the Horseshoe dataset. (b) Clustering with the T dataset. (c) Clustering with the Moon dataset. (d) Clustering with the Three Spirals dataset.
Table 1. Summary of evolving algorithms.

Algorithm | Learning | Merge | Microcluster | Macrocluster
CluStream [11] | Offline & Online | Yes | Spherical | Spherical
DenStream [12] | Offline & Online | Yes | Spherical | Arbitrary
C-DenStream [13] | Offline & Online | Yes | Spherical | Arbitrary
SDStream [14] | Offline & Online | Yes | Spherical | Arbitrary
HDDStream [31] | Offline & Online | Yes | Spherical | Arbitrary
PreDeConStream [18] | Offline & Online | Yes | Spherical | Arbitrary
ACSC [34] | Offline & Online | Yes | Spherical | Arbitrary
VHEC [33] | Offline & Online | Yes | Elliptical | -
BEstream [20] | Offline & Online | Yes | Elliptical | -
DGStream [22] | Offline & Online | No | Grid | Arbitrary
TEDA [32] | Online | No | - | Arbitrary
OEC [16] | Online | No | - | Elliptical
SwarmStream [3] | Online | No | - | Arbitrary
Auto-Cloud [17] | Online | Yes | - | Spherical
SOStream [15] | Online | Yes | Spherical | -
HCMstream [19] | Online | Yes | Sphere/Cylinder | Arbitrary
DyClee [21] | Online | Yes | Hyperbox | Arbitrary
Macro SOStream | Online | Yes | Spherical | Arbitrary
Table 2. The four synthetic datasets used.

Dataset | N° of Samples | N° of Clusters
Horseshoe | 3247 | 2
T | 3744 | 1
Moon | 5000 | 2
Three Spirals | 621 | 3
Table 3. The algorithms’ performances.
Table 3. The algorithms’ performances.
AlgorithmDatasetARIHyperparameters
MacroHorseshoe 1.0 α = 0.001 ; η = 2 ; γ = 0.01 ; p = 1.5
SOStreamT 1.0 α = 0.001 ; η = 2 ; γ = 0.0001 ; p = 2.5
Moon 1.0 α = 0.001 ; η = 2 ; γ = 0.01 ; p = 2.5
Three Spirals 1.0 α = 0.65 ; η = 3 ; γ = 0.0001 ; p = 2.0
SOStreamHorseshoe 0.358 α = 0.015 ; M i n P t s = 4 ; γ = 0.1
T 0.0 α = 0.001 ; M i n P t s = 5 ; γ = 0.1
Moon 0.302 α = 0.001 ; M i n P t s = 5 ; γ = 0.1
Three Spirals 0.416 α = 1 ; M i n P t s = 2 ; γ = 0.3
DenStreamHorseshoe 1.0 ϵ = 0.01 ; λ = 0.001 ; β = 0.35 ; μ = 6
T 1.0 ϵ = 0.01 ; λ = 0.001 ; β = 0.35 ; μ = 3
Moon 1.0 ϵ = 0.01 ; λ = 0.001 ; β = 0.35 ; μ = 3
Three Spirals 0.201 ϵ = 2.5 ; λ = 0.01 ; β = 0.35 ; μ = 3
Table 4. The number of clusters found by the evolving algorithms.

Algorithm | Dataset | Microclusters | Macroclusters
Macro SOStream | Horseshoe | 120 | 2
Macro SOStream | T | 81 | 1
Macro SOStream | Moon | 129 | 2
Macro SOStream | Three Spirals | 232 | 3
SOStream | Horseshoe | 4 | -
SOStream | T | 6 | -
SOStream | Moon | 5 | -
SOStream | Three Spirals | 226 | -
DenStream | Horseshoe | 502 | 2
DenStream | T | 888 | 1
DenStream | Moon | 1172 | 2
DenStream | Three Spirals | 21 | 21
Table 5. Average processing time of the evolving algorithms.

Algorithm | Dataset | Time (s) | Offline (s) | Sample (ms)
Macro SOStream | Horseshoe | 20.90 | - | 6.51
Macro SOStream | T | 22.31 | - | 5.84
Macro SOStream | Moon | 41.84 | - | 8.37
Macro SOStream | Three Spirals | 7.22 | - | 4.65
SOStream | Horseshoe | 12.17 | - | 3.74
SOStream | T | 16.01 | - | 4.30
SOStream | Moon | 28.79 | - | 5.80
SOStream | Three Spirals | 0.84 | - | 1.35
DenStream | Horseshoe | 33.21 | 19.12 | 4.15 ¹
DenStream | T | 41.79 | 27.09 | 4.13 ¹
DenStream | Moon | 65.13 | 41.21 | 4.65 ¹
DenStream | Three Spirals | 0.36 | 0.27 | 0.15 ¹
¹ Samples in the online phase.