Article

An Improved Similarity-Based Clustering Algorithm for Multi-Database Mining

School of Computer Science, Wuhan University, Wuhan 430072, China
* Authors to whom correspondence should be addressed.
Entropy 2021, 23(5), 553; https://doi.org/10.3390/e23050553
Submission received: 7 April 2021 / Revised: 22 April 2021 / Accepted: 26 April 2021 / Published: 29 April 2021

Abstract

Clustering algorithms for multi-database mining (MDM) rely on computing $(n^2-n)/2$ pairwise similarities between $n$ multiple databases to generate and evaluate $m \in [1, (n^2-n)/2]$ candidate clusterings in order to select the ideal partitioning that optimizes a predefined goodness measure. However, when these pairwise similarities are distributed around the mean value, the clustering algorithm becomes indecisive when choosing what database pairs are considered eligible to be grouped together. Consequently, a trivial result is produced by putting all the $n$ databases in one cluster or by returning $n$ singleton clusters. To tackle the latter problem, we propose a learning algorithm to reduce the fuzziness of the similarity matrix by minimizing a weighted binary entropy loss function via gradient descent and back-propagation. As a result, the learned model will improve the certainty of the clustering algorithm by correctly identifying the optimal database clusters. Additionally, in contrast to gradient-based clustering algorithms, which are sensitive to the choice of the learning rate and require more iterations to converge, we propose a learning-rate-free algorithm to assess the candidate clusterings generated on the fly in fewer upper-bounded iterations. To achieve our goal, we use coordinate descent (CD) and back-propagation to search for the optimal clustering of the $n$ multiple databases in a way that minimizes a convex clustering quality measure $L(\theta)$ in less than $(n^2-n)/2$ iterations. By using a max-heap data structure within our CD algorithm, we optimally choose the largest weight variable $\theta_{p,q}^{(i)}$ at each iteration $i$ such that taking the partial derivative of $L(\theta)$ with respect to $\theta_{p,q}^{(i)}$ allows us to attain the next steepest descent minimizing $L(\theta)$ without using a learning rate. Through a series of experiments on multiple database samples, we show that our algorithm outperforms the existing clustering algorithms for MDM.

1. Introduction

Large multi-branch companies need to analyze multiple databases to discover useful patterns for the decision-making process. To make global decisions for the entire company, the traditional approach suggests merging and integrating the local branch databases into a huge data warehouse, and then applying data mining algorithms [1] to the accumulated dataset to mine the global patterns useful for all the branches of the company. However, there are some limitations associated with this approach. For instance, the cost of moving the data over the network, and of integrating and storing potentially heterogeneous databases, could be high. Moreover, some branches may not accept sharing their raw data due to the underlying privacy issues. More crucially, integrating a large amount of irrelevant data can easily disguise some essential patterns hidden in multiple databases. To tackle these problems, it is suggested to keep the transactional data stored locally and only forward the local patterns mined at each branch database to a central site, where they are clustered into disjoint cohesive pattern-base groups for knowledge discovery. In fact, analyzing the local patterns present in each individual cluster of the multiple databases (MDB) enhances the quality of aggregating novel relevant patterns, and also facilitates the parallel maintenance of the obtained database clusters.

Various clustering algorithms and models have been introduced in the literature, namely spectral-based models [2], hierarchical [3], partitioning [4], competitive learning-based models [5,6,7] and artificial neural network (ANN)-based clustering [8,9,10]. Additionally, clustering can be applied in many domains [11,12], including community discovery in social networks [13,14], image segmentation [15,16] and recommendation systems [17,18,19]. In this article, we focus on exploring similarity-based clustering models for multi-database mining [20,21,22,23], due to their stability, simplicity [24] and robustness in partitioning graphs of $n$ multiple databases into $k$ connected components consisting of similar database objects. Nevertheless, the existing clustering quality measures in [20,21,22,23] are non-convex objectives suffering from the existence of local optima. Consequently, identifying the optimal clustering may be a difficult task, as it requires evaluating all the candidate clusterings generated at all the local optima in order to find the ideal clustering.
To address the issues associated with clustroid initialization, preselection of a suitable number of clusters and non-convexity of the clustering quality objectives, we proposed in [25,26] an algorithm named GDMDBClustering, which minimizes a quasi-convex loss function quantifying the quality of the multi-database clustering, without a priori assumptions about which number of clusters should be chosen. Therefore, in contrast to the clustering models proposed in [20,21,22,23], GDMDBClustering [25] does not require us to produce and assess all the possible candidate classifications in order to find the optimal partitioning. Alternatively, each partitioning is assessed on the fly as it is generated and the clustering algorithm terminates right after attaining the global minimum of the objective function. However, the existing gradient-based clustering algorithms [25,26] are strongly dependent on the choice of the learning rate η , which influences the number of learning cycles required to find the optimal partitioning. In fact, selecting a larger η value may cause global minimum overshooting and setting a smaller η value may necessitate many learning iterations for the algorithm to converge.
In this paper, we improve upon previous work [25,26] and propose a learning-rate-free (i.e., independent of the learning rate $\eta$) algorithm requiring fewer upper-bounded iterations (i.e., the maximum number of iterations is at most $(n^2-n)/2$) to minimize a convex clustering loss function $L(\theta)$ using coordinate descent (CD) and back-propagation. Precisely, our proposed algorithm minimizes a quadratic hinge-based loss $L(\theta)$ over the first largest coordinate variable $\theta_{p,q}$ while keeping the rest of the $\binom{n}{2}-1$ variables fixed. Then, it minimizes $L(\theta)$ over the second largest coordinate variable while keeping the rest of the $\binom{n}{2}-1$ variables fixed, and so on until convergence or until cycling through all the $\binom{n}{2}$ coordinate variables. Consequently, our algorithm is faster than GDMDBClustering [25], which is dependent on a learning rate and also requires minimizing the cost over a large set of variables at each iteration. The latter can be a very challenging problem in contrast to minimizing the loss over one single variable at a time while keeping all the other dimensions fixed.
On the other hand, existing clustering algorithms for multi-database mining (MDM) [20,21,22,23,25,26] proceed by computing $(n^2-n)/2$ pairwise similarities $sim(D_p, D_q) \in [0,1]$ between $n$ multiple databases, and then use these values to generate and evaluate $m \in [1, (n^2-n)/2]$ candidate clusterings in order to select the ideal partitioning optimizing a given goodness measure. However, when the entries of the similarity matrix $[sim(D_p, D_q)]_{n \times n}$ ($p = 0, \ldots, n-2$, $q = p+1, \ldots, n-1$) are distributed around the mean value $\mu = 0.5$, the fuzziness index of the similarity matrix increases and the clustering algorithm becomes uncertain when choosing what database pairs are considered similar and hence eligible to be put into the same cluster. Consequently, a trivial result is produced, i.e., putting all the $n$ databases in one cluster or returning $n$ singleton clusters. To tackle the latter problem, we propose a learning algorithm to reduce the fuzziness in the pairwise similarities by minimizing a weighted binary entropy loss function $H(\cdot)$ via gradient descent and back-propagation. Precisely, the learned model will force the similarity values above 0.5 to move closer to their maximum value ($\approx$1), and let those below 0.5 move closer to their minimum value ($\approx$0) in a way that minimizes $H(\cdot)$. This will significantly reduce the associated fuzziness and improve the certainty of the clustering algorithm to correctly identify the optimal database clusters. The main contributions of this article are listed as follows:
• Unlike the existing algorithms proposed in [20,21,22,23,25,26], where one-class trivial clusterings are produced when the similarity values are centered around the mean value, we have added a preprocessing layer prior to clustering where the pairwise similarities are adjusted to reduce the associated fuzziness and hence improve the quality of the produced clustering. Our experimental results show that reducing the fuzziness of the similarity matrix helps generate meaningful, relevant clusters that differ from the one-class trivial clusterings.
• Unlike the multi-database clustering algorithms proposed in [20,21,22,23], our approach uses a convex objective function $L(\theta)$ to assess the quality of the produced clustering. This allows our algorithm to terminate just after attaining the global minimum of the objective function (i.e., after exploring fewer similarity levels). Consequently, this avoids generating unnecessary candidate clusterings, and hence reduces the CPU overhead. On the other hand, the clustering algorithms in [20,21,22,23] use non-convex objectives (i.e., they suffer from the existence of local optima due to the use of more than two monotonic functions), and therefore require generating and evaluating all the $(n^2-n)/2$ local candidate clustering solutions in order to find the clustering located at the global optimum.
• Furthermore, unlike the previous gradient-based clustering algorithms [25,26], our proposed algorithm is learning-rate-free (i.e., independent of the learning rate), and needs at most (in the worst case) $(n^2-n)/2$ iterations to converge. That is why our proposed algorithm is faster than GDMDBClustering [25], which is strongly dependent on the learning step size $\eta$ and its decay rate.
• Additionally, unlike the similarity measure proposed in [20], which assumes that the same threshold was used to mine the local patterns from the $n$ transactional databases, our proposed similarity measure takes into account the existence of $n$ different local thresholds, which are then combined to calculate a new threshold for each cluster. Afterward, using the new thresholds, our similarity measure accurately estimates the valid patterns post-mined from each cluster in order to compute the $(n^2-n)/2$ pairwise similarities.
• The experiments carried out on real, synthetic and randomly generated datasets show that the proposed clustering algorithm outperforms the compared clustering models in [20,21,22,23,25,26], as it has the shortest average running time and the lowest average clustering error.
The remainder of this paper is organized as follows: Section 2 presents an example motivating the importance of clustering for multi-database mining (MDM) and also reviews traditional clustering algorithms for MDM. Section 3 defines the main concepts related to similarity-based clustering and then introduces the proposed approach and its main components. Section 4 presents and analyzes the experimental results. Finally, Section 5 draws conclusions and highlights potential future work.

2. Motivation and Related Work

2.1. Motivating Example

Prior to mining the multiple databases (MDB) of a multi-branch enterprise, it is essential to cluster these MDB into disjoint and cohesive pattern-base groups sharing an important number of local patterns in common. Then, using local pattern analysis and pattern synthesizing techniques [27,28,29,30], one can examine the local patterns in each individual cluster to discover novel patterns, including the exceptional patterns [31] and the high-vote patterns [32], which are extremely useful when it comes to making special targeted decisions regarding each cluster of branches of the same corporation. In the following example, we show the impact of clustering the multi-databases of a multi-branch corporation prior to multi-database mining. Consider the six transactional databases $D = \bigcup_{p=1}^{6}\{D_p\}$ shown in Table 1, where each database $D_p$ records a set of transactions enclosed in parentheses and each transaction contains a set of items separated by commas. Consider a minimum support threshold $\alpha = 0.5$. The local frequent itemsets, denoted by $FIS(D_p, \alpha)$ and discovered from each database $D_p$, are shown in Table 2, such that $I_k$ in each tuple $\langle I_k, supp(I_k, D_p) \rangle$ of $FIS(D_p, \alpha)$ is the frequent itemset name and $supp(I_k, D_p)$, named support, is the ratio of the number of transactions in $D_p$ containing $I_k$ to the total number of transactions in $D_p$.
Now, the global support of each itemset $I_k \in \bigcup_{p=1}^{6}\{FIS(D_p, 0.5)\}$ is calculated via the synthesizing equation [33] defined as follows:
$$ supp(I_k, D) = \frac{\sum_{p=1}^{n} |D_p| \times supp(I_k, D_p)}{\sum_{p=1}^{n} |D_p|} \tag{1} $$
where $n = 6$ is the total number of databases in $D$ and $|D_p|$ is the number of transactions in $D_p$. For instance, we can calculate the global support of the itemset A as follows:
$$ supp(A, D) = \frac{0.75 \times 4 + 0.8 \times 5 + 0.5 \times 4 + 0 \times 3 + 0 \times 4 + 0 \times 4}{4 + 5 + 4 + 3 + 4 + 4} = 0.375 < \alpha $$
After computing the global supports of the rest of the itemsets using (1), not a single novel pattern is found, i.e., $\forall I_k \in \bigcup_{p=1}^{6}\{FIS(D_p, 0.5)\},\ supp(I_k, D) < 0.5$. The reason is that irrelevant patterns were involved in the synthesizing procedure. Now, if we examine the frequent itemsets in Table 2, we observe that some databases share many patterns in common. Precisely, the six databases seem to form two clusters, $C_1 = \{D_1, D_2, D_3\}$ and $C_2 = \{D_4, D_5, D_6\}$, where each cluster of databases tends to share similar frequent itemsets.
Next, let us use the synthesizing Equation (1) on the frequent itemsets coming from every single cluster $C_i$, such that $4 \leq p \leq 6 = n$ for cluster $C_2$ and $1 \leq p \leq 3 = n$ for cluster $C_1$. This time, new valid frequent itemsets having a support value above the minimum threshold $\alpha$ are discovered in the two clusters. In fact, $FIS(C_2, 0.5) = \{\langle FH, 0.727\rangle, \langle F, 0.727\rangle, \langle H, 0.818\rangle\}$ and $FIS(C_1, 0.5) = \{\langle C, 0.769\rangle, \langle B, 0.769\rangle, \langle A, 0.692\rangle\}$. The obtained patterns show that more than $69\%$ of the total transactions in the cluster $C_1$ include the itemsets C, B and A, and more than $72\%$ of the total transactions in the cluster $C_2$ include FH, F and H. Moreover, some associations between itemsets can be derived as well; for instance, the itemset $\langle FH, 0.727\rangle \in FIS(C_2, 0.5)$ suggests that, on average, if a customer collects the item H at one of the branches in $C_2$, they are likely to also buy the item F with a $\frac{supp(FH, C_2)}{supp(H, C_2)} = 88.87\%$ confidence.
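As a side note, the synthesizing step of this example can be reproduced in a few lines of Python. The sketch below is our own illustration of Equation (1), not the authors' code; the supports of itemset A and the database sizes are taken from the example above.

def synthesize_support(local_supports, db_sizes):
    """Size-weighted average of the local supports, Equation (1)."""
    return sum(s * d for s, d in zip(local_supports, db_sizes)) / sum(db_sizes)

supports_A = [0.75, 0.8, 0.5, 0.0, 0.0, 0.0]   # supports of itemset A in D1..D6
sizes = [4, 5, 4, 3, 4, 4]                     # number of transactions in D1..D6

print(synthesize_support(supports_A, sizes))          # 0.375 over all six databases
print(synthesize_support(supports_A[:3], sizes[:3]))  # ~0.692 over cluster C1 = {D1, D2, D3}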
The above example demonstrates the importance of clustering the multi-databases into disjoint cohesive clusters before synthesizing the global patterns. In fact, when the local patterns mined from the six databases were analyzed all together, no global pattern could be synthesized. On the other hand, when the six databases were divided into two different clusters and each cluster was analyzed individually, useful and novel patterns (knowledge) were discovered. From the discovered knowledge, decision makers and stakeholders gain a clear vision of the branches that exhibit similar purchasing behaviors, and can hence take useful decisions accordingly. In fact, appropriate business decisions may be taken regarding each group of similar branches in order to predict potential purchasing patterns, increase the customer retention rate and convince customers to purchase more services in the future. Consequently, exploring and examining individual clusters of similar local patterns helps the discovery of new and relevant patterns capable of improving the decision-making quality.

2.2. Prior Work

The authors in [34] adopted a divide-and-conquer mono-database mining approach to accelerate mining global frequent itemsets (FIs) in large transactional databases. In [35,36], the authors proposed similar work where big transactional databases are divided into k disjoint transaction partitions whose sizes are small enough to be read and loaded into the random access memory. Then, the frequent itemsets (FIs) mined from all the k partitions are synthesized into global FIs using an aggregation function such as the one suggested by the authors in [33]. It is worth noting that for mono-database mining applications, we usually have direct access to the raw data stored in big transactional databases. On the other hand, for multi-database mining (MDM) applications, it is suggested to keep the transactional data stored locally and only forward the local patterns mined at each branch database to a central site where they will be clustered into disjoint cohesive pattern-base groups for knowledge discovery. As a result, the confidential raw data are kept safe, and the cost associated with transmitting a large amount of data over the network is eliminated. Hence, in contrast to clustering the transactional data stored in a single data warehouse, our approach consists of clustering the local patterns mined and forwarded from multi-databases without requiring the number of clusters to be set a priori. Our purpose is to identify the groups of databases that share similar patterns, such as the high-vote patterns [32] and the exceptional patterns [31,37,38], which can be used to make specific decisions regarding their corresponding branches. In the traditional clustering approach [34,35,36] applied to mono-database mining, we can only mine the global patterns that are supported by the whole multi-branch company.
The existing clustering algorithms for multi-database mining [20,21,23,39,40] are based on an agglomerative process that generates hierarchical partitionings at different levels of similarity, where each cluster in a given candidate partitioning is included in another cluster of a partitioning produced at the next similarity level. Despite this property, each candidate partitioning is produced without reusing the clusters generated at the previous similarity levels. As a result, the clustering algorithms in [20,21,23,39,40] unnecessarily reconstruct clusters that have already been built at previous similarity levels. This limitation inspired the authors in [22] to design a graph-based algorithm, which maintains the classes produced at prior similarity levels in order to produce new subsequent classes out of them. Although the experiments done in [22] showed promising results against the prior work [20,21,23,39,40], these algorithms are based on non-convex functions to evaluate the quality of the produced candidate clusterings. Consequently, finding the ideal clustering for which a non-convex function is optimal may be a difficult problem to solve in a short time.
To face the latter problem, the authors in [26] transformed the clustering problem into a quasi-convex optimization problem solvable via gradient descent and back-propagation. Consequently, an early stopping of the clustering process occurs right after converging to the global minimum. Hence, by avoiding the generation and evaluation of unnecessary candidate clusterings, the CPU execution time is significantly reduced. Even though traditional clustering algorithms such as k-means [4,41] are intuitive, popular and not hard to implement, they remain sensitive to clustroid initialization, the preselection of a suitable number of clusters and the non-convexity of the clustering quality objective [42]. The silhouette plot [43] could be used to find an appropriate number of clusters, but this requires executing k-means multiple times with different numbers of clusters in order to find the ideal partitioning maximizing the silhouette objective. As a result, the time performance will be affected in the case of clustering big high-dimensional datasets. Slightly differently, hierarchical clustering algorithms [3] build nested hierarchical levels to visualize the relationships between different objects in the form of dendrograms. Then, it is up to the domain expert or to some non-convex metrics to determine at which level the tree diagram should be cut.
Conversely, the optimization problem formulated in [25,26] is quasi-convex. Therefore, convergence to the global optimum is independent of the initial settings. Furthermore, the proposed gradient-based clustering GDMDBClustering [25] does not need the number of clusters as a parameter. Instead, the number of clusters becomes a parametric function in the main objective. However, GDMDBClustering is based on the choice of a suitable learning rate, i.e., choosing a small learning rate $\eta$ may increase the number of iterations and slow down learning the optimal weights, whereas a large $\eta$ may let the algorithm overshoot the global minimum. To overcome the latter limitation, we propose in this paper a learning-rate-free clustering algorithm, named CDClustering, which minimizes a convex objective function quantifying the clustering quality. For this purpose, we use coordinate descent (CD) and back-propagation to search for the optimal clustering of the $n$ multiple databases in less than $(n^2-n)/2$ iterations and without using a learning rate. This makes our algorithm faster than the previous gradient-based clustering algorithms [25,26], which remain dependent on a learning rate defined based on some prior knowledge of the properties of the loss function. On the other hand, due to the fuzziness of the similarity matrix, which increases when the pairwise similarities are distributed around the mean value, the clustering algorithm becomes indecisive when grouping similar databases together. To face this problem, we design a learning algorithm to adjust the pairwise similarities between the $n$ multiple databases in a way that minimizes a binary entropy loss function quantifying the fuzziness associated with the similarity matrix. Thus, the proposed algorithm becomes crisp in discriminating between the different database clusters.

3. Materials and Methods

In this section, we present our fuzziness reduction model applied to the pairwise similarities between $n$ multiple databases and describe our coordinate descent-based clustering approach in detail. Some definitions and notions relevant to this work are presented first.

3.1. Background and Relevant Concepts

In this subsection, we define the similarity measure between two transaction databases and present the process of generating and evaluating a given candidate clustering. We also define four clustering validity functions used to evaluate the clustering quality.

3.1.1. Similarity Measure

Each transactional database $D_p$ is encoded as a hash table defined as follows:
$$ FIS(D_p, \alpha_p) = \bigcup_{k=1}^{m} \big\{ \langle I_k, supp(I_k, D_p) \rangle \;\big|\; supp(I_k, D_p) \geq \alpha_p \big\} \tag{2} $$
where $p = 0, \ldots, n-1$, $n$ is the number of transactional databases, $m$ is the number of frequent itemsets in $D_p$, $I_k$ is the name of the $k$-th frequent itemset, $supp(I_k, D_p) \in [0,1]$ is the support of $I_k$, which is the ratio of the number of rows in $D_p$ containing $I_k$ to the total number of rows in $D_p$, and $\alpha_p \in [0,1]$ is the minimum support threshold corresponding to $D_p$, such that $supp(I_k, D_p) \geq \alpha_p$. In this paper, the FP-Growth algorithm [1] is used to mine the frequent itemsets in each database $D_p$, as it only requires two passes over the whole database. Our proposed similarity measure is based on maximizing the number of global frequent itemsets (FIs) synthesized from the local FIs in each cluster. Precisely, to measure the similarity between two transactional databases $D_p$ and $D_q$, for $p = 0, \ldots, n-2$, $q = p+1, \ldots, n-1$, we define the following function:
$$ sim(D_p, D_q) = \frac{\sum_{k,\, I_k \in FIS(D_p, \alpha_p)\, \cup\, FIS(D_q, \alpha_q)} \Psi(I_k, \{D_p, D_q\})}{\big|FIS(D_p, \alpha_p) \cup FIS(D_q, \alpha_q)\big|} \tag{3} $$
where
$$ \Psi(I_k, \{D_p, D_q\}) = \begin{cases} 1, & \text{if } supp(I_k, \{D_p, D_q\}) \geq \alpha_{p,q} \\ 0, & \text{otherwise} \end{cases} \tag{4} $$
such that
$$ supp(I_k, \{D_p, D_q\}) = \frac{supp(I_k, D_p) \times |D_p| + supp(I_k, D_q) \times |D_q|}{|D_p| + |D_q|} \tag{5} $$
and
$$ \alpha_{p,q} = \frac{\alpha_p \times |D_p| + \alpha_q \times |D_q|}{|D_p| + |D_q|} \tag{6} $$
We note that the operator $|\cdot|$ denotes the cardinality of the set passed in as argument. Multiplying $\alpha_p$ by $|D_p|$ gives the minimum number of transactions in which a frequent itemset $I_k$ should occur in $D_p$. Therefore, $\alpha_{p,q}$ is the minimum percentage of transactions from the cluster $C_{p,q} = \{D_p, D_q\}$ that must contain the itemset $I_k$, i.e., $supp(I_k, C_{p,q}) \geq \alpha_{p,q}$. In fact, the similarity measure $sim$ in Formula (3) takes into account the local minimum support threshold at each database to calculate a new threshold for each cluster. In this paper, instead of writing 'the similarity measure $sim$ in Formula (3)', we often write $sim$ (3).
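To make the measure concrete, the following Python sketch is one possible reading of Equations (3)-(6); representing each $FIS(D_p, \alpha_p)$ as a dictionary mapping itemset names to supports is our own implementation choice, not a structure prescribed by the paper.

def pair_support(I, fis_p, fis_q, size_p, size_q):
    """Synthesized support of itemset I over the pair {Dp, Dq}, Equation (5);
    an itemset absent from a database is treated as having support 0."""
    sp, sq = fis_p.get(I, 0.0), fis_q.get(I, 0.0)
    return (sp * size_p + sq * size_q) / (size_p + size_q)

def sim(fis_p, fis_q, size_p, size_q, alpha_p, alpha_q):
    """Similarity between Dp and Dq following Equations (3), (4) and (6)."""
    alpha_pq = (alpha_p * size_p + alpha_q * size_q) / (size_p + size_q)   # Equation (6)
    union = set(fis_p) | set(fis_q)
    if not union:
        return 0.0
    valid = sum(1 for I in union                                           # Psi, Equation (4)
                if pair_support(I, fis_p, fis_q, size_p, size_q) >= alpha_pq)
    return valid / len(union)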

3.1.2. Clustering Generation and Evaluation

Let $C(D, \delta_i) = \{C_1, C_2, \ldots, C_k\}$ be a candidate clustering of $D = \{D_0, D_1, \ldots, D_{n-1}\}$ produced at a given level of similarity $\delta_i \in [0,1]$, such that $\bigcap_{j=1}^{k} C_j = \emptyset$ and $\bigcup_{j=1}^{k} C_j = D$. From a graph-theoretic perspective, each cluster $C_j$ represents a connected component in a similarity graph $G = (D, E)$, and an edge $(D_p, D_q)$ is added to the list of edges $E$ if and only if $sim(D_p, D_q) \geq \delta_i$, where $p = 0, \ldots, n-2$, $q = p+1, \ldots, n-1$.
Initially, $G = (D, E)$ has no edges, i.e., $E = \emptyset$. Then, at a given similarity level $\delta_i \in [0,1]$, the edges $(D_p, D_q)$ satisfying $sim(D_p, D_q) \geq \delta_i$ are added to $E$. The level of similarity $\delta_i$ ($i = 1, \ldots, m$) is chosen from the list of the $m$ unique sorted pairwise similarities $sim(D_p, D_q)$ computed between the $n$ transactional databases, such that $\delta_1 > \delta_2 > \cdots > \delta_{i-1} > \delta_i > \delta_{i+1} > \cdots > \delta_m$ and $m \leq (n^2-n)/2$. After adding all the edges $(D_p, D_q)$ at $\delta_i$, each graph component $C_j$ ($j = 1, \ldots, k$) represents one database cluster in our candidate partitioning $C(D, \delta_i)$. One can then use one of the clustering goodness measures shown in Table 3 to assess the quality of $C(D, \delta_i)$.
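As an illustration of this generation step, the sketch below thresholds a similarity matrix at a level $\delta$ and returns the connected components via a breadth-first search; it is a simplified stand-in for the graph construction described here, not the disjoint-set implementation used later in Algorithm 2.

from collections import deque

def candidate_clustering(sim_matrix, delta):
    """Return the clusters (lists of database indices) of C(D, delta):
    connected components of the graph whose edges satisfy sim >= delta."""
    n = len(sim_matrix)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:                           # breadth-first search
            p = queue.popleft()
            component.append(p)
            for q in range(n):
                if q not in seen and sim_matrix[p][q] >= delta:
                    seen.add(q)
                    queue.append(q)
        clusters.append(component)
    return clusters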
Once we generate and evaluate all the $m \leq (n^2-n)/2$ candidate clusterings, we report the global optimum (minimum or maximum) of the goodness measure and compare its corresponding clustering with the ground truth if it is known, or with the clustering generated at the maximum point of the silhouette coefficient when the ground truth is unknown. In fact, the silhouette coefficient $SC(D) \in [-1, 1]$ proposed in [43,44] (see the last row in Table 3) could be used to verify the correctness of the cluster labels assigned to the $n$ transactional databases. Precisely, a value $SC(D) \approx 1$ suggests that the $n$ transactional databases are highly matched to their own clusters and loosely matched to their neighboring clusters.
We should note that each clustering goodness measure in Table 3 depends on more than two monotonic functions. For instance, the quality measure $goodness$ (see the first row in Table 3) proposed in [20] is based on maximizing both the intra-cluster similarity $W(D)$ (which is a non-decreasing function on the interval [0,1]) and the inter-cluster distance $B(D)$ (which is a non-increasing function on the interval [0,1]), while minimizing the number of clusters $f(D)$ (which is a non-increasing function on the interval [0,1]). Consequently, as shown by the experiments done in [25,26], the graphs of the objective functions in Table 3 most of the time exhibit a non-convex behavior, which makes identifying the ideal partitioning a hard problem to solve without generating and evaluating all the candidate clusterings generated at the local optima.

3.2. Similarity Matrix Fuzziness Reduction

In this subsection, we present our fuzziness reduction model applied to the pairwise similarities between $n$ multiple databases. Let $z_{p,q} = \theta_{p,q} \times x_{p,q}$ be a weighted similarity, such that $x_{p,q} = sim(D_p, D_q)$ is the similarity value between $D_p$ and $D_q$ computed using Formula (3) and $\theta_{p,q}$ is the weight value associated with $x_{p,q}$, where $p = 0, \ldots, n-2$, $q = p+1, \ldots, n-1$. Let $g: \mathbb{R} \to\ ]0, 1[$ be a continuous piecewise linear activation function and $\frac{\partial g}{\partial z_{p,q}}$ be its partial derivative, defined as follows:
$$ g(z_{p,q}, \epsilon) = \max(z_{p,q}, \epsilon) - \frac{\operatorname{sgn}(z_{p,q} - 1 + \epsilon) + 1}{2} \left( z_{p,q} - 1 + \epsilon \right) \tag{7} $$
$$ \frac{\partial g(z_{p,q}, \epsilon)}{\partial z_{p,q}} = \frac{\operatorname{sgn}(z_{p,q} - \epsilon) + 1}{2} - \frac{\operatorname{sgn}(z_{p,q} - 1 + \epsilon) + 1}{2} \tag{8} $$
The graph plots of $g(z_{p,q}, \epsilon)$ and $\frac{\partial g(z_{p,q}, \epsilon)}{\partial z_{p,q}}$ with respect to $z_{p,q}$ are depicted in Figure 1a. The parameter $\epsilon$ ensures that each value $z_{p,q}$ is within the range $[\epsilon, 1-\epsilon]$, such that $\epsilon$ is a very small number (e.g., $\epsilon = 10^{-7}$) forcing $g(z_{p,q}, \epsilon)$ to be always above 0 and below 1, so that it can be plugged into our log-based loss function defined in (10).

3.2.1. Fuzziness Index

The fuzziness index of the pairwise similarity vector $X^T = [sim(D_0, D_1), sim(D_0, D_2), \ldots, sim(D_{n-2}, D_{n-1})]$, also known as the entropy of the fuzzy set $X^T$ [45], and defined from $\mathbb{R}^{\binom{n}{2}}$ to $[0,1]$, is given as follows:
$$ Fuzziness(X) = -\frac{2}{n^2-n} \Big[ X^T \cdot \log_2(X) + (1 - X^T) \cdot \log_2(1 - X) \Big] = -\frac{2}{n^2-n} \sum_{p=0}^{n-2} \sum_{q=p+1}^{n-1} \Big[ sim(D_p, D_q) \log_2\!\big(sim(D_p, D_q)\big) + \big(1 - sim(D_p, D_q)\big) \log_2\!\big(1 - sim(D_p, D_q)\big) \Big] \tag{9} $$
The smaller the value of F u z z i n e s s ( X ) , the better the clustering performance, and vice-versa. In fact, reducing the fuzziness of the pairwise similarities will lead to a more crisp decision making when it comes to finding the optimal partitioning of the n multiple databases. Particularly, the fuzziness of the similarity matrix increases when the pairwise values are centered around 0.5 , resulting in more confusion when we need to decide whether two databases should be in the same cluster or not.
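A direct transcription of the fuzziness index (9) could look as follows; the similarity values are assumed to lie strictly inside (0, 1), as guaranteed by the activation function (7).

import math

def fuzziness(X):
    """Fuzziness index (9) of the pairwise-similarity vector X.
    Values of exactly 0 or 1 would require the 0 * log2(0) = 0 convention."""
    h = -sum(x * math.log2(x) + (1 - x) * math.log2(1 - x) for x in X)
    return h / len(X)

print(fuzziness([0.45, 0.50, 0.55]))   # close to 1: highly fuzzy similarities
print(fuzziness([0.05, 0.95, 0.99]))   # much smaller: crisper similarities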

3.2.2. Proposed Model and Algorithm

To reduce the fuzziness associated with the $(n^2-n)/2$ pairwise similarities between the $n$ transactional databases $D = \{D_0, D_1, \ldots, D_{n-1}\}$, we need to make the similarity values that are above the mean value $\mu = 0.5$ go closer to 1, and adjust the similarity values that are below $\mu = 0.5$ to go closer to 0. To do so, we consider the minimization of the sum of the binary entropy loss functions over the $(n^2-n)/2$ weighted similarity values $z_{p,q} = \theta_{p,q} \times x_{p,q}$ as follows:
$$ \arg\min_{\theta} H(\theta, \epsilon) = \arg\min_{\theta} \frac{2}{n^2-n} \sum_{p=0}^{n-2} \sum_{q=p+1}^{n-1} H\big(g(z_{p,q}, \epsilon)\big) = \arg\min_{\theta} -\frac{2}{n^2-n} \sum_{p=0}^{n-2} \sum_{q=p+1}^{n-1} \Big[ g(z_{p,q}, \epsilon) \log_2\!\big(g(z_{p,q}, \epsilon)\big) + \big(1 - g(z_{p,q}, \epsilon)\big) \log_2\!\big(1 - g(z_{p,q}, \epsilon)\big) \Big] = \arg\min_{\theta} -\frac{2}{n^2-n} \Big[ g(\theta \circ X, \epsilon)^T \cdot \log_2\!\big(g(\theta \circ X, \epsilon)\big) + \big(1 - g(\theta \circ X, \epsilon)\big)^T \cdot \log_2\!\big(1 - g(\theta \circ X, \epsilon)\big) \Big] \tag{10} $$
such that $n$ is the number of databases, $\theta^T = [\theta_{0,1}, \theta_{0,2}, \ldots, \theta_{n-2,n-1}]$ represents the model weight vector, $z_{p,q}$ represents the weighted similarity $\theta_{p,q} \times sim(D_p, D_q)$ and $g(z_{p,q}, \epsilon)$ is the activation function defined in (7). The graph plots of $H(g(z_{p,q}, \epsilon))$ and $\frac{\partial H(g(z_{p,q}, \epsilon))}{\partial g(z_{p,q}, \epsilon)}$ with respect to $g(z_{p,q}, \epsilon)$ are depicted in Figure 1b. Since the fuzziness of the similarity matrix is influenced by the weights associated with the pairwise similarities, the degree to which a pair of databases $(D_p, D_q)$ belongs to the same cluster can be changed by adjusting the corresponding weight $\theta_{p,q}$, which is learned by minimizing (10) via gradient descent and back-propagation. The training equations are derived as follows:
$$ \theta_{p,q} = \theta_{p,q} - \eta \frac{\partial H(\theta, \epsilon)}{\partial \theta_{p,q}} \tag{11} $$
where
$$ \frac{\partial H(\theta, \epsilon)}{\partial \theta_{p,q}} = -\frac{2}{n^2-n} \cdot \frac{\partial g(z_{p,q}, \epsilon)}{\partial z_{p,q}} \cdot \frac{\partial z_{p,q}}{\partial \theta_{p,q}} \cdot \log_2\!\left(\frac{g(z_{p,q}, \epsilon)}{1 - g(z_{p,q}, \epsilon)}\right) = -\frac{2}{n^2-n} \cdot \frac{\partial g(z_{p,q}, \epsilon)}{\partial z_{p,q}} \cdot x_{p,q} \cdot \log_2\!\left(\frac{g(z_{p,q}, \epsilon)}{1 - g(z_{p,q}, \epsilon)}\right) \tag{12} $$
Let $\eta_0$ and $epochs$ be the initial learning rate and the maximum number of learning iterations, respectively. At each epoch $i$, the current learning rate $\eta$ decreases as follows:
$$ \eta = \eta_0 \times (1 - i/epochs) \tag{13} $$
We note that selecting a large learning rate value may cause global minimum overshooting, whereas choosing a small learning rate may necessitate many iterations for the algorithm to converge. Hence, it is reasonable to let the learning rate decrease over time as the algorithm converges to the global minimum. In Figure 2 and Algorithm 1, we present in detail the framework and the algorithm of the proposed fuzziness reduction model. The proposed learning Algorithm 1 (SimFuzzinessReduction) keeps adjusting the weight vector $\theta$ by moving in the opposite direction to the gradient of the loss function $H(\theta, \epsilon)$ until it reaches the maximum number of iterations $epochs$ or until the magnitude of the gradient vector drops below the minimum value $\epsilon$. After convergence, we can feed the new similarity values $[g(\theta_{0,1} \times sim(D_0, D_1), \epsilon), g(\theta_{0,2} \times sim(D_0, D_2), \epsilon), \ldots, g(\theta_{n-2,n-1} \times sim(D_{n-2}, D_{n-1}), \epsilon)]$ to any similarity-based clustering algorithm in order to improve the quality of the produced clustering when the latter is trivial or irrelevant.
Algorithm 1: SimFuzzinessReduction
[Algorithm 1 pseudocode is provided as an image in the original article.]
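Since Algorithm 1 is provided above only as an image, here is a minimal Python sketch of the fuzziness-reduction loop as we read it from Equations (7)-(13); the variable names, the vectorized NumPy form and the exact stopping test are our assumptions rather than the authors' original code.

import numpy as np

def sim_fuzziness_reduction(X, eta0=1.0, epochs=1000, eps=1e-7):
    """Gradient descent on the weighted binary entropy H(theta, eps) of (10)."""
    X = np.asarray(X, dtype=float)           # the (n^2 - n)/2 pairwise similarities
    theta = np.ones_like(X)                  # one weight per similarity value
    for i in range(epochs):
        eta = eta0 * (1 - i / epochs)        # decaying learning rate, Equation (13)
        z = theta * X
        g = np.clip(z, eps, 1 - eps)         # piecewise-linear activation, Equation (7)
        dg_dz = ((z > eps) & (z < 1 - eps)).astype(float)   # its derivative, Equation (8)
        # Gradient of the entropy loss, Equation (12); 2/(n^2 - n) equals 1/len(X).
        grad = -(1.0 / len(X)) * dg_dz * X * np.log2(g / (1 - g))
        if np.linalg.norm(grad) < eps:       # gradient magnitude below eps: converged
            break
        theta -= eta * grad                  # weight update, Equation (11)
    return np.clip(theta * X, eps, 1 - eps)  # adjusted similarities g(theta * X, eps)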

3.3. Proposed Coordinate Descent-Based Clustering

In this subsection, we present and discuss our proposed loss function and our coordinate descent-based clustering approach in detail. Unlike the gradient-based clustering in [25,26], our algorithm is learning-rate-free and needs to run at most $(n^2-n)/2$ learning cycles to converge to the global minimum, where $n$ is the number of transactional databases. In fact, at each iteration, the largest coordinate variable $\theta_{p,q}$ is selected and popped from a max-heap data structure (initially built by pushing the $(n^2-n)/2$ pairwise similarities onto the heap). Then, we minimize our quadratic convex hinge-based loss $L(\theta)$ over $\theta_{p,q}$, which is adjusted by moving in the opposite direction to the gradient of $L(\theta)$. This process continues until a convergence test, defined later in this subsection, is satisfied. Each block of selected coordinate variables $\theta_{p,q}$ that have the same value forms a set of edges to be added to our graph $G = (D, E)$. Determining the disjoint connected components in $G$ after convergence allows us to discover the optimal database clusters maximizing the intra-cluster similarity and the inter-cluster distance.

3.3.1. Proposed Loss Function and Algorithm

In order to implement our coordinate descent-based clustering, we propose a quadratic version of the hinge loss $L(\theta): \mathbb{R}^{\binom{n}{2}} \to \big[0, \frac{n^2-n}{4}\big]$, which is a convex function (see the proof of Theorem 1) whose minimization problem is formulated as follows:
$$ \arg\min_{\theta^{(i)}} L(\theta^{(i)}) = \arg\min_{\theta^{(i)}} \sum_{r=0}^{n-2} \sum_{s=r+1}^{n-1} \frac{1}{2} \max\!\Big(0,\, 1 - g(\theta_{r,s}^{(i)})\Big)^2 \tag{14} $$
A simplified 3D graph plot of L ( θ ) is depicted in Figure 3.
Initially, the weight vector $\theta^T$ is set to the $\binom{n}{2}$ pairwise similarities $X^T = [sim(D_0, D_1), sim(D_0, D_2), \ldots, sim(D_{n-2}, D_{n-1})]$, and then each weight component of $\theta^T$ is pushed onto a max-heap data structure. At each iteration $i = 1, \ldots, \binom{n}{2}$, the weight $\theta_{p,q}^{(i)}$ ($p = 0, \ldots, n-2$, $q = p+1, \ldots, n-1$) associated with the current largest similarity value $sim(D_p, D_q)$ is popped from the max-heap and is updated as follows:
$$ \theta_{p,q}^{(i)} = \theta_{p,q}^{(i-1)} - \eta \frac{\partial L(\theta^{(i-1)})}{\partial \theta_{p,q}^{(i-1)}} = \theta_{p,q}^{(i-1)} + \eta \Big(1 - g(\theta_{p,q}^{(i-1)})\Big) \frac{\partial g(\theta_{p,q}^{(i-1)})}{\partial \theta_{p,q}^{(i-1)}} \tag{15} $$
such that $g: \mathbb{R} \to [0,1]$ is a differentiable activation function defined as follows:
$$ g(\theta_{p,q}) = \max(\theta_{p,q}, 0) - \frac{\operatorname{sgn}(\theta_{p,q} - 1) + 1}{2} \times (\theta_{p,q} - 1) \tag{16} $$
and its partial derivative with respect to the weight θ p , q is:
$$ \frac{\partial g(\theta_{p,q})}{\partial \theta_{p,q}} = \frac{\operatorname{sgn}(\theta_{p,q}) + 1}{2} - \frac{\operatorname{sgn}(\theta_{p,q} - 1) + 1}{2} \tag{17} $$
We note that $\operatorname{sgn}: \mathbb{R} \to \{-1, 1\}$ is the signum function. The usage of $g(\cdot)$ ensures that each weight $\theta_{p,q}$ is within the range $[0,1]$. As there is no learning rate and schedule to choose for our coordinate descent-based algorithm, we set $\eta$ to 1.
Theorem 1.
L ( θ ) (14) is convex satisfying the following inequality [46]:
$$ L\big((1-\varepsilon)\theta^{(i+1)} + \varepsilon\theta^{(i)}\big) \leq (1-\varepsilon) L(\theta^{(i+1)}) + \varepsilon L(\theta^{(i)}) \quad \text{for all } \theta^{(i+1)}, \theta^{(i)} \in \mathbb{R}^{\binom{n}{2}} \text{ with } \varepsilon \in [0,1] \tag{18} $$
Proof. 
To prove the convexity of L ( θ ) , we can show that its Hessian matrix H L is positive semi-definite as follows:
$$ H_L = \left[ \frac{\partial^2 L}{\partial \theta_{p,q}\, \partial \theta_{r,s}} \right] = \begin{bmatrix} \frac{\partial^2 L(\theta)}{\partial \theta_{0,1}^2} & \frac{\partial^2 L(\theta)}{\partial \theta_{0,1}\,\partial \theta_{0,2}} & \cdots & \frac{\partial^2 L(\theta)}{\partial \theta_{0,1}\,\partial \theta_{n-2,n-1}} \\ \frac{\partial^2 L(\theta)}{\partial \theta_{0,2}\,\partial \theta_{0,1}} & \frac{\partial^2 L(\theta)}{\partial \theta_{0,2}^2} & \cdots & \frac{\partial^2 L(\theta)}{\partial \theta_{0,2}\,\partial \theta_{n-2,n-1}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 L(\theta)}{\partial \theta_{n-2,n-1}\,\partial \theta_{0,1}} & \frac{\partial^2 L(\theta)}{\partial \theta_{n-2,n-1}\,\partial \theta_{0,2}} & \cdots & \frac{\partial^2 L(\theta)}{\partial \theta_{n-2,n-1}^2} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} $$
Since $H_L$ is positive semi-definite, satisfying $x^T H_L x \geq 0$ for all $x \in \mathbb{R}^{\binom{n}{2}}$, $L(\theta)$ is convex, and therefore guarantees convergence to the global minimum. □
In order to reach the global minimum of $L(\theta)$ (i.e., $\min L(\theta) = 0$), our learning algorithm needs to set the weight vector $\theta$ to $\mathbf{1}$ (i.e., the all-ones vector). Consequently, the intra-cluster similarity will reach its maximum value and all the $n$ databases will be put into the same cluster, resulting in a meaningless partitioning. Therefore, in order to prevent this scenario from occurring, we need to assess the clustering quality after popping from the max-heap all the coordinate variables that have the same weight $\theta_{p,q}$ (i.e., a block of weights having the same value). This corresponds to generating one candidate clustering by adding the list of edges $(D_p, D_q)$ satisfying $sim(D_p, D_q) \geq \theta_{p,q}$ to the graph $G = (D, E)$. Afterward, we need a stopping condition to terminate our algorithm if the current candidate clustering quality is judged to be the optimal one in terms of the intra-cluster similarity $W_{\theta^{(i)}}(D)$ and the number of clusters $f_{\theta^{(i)}}(D)$. For this purpose, we define the following quasi-convex loss function $\mathcal{L}(\theta)$ (distinct from the hinge loss $L(\theta)$ in (14)), evaluated at the $i$-th iteration:
$$ \mathcal{L}(\theta^{(i)}) = \frac{1}{2}\Big( f_{\theta^{(i)}}(D) - W_{\theta^{(i)}}(D) \Big)^2 = \frac{1}{2}\Big( f_{\theta^{(i)}}(D) - \varphi(\theta^{(i)})^T \cdot X \Big)^2 = \frac{1}{2}\Big( f_{\theta^{(i)}}(D) - \sum_{p=0}^{n-2}\sum_{q=p+1}^{n-1} sim(D_p, D_q) \times \varphi(\theta_{p,q}^{(i)}) \Big)^2 \tag{19} $$
where $\varphi: \mathbb{R}^{\binom{n}{2}} \to \{0,1\}^{\binom{n}{2}}$, $\varphi(\theta) = \frac{\operatorname{sgn}(\theta - 1) + 1}{2}$, applied element-wise.
Our algorithm terminates right after it reaches the global minimum of $\mathcal{L}(\cdot)$. In other words, if $\mathcal{L}(\theta^{(i)}) \leq \mathcal{L}(\theta^{(i-1)})$, then we continue updating the weight vector and the clustering labels, and save the optimal partitioning found so far. Otherwise, the algorithm terminates, as it has reached the global minimum $\mathcal{L}(\theta^{(i-1)})$, and the optimal partitioning saved so far is returned as the ideal clustering of the $n$ transactional databases. This stopping condition is only possible due to the quasi-convexity of $\mathcal{L}(\cdot)$.
Theorem 2.
$\mathcal{L}(\theta)$ (19) is quasi-convex, satisfying the following inequality [46]:
$$ \mathcal{L}\big((1-\varepsilon)\theta^{(i+1)} + \varepsilon\theta^{(i)}\big) \leq \max\big\{\mathcal{L}(\theta^{(i+1)}),\, \mathcal{L}(\theta^{(i)})\big\} \quad \text{for all } \theta^{(i+1)}, \theta^{(i)} \in \mathbb{R}^{\binom{n}{2}} \text{ with } \varepsilon \in [0,1] \tag{20} $$
Proof. 
To prove the quasi-convexity of $\mathcal{L}(\theta)$, we need to demonstrate the validity of (20). First, since $f_{\theta}(D)$ is a decreasing function on the range [0,1], it is both quasi-concave and quasi-convex, satisfying $f\big((1-\varepsilon)\theta^{(i+1)} + \varepsilon\theta^{(i)}\big) \leq \max\{f(\theta^{(i+1)}), f(\theta^{(i)})\}$ for all $\theta^{(i+1)}, \theta^{(i)} \in \mathbb{R}^{\binom{n}{2}}$ with $\varepsilon \in [0,1]$. Since $W_{\theta}(D)$ is an increasing function on the range [0,1], it is also both quasi-concave and quasi-convex, satisfying $W\big((1-\varepsilon)\theta^{(i+1)} + \varepsilon\theta^{(i)}\big) \geq \min\{W(\theta^{(i+1)}), W(\theta^{(i)})\}$ for all $\theta^{(i+1)}, \theta^{(i)} \in \mathbb{R}^{\binom{n}{2}}$ with $\varepsilon \in [0,1]$. By subtracting the two last inequalities, we obtain $f\big((1-\varepsilon)\theta^{(i+1)} + \varepsilon\theta^{(i)}\big) - W\big((1-\varepsilon)\theta^{(i+1)} + \varepsilon\theta^{(i)}\big) \leq \max\{f(\theta^{(i+1)}), f(\theta^{(i)})\} - \min\{W(\theta^{(i+1)}), W(\theta^{(i)})\}$. Since $f(\theta^{(i+1)}) \leq f(\theta^{(i)})$ and $W(\theta^{(i+1)}) \geq W(\theta^{(i)})$, the right side of the resulting inequality equals $f(\theta^{(i)}) - W(\theta^{(i)})$, which can be written as $\max\{f(\theta^{(i+1)}) - W(\theta^{(i+1)}),\ f(\theta^{(i)}) - W(\theta^{(i)})\}$. Finally, by squaring and dividing both sides of the inequality by 2, we obtain a variation on the Jensen inequality for quasi-convex functions [46] as defined in (20). Hence, $\mathcal{L}(\theta)$ is quasi-convex. □

3.3.2. Time Complexity Analysis

In this subsection, we analyze the time complexity of our coordinate descent-based clustering algorithm, presented in Algorithm 2 and named CDClustering, which depends on the two subroutines presented in Algorithm 3 (union) and Algorithm 4 (cluster). We note that the superscript $i$ enclosed in round brackets, i.e., $\theta_{p,q}^{(i)}$, is used to indicate the iteration number at which a given variable $\theta_{p,q}$ has been assigned a value. The proposed algorithm takes as argument the $\binom{n}{2}$ pairwise similarities $X^T = [sim(D_0, D_1), sim(D_0, D_2), \ldots, sim(D_{n-2}, D_{n-1})]$ and outputs the optimal clustering minimizing our proposed loss function $L(\theta)$ (14). First, the weight vector $\theta^T$ is initially set equal to $X^T$. Afterward, coordinate descent and back-propagation are used to search for the optimal weight vector $\theta^T$ minimizing our hinge-based objective $L(\theta)$. Through each learning cycle $i$, one coordinate variable $\theta_{p,q}$ is popped from a max-heap. Then, $\theta_{p,q}$ is updated by making the optimal step in the opposite direction to the gradient of $L(\theta)$. The weights $\theta_{p,q}$ ($p = 0, \ldots, n-2$, $q = p+1, \ldots, n-1$) attaining the maximum value of 1 will have their corresponding database pairs $(D_p, D_q)$ put into the same cluster. By using a max-heap data structure within our coordinate descent algorithm, we optimally choose the current largest variable $\theta_{p,q}^{(i)}$ at each iteration $i$, such that taking the partial derivative of our loss $L(\theta)$ with respect to $\theta_{p,q}$ allows us to attain the next steepest descent minimizing $L(\theta)$ without using a learning rate. This way, the maximum number of iterations required for our algorithm to converge is less than or equal to $(n^2-n)/2$, i.e., the number of pairwise similarities. Initially, the number of clusters $f_{\theta}(D)$ is set equal to the number of transactional databases $n$. Then, in order to keep track of the database clusters, their number $f_{\theta}(D)$ and their sizes, we implement a disjoint-set data structure [47], which consists of an array $A[0, \ldots, n-1]$ of $n$ integers managed by two main operations: cluster and union. Each cluster $C_p$ is represented by a tree whose root index $p$ satisfies $A[p] = -1$, and a database $D_q$ belonging to the cluster $C_p$ satisfies $A[q] = p$. Therefore, the cluster function is called recursively to find the label assigned to the database index $p$ (passed in as argument) by following the tree up to its root (i.e., the index satisfying $A[p] = -1$). On the other hand, the union procedure links two disjoint clusters $C_p$ and $C_q$ by making the root of the smaller tree point to the root of the larger one in $A[0, \ldots, n-1]$. The algorithms corresponding to union and cluster are presented in Algorithm 3 and Algorithm 4, respectively. Let $s = \binom{n}{2}$ be the size of the weight vector $\theta^T$. The time complexity of building the max-heap is $O(s)$, and the time complexity of the proposed Algorithm 2 (CDClustering) is $O(s + h\log_2(n))$, such that $h \in [1, s]$ is the number of learning cycles run until global minimum convergence and $O(\log_2(n))$ is the time complexity of one pop operation from the heap. The proposed model is also illustrated in Figure 4.
Since it is meaningless to return a single cluster consisting of all the $n$ databases, if the clustering obtained at step (10a) is trivial (i.e., all the $n$ databases are put together in one class or each single database stands alone in its own cluster), then we first need to run the model proposed in Figure 2 on the pairwise similarities to reduce the associated intrinsic fuzziness measured in (9). Afterward, we can apply the proposed model in Figure 4 to the new adjusted similarity values to obtain more relevant results.
Algorithm 2: CDClustering
[Algorithm 2 pseudocode is provided as an image in the original article.]
Algorithm 3: union
[Algorithm 3 pseudocode is provided as an image in the original article.]
Algorithm 4: cluster
[Algorithm 4 pseudocode is provided as an image in the original article.]
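Algorithms 2-4 are likewise provided only as images, so the following Python sketch gives one possible reading of the coordinate-descent clustering loop described in this subsection (max-heap over the pairwise similarities, disjoint-set array for the clusters, and the convergence test (19) as the stopping rule); it is an illustrative reconstruction, not the authors' exact implementation.

import heapq

def cluster(A, p):                      # Algorithm 4: find the root label of database p
    while A[p] != -1:                   # roots are marked with -1
        p = A[p]
    return p

def union(A, size, p, q):               # Algorithm 3: link the smaller tree under the larger one
    rp, rq = cluster(A, p), cluster(A, q)
    if rp == rq:
        return False
    if size[rp] < size[rq]:
        rp, rq = rq, rp
    A[rq] = rp
    size[rp] += size[rq]
    return True

def cd_clustering(pairs):
    """pairs: list of (sim(Dp, Dq), p, q) tuples for 0 <= p < q < n.
    Returns one cluster label per database."""
    n = 1 + max(max(p, q) for _, p, q in pairs)
    A, size = [-1] * n, [1] * n
    heap = [(-s, p, q) for s, p, q in pairs]     # max-heap via negated similarities
    heapq.heapify(heap)
    best = list(range(n))                        # start from n singleton clusters
    prev_loss = float("inf")
    W = 0.0                                      # intra-cluster similarity, sum of sim * phi(theta)
    while heap:
        s = -heap[0][0]
        # Pop the whole block of coordinates sharing the current largest similarity
        # and add the corresponding edges (their weights theta_{p,q} are driven to 1).
        while heap and -heap[0][0] == s:
            _, p, q = heapq.heappop(heap)
            W += s
            union(A, size, p, q)
        f = sum(1 for r in range(n) if A[r] == -1)   # current number of clusters
        loss = 0.5 * (f - W) ** 2                    # convergence test, Equation (19)
        if loss > prev_loss:
            break                                    # the global minimum was at the previous step
        prev_loss = loss
        best = [cluster(A, r) for r in range(n)]     # save the best partitioning so far
    return best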

4. Performance Evaluation

To assess the performance of the proposed clustering algorithm, we carried out numerous experiments on real and synthetic datasets, including Zoo [48], Iris [48], Mushroom [48] and T10I4D100K [49]. To simulate a multi-database environment, we partitioned each dataset horizontally into $n$ partitions $D_1, D_2, \ldots, D_n$, such that $n \in \{12, 10, 6, 4\}$. Afterward, given a minimum support threshold $\alpha \in \{0.5, 0.2, 0.03\}$, we ran FP-Growth [1] on each partition $D_i$ ($i = 1, \ldots, n$) to discover the local frequent itemsets (FIs) corresponding to each partition. All the details related to the partition sizes and their corresponding FIs are shown in Table A1. We note that the fifth column of Table A1 reports the number of FIs discovered in the entire dataset, whereas the rightmost column of the same table reports the number of FIs aggregated from the local FIs mined from the partitions in each cluster.
The proposed similarity measure $sim$ (3) is called on the $(n^2-n)/2$ pairs of FIs to compute the $n \times n$ similarity matrices shown in Figure A1a, Figure A2a, Figure A3a, Figure A4a, Figure A5a, Figure A6a, and Figure A7a. Next, using the obtained pairwise similarities, candidate clusterings are produced via the process described in Section 3.1.2, and then evaluated using the clustering quality measures defined in Table 3, including $SC(D)$ [43], $goodness_3(D)$ [21], $goodness_2(D)$ [23], $goodness(D)$ [20] and our proposed loss function $L(\theta)$ (14). The graphs corresponding to the studied goodness measures are shown in Figure A1b, Figure A2b, Figure A3b, Figure A4b, Figure A5b, Figure A6b, and Figure A7b, where the optimal point (maximum or minimum) of each objective function is depicted as a black dot on its corresponding graph, except that for the graph of our loss function $L(\theta)$, there is a red dot representing the value $L(\arg\min_{\theta} \mathcal{L}(\theta))$ (i.e., the optimal point at which our algorithm terminates). It is worth mentioning that due to scale differences, we sometimes multiply or divide our loss function $L(\theta)$, $goodness_3(D)$ [21] and $goodness_2(D)$ [23] by a scaling number to stretch or shrink their graphs in the direction of the y-axis. The experimental results depicted in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7 are summarized in Table A2, such that $\delta \in [0,1]$ is the ideal similarity threshold for which a goodness measure attains its optimal point. Python version 3.9.2 was used to implement all the algorithms, and the code was run on an Ubuntu 20.04 server equipped with an Intel(R) Xeon(R) CPU clocked at 2.30 GHz with 50 GB of available disk capacity and 12 GB of available RAM.

4.1. Similarity Accuracy Analysis

To demonstrate the efficiency of $sim$ (3), consider three transactional databases with $|D_1| = 200$, $|D_2| = 300$ and $|D_3| = 200$, and their corresponding local frequent itemsets $FIS(D_1, 0.2) = \{\langle C, 0.2\rangle, \langle B, 0.2\rangle, \langle A, 0.2\rangle\}$, $FIS(D_2, 0.15) = \{\langle E, 0.9\rangle, \langle C, 0.2\rangle, \langle B, 0.2\rangle, \langle A, 0.2\rangle\}$ and $FIS(D_3, 0.25) = \{\langle E, 0.9\rangle\}$, mined at different minimum support threshold values $\alpha_1 = 0.2$, $\alpha_2 = 0.15$ and $\alpha_3 = 0.25$, respectively. Now, clustering the three databases using the algorithm BestDatabaseClustering [22], equipped with two different similarity measures, $sim_i$ proposed in [20] and our proposed similarity measure $sim$ (3), gives the results reported in Table 4. We note that $goodness$ [20] is a clustering quality measure, such that the higher the value of $goodness$ for a given candidate clustering $C$, the better the quality of $C$.
From Table 4, we notice that using our similarity measure $sim$ (3), we obtain a larger intra-cluster similarity, a larger inter-cluster distance and a larger $goodness$ [20]. Now, let us synthesize the global frequent itemsets from the clusters containing more than one database, i.e., $C_{2,3} = \{D_2, D_3\}$ and $C_{1,2} = \{D_1, D_2\}$. The obtained results are shown in Table 5, such that $\alpha_{2,3} = \frac{300 \times 0.15 + 200 \times 0.25}{300 + 200} = 0.19$ and $\alpha_{1,2} = \frac{200 \times 0.2 + 300 \times 0.15}{200 + 300} = 0.17$ are the minimum support thresholds corresponding to $C_{2,3}$ and $C_{1,2}$, respectively. As we can see, the similarity measure $sim_i$ [20] captures only high-frequency itemsets ($supp \approx 1$), such as E, and neglects low-support frequent itemsets (i.e., those whose supports are immediately above the minimum threshold $\alpha$, with $supp \in [\alpha, \alpha + \epsilon]$ and $\epsilon$ a very small number), such as A, B and C. This characteristic gives a high similarity value to database pairs sharing only one or very few high-frequency itemsets. On the other hand, database pairs sharing many frequent itemsets with a low support will be assigned a lower similarity. However, once the clustering is done, we will be interested in the patterns discovered from each cluster individually, such as the high-vote patterns [32] and the exceptional patterns [31]. That is why our similarity measure estimates the patterns post-mined from each cluster $C_{p,q} = \{D_p, D_q\}$ in order to compute $sim(D_p, D_q)$. Since our similarity measure focuses on maximizing the number of frequent itemsets synthesized from each cluster $C_{p,q} \subseteq D$, only relevant clusters will be assigned a large similarity value.
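For reference, the merged thresholds quoted above follow directly from Equation (6); the short snippet below simply re-derives them.

def merged_threshold(alpha_p, size_p, alpha_q, size_q):
    """Cluster-level minimum support threshold, Equation (6)."""
    return (alpha_p * size_p + alpha_q * size_q) / (size_p + size_q)

print(merged_threshold(0.15, 300, 0.25, 200))   # alpha_{2,3} = 0.19
print(merged_threshold(0.20, 200, 0.15, 300))   # alpha_{1,2} = 0.17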

4.2. Fuzziness Reduction Analysis

To demonstrate the importance of reducing the fuzziness associated with a similarity matrix, we run the clustering algorithm BestDatabaseClustering [22] on the two similarity matrices in Figure 5a and Figure 6a. The obtained results in terms of the optimal clustering, $\max goodness(D)$ [20], the optimal similarity level $\delta_{opt}$ (i.e., the similarity level at $\max goodness(D)$) and the silhouette coefficient $SC(D)$ [43] at $\delta_{opt}$ are shown in Figure 5b,c and Figure 6b,c, corresponding to rows 1 and 2 of Table 6, respectively. From the obtained results, we can clearly see that when the similarity matrices are centered around the mean value 0.5, the fuzziness index becomes larger and closer to 1, and BestDatabaseClustering [22] could not return a meaningful clustering, since it has put all the $n$ databases into the same cluster.
Now, let us run our fuzziness reduction model on the previous similarity matrices and depict the adjusted similarity matrices in Figure 7a and Figure 8a, respectively. Afterward, we run BestDatabaseClustering [22] on the new similarity matrices and show the clustering results in Figure 7b,c and Figure 8b,c, corresponding to rows 3 and 4 of Table 6, respectively. As we can see, after reducing the fuzziness index associated with the previous similarity matrices in Figure 5a and Figure 6a, the algorithm BestDatabaseClustering [22] was able to produce meaningful non-trivial clusterings, with an increase in the silhouette coefficient $SC(D)$ [43] for both similarity matrices in Figure 7a and Figure 8a.

4.3. Convexity and Clustering Analysis

In this part of our experiments, we analyze the convex behavior of the proposed clustering quality functions $\mathcal{L}(\theta)$ (19) and $L(\theta)$ (14), and we also examine the non-convexity of the existing goodness measures in [20,21,23,43]. Additionally, we compare the clustering produced by our algorithm and the ones generated at the optimal points of the previous compared goodness measures (i.e., at $\max goodness(D)$ [20], $\min goodness_2(D)$ [23] and $\max goodness_3(D)$ [21]) with the underlying ground-truth cluster labels. When the actual clustering is unknown, we replace it with the partitioning obtained at the maximum value of the silhouette coefficient [43], that is, at $\max SC(D)$. All the graphs corresponding to our loss functions and the compared goodness measures in Table 3 are plotted in Figure A1b, Figure A2b, Figure A3b, Figure A4b, Figure A5b, Figure A6b, and Figure A7b, where the x-axis represents the similarity levels $\delta$ at which multiple candidate clusterings are generated and evaluated.
Consider the $7 \times 7$ similarity matrix shown in Figure A1a. From the graphs plotted in Figure A1b and according to the results shown in the first row of Table A2, we can see that using our loss function $L(\theta)$ and $goodness(D)$ [20], we were able to find the optimal clustering $\{C_1 = \{D_3, D_2, D_1\}, C_2 = \{D_4\}, C_3 = \{D_7, D_6, D_5\}\}$ at a similarity level $\delta = 0.44$, where the silhouette coefficient reaches its maximum value $SC(D) = 0.46$. On the other hand, $goodness_3(D)$ [21] and $goodness_2(D)$ [23] did not successfully discover the partitioning maximizing the silhouette coefficient. Additionally, we observe that the proposed convergence test function $\mathcal{L}(\theta)$ has a quasi-convex behavior (see the proof of Theorem 2). This allows us to terminate the clustering process right after reaching the global minimum. Conversely, the graphs corresponding to $goodness_2(D)$ [23] and $goodness(D)$ [20] have local optima. Consequently, it is required to explore about $(n^2-n)/2$ similarity levels in order to generate and evaluate all the possible candidate clusterings.
Now, let us examine the results of some experiments that we have conducted on the synthetic and real-world datasets shown in Table A1. From Figure A2b and Figure A7b (the last and second rows of Table A2), we observe that $goodness_3(D)$ [21] and $goodness_2(D)$ [23] attain their optimal values when all the partition databases are clustered together in one class. The same phenomenon is observed in Figure A3b, Figure A6b and Figure A7b (the last, the sixth and the third rows of Table A2), where both $goodness_2(D)$ [23] and $goodness(D)$ [20] have put all the databases into one cluster.
In contrast, the proposed loss function $L(\theta)$ has successfully identified the clustering for which the silhouette coefficient $SC$ is maximum. Precisely, in Figure A7b (which corresponds to the last row of Table A2), $L(\theta)$ was the only clustering quality measure that properly identified the ideal 7-class clustering at $\delta = 0.846$.
From the obtained graphs in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7, we notice that $goodness_3(D)$ [21], $goodness_2(D)$ [23] and $goodness(D)$ [20] are neither quasi-concave nor quasi-convex on the domain $[0,1]$. As a result, we observe the existence of local optimum points on their corresponding graphs, which makes the search for the global optimum a difficult problem to solve without exploring all the local solutions.
Conversely, we observe that our loss function $L(\theta)$ (14) is monotonically decreasing all the time, with $L(\hat{\theta}) = 0$ at $\hat{\theta} = \arg\min_{\theta} L(\theta) = \mathbf{1}$. This corresponds to the similarity level $\delta = 0$, where all the $n$ databases are put into the same single cluster. To prevent this case from occurring, we used the quasi-convex function $\mathcal{L}(\theta)$ (19) as a convergence test function to terminate our algorithm at the point $L(\arg\min_{\theta} \mathcal{L}(\theta))$ corresponding to the red dot on the graph of our loss function $L(\theta)$. Moreover, it is worth noting that for every two real $\binom{n}{2}$-dimensional vectors $\theta^{(i)}$ and $\theta^{(i+1)}$, where $\mathcal{L}(\theta^{(i+1)}) \leq \mathcal{L}(\theta^{(i)})$, the line that joins the points $(\theta^{(i+1)}, \mathcal{L}(\theta^{(i)}))$ and $(\theta^{(i)}, \mathcal{L}(\theta^{(i)}))$ remains above $\mathcal{L}(\theta)$, which is observed in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7. Therefore, using the proposed loss function $L(\theta)$ (14) along with $\mathcal{L}(\theta)$ (19) guarantees global minimum convergence.
In the fifth and rightmost columns of Table A1, we compare the number of frequent itemsets (FIs) mined from all the partitions of a given dataset $D$ with the FIs mined from each single cluster $C_j$ consisting of similar partitions from the same dataset, where $\bigcap_{j=1}^{k} C_j = \emptyset$ and $\bigcup_{j=1}^{k} C_j = D$. We notice that mining all the partitions from the datasets Iris [48] and Zoo [48] did not result in discovering any valid frequent itemset, whereas mining each individual cluster of partitions from the datasets Iris and Zoo led to the discovery of new patterns in each cluster $C_j$.
In Table A3, we report the similarity levels δ o p t at which the clustering evaluation measures g o o d n e s s ( D ) [20], g o o d n e s s 2 ( D ) [23], g o o d n e s s 3 ( D ) [21], the silhouette coefficient S C [43] and our proposed loss function L ( θ ) attain their optimal values in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7. We note that the fraction | { δ 1 , … , δ s t o p } | / | { δ 1 , … , δ m } | in Table A3 represents the number of similarity levels required to test the convergence and terminate, divided by the total number of similarity levels m, and that o p t is the index of the optimal similarity level according to a given clustering quality measure. Since our proposed algorithm is based on a convex loss function, we have s t o p = o p t < m . On the other hand, for the compared algorithms, which are based on non-convex objectives, we have s t o p = m . Therefore, our algorithm requires the smallest number of similarity levels ( o p t out of m) to converge and terminate, which makes it faster than the compared algorithms in [21,22,23], which must generate and evaluate all m candidate clusterings before returning the optimal one.
All the previous results confirm that using our loss function L ( θ ) (14) along with L ( θ ) (19), we have identified the ideal clustering for which the silhouette coefficient S C [43] is maximum and we have also improved the quality of the frequent itemsets (FIs) mined from the multiple databases partitioned from the datasets in Table A1.
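For readers who wish to reproduce this kind of check, the sketch below shows one way to score a candidate clustering with the silhouette coefficient [43] from a precomputed pairwise similarity matrix. It assumes the common convention that the dissimilarity between two databases is 1 − s i m ( D p , D q ) ; the helper name silhouette_of_candidate is ours and the exact computation used in our experiments may differ in detail.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_of_candidate(similarity, labels):
    """Score a candidate clustering of n databases, given their n x n
    pairwise similarity matrix (assumption: dissimilarity = 1 - similarity).
    `labels[i]` is the cluster index assigned to database D_{i+1}."""
    distance = 1.0 - np.asarray(similarity, dtype=float)
    np.fill_diagonal(distance, 0.0)  # a database is identical to itself
    # silhouette_score requires at least 2 and at most n - 1 distinct labels
    return silhouette_score(distance, labels, metric="precomputed")

# Hypothetical usage for a candidate clustering generated at some level delta:
# sc = silhouette_of_candidate(sim_matrix, labels)
```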

4.4. Clustering Error and Running Time Analysis

In this experimental part, we compare the running time of the proposed clustering algorithm with the execution times of two clustering algorithms for multi-database mining (MDM), namely GDMDBClustering [25] and BestDatabaseClustering [22], all run on the same random data samples. We also measure how much the clusterings produced by our algorithm and the compared models differ from the ground-truth clustering. For this purpose, we propose the error function in (21), which measures the difference between two given clusterings P and Q .
First, for each n = 30 , … , N = 120 , we generated n isotropic Gaussian blobs using the scikit-learn generator [50], such that the number of features of each sample of n blobs is set to r a n d o m . r a n d i n t ( 2 , 10 ) , while the number of clusters is set to n / 2 . In Table 7, we present a brief summary of the random blobs generated via scikit-learn [50].
Afterward, we use min-max scaling [51] to normalize each feature into the interval [0, 1]. Then, for each sample of n blobs, every pair of blobs is passed as arguments to the function s i m (3) in order to compute the (n² − n)/2 pairwise similarities between the n blobs. We then run the proposed algorithm, GDMDBClustering [25] (with three different learning rate values) and BestDatabaseClustering [22] on each of the resulting similarity matrices ( n = 30 , … , 120 ), plot their running time graphs in Figure A8a, Figure A9a and Figure A10a, and plot the clustering error graphs in Figure A8b, Figure A9b and Figure A10b.
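As an illustration, the following minimal sketch generates and normalizes one such random sample; it is not our exact experimental script. The helper name generate_scaled_blobs and the fixed random seed are our own choices, and the call to the similarity function s i m (3) and the clustering step are omitted.

```python
import random
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

def generate_scaled_blobs(n, seed=0):
    """Generate n isotropic Gaussian blobs with a random number of features
    in [2, 10] and n // 2 ground-truth clusters, then min-max scale every
    feature into [0, 1]."""
    n_features = random.randint(2, 10)
    X, y = make_blobs(n_samples=n, n_features=n_features,
                      centers=n // 2, random_state=seed)
    X = MinMaxScaler().fit_transform(X)  # each feature now lies in [0, 1]
    return X, y                          # y holds the ground-truth cluster of each blob

# Example: the smallest sample used in the experiments (n = 30 blobs).
X, y = generate_scaled_blobs(30)
```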
Without loss of generality, assume Q is the ground-truth clustering (i.e., the actual clusters) of the current n blobs D = { D 1 , D 2 , … , D n } generated via scikit-learn [50], and assume P is the partitioning of D produced by a given clustering algorithm. To measure how far P is from Q , we define the error function E n ( P , Q ) ∈ [ 0 , 1 ] as follows:
E n ( P , Q ) = ( | P a i r s Q \ P a i r s P | + | P a i r s P \ P a i r s Q | ) / ( | P a i r s Q | + | P a i r s P | )    (21)
where | P a i r s P | is the number of database pairs obtained from every cluster in P and | P a i r s P \ P a i r s Q | is the number of database pairs that only exist in P a i r s P and cannot be found in P a i r s Q . We note that E n ( P , Q ) approaches its maximum value of 1 (i.e., E n ( P , Q ) → 1 ) when P and Q are very different and share few database pairs in common (i.e., | P a i r s P ∩ P a i r s Q | → 0 ). Conversely, E n ( P , Q ) → 0 when the clusterings P and Q are very similar, i.e., they share the maximum number of pairs ( D p , D q ) .
We also define the average of the N − n + 1 clustering errors, which could also be seen as the mean absolute clustering error:
E ( P , Q ) ¯ = ( 1 / ( N − n + 1 ) ) ∑ i = n N E i ( P , Q ) , with n = 30 and N = 120    (22)
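A minimal sketch of how E n ( P , Q ) (21) can be computed is given below, assuming each clustering is represented as a list of sets of database identifiers; the helper names pair_set and clustering_error are ours. The mean absolute clustering error (22) is then simply the average of these values over the 91 samples.

```python
from itertools import combinations

def pair_set(clustering):
    """Set of unordered database pairs that share a cluster."""
    return {frozenset(pair)
            for cluster in clustering
            for pair in combinations(sorted(cluster), 2)}

def clustering_error(P, Q):
    """E_n(P, Q) as in (21): symmetric difference of the two pair sets,
    normalized by the total number of pairs in both clusterings."""
    pairs_p, pairs_q = pair_set(P), pair_set(Q)
    denom = len(pairs_p) + len(pairs_q)
    return len(pairs_p ^ pairs_q) / denom if denom else 0.0  # convention: 0 if no pairs

# Example: predicted clustering P versus ground truth Q over 7 databases.
P = [{'D1', 'D2', 'D3'}, {'D4', 'D5', 'D6', 'D7'}]
Q = [{'D1', 'D2', 'D3'}, {'D4'}, {'D5', 'D6', 'D7'}]
print(clustering_error(P, Q))   # (0 + 3) / (6 + 9) = 0.2

# Mean absolute clustering error (22) over the generated sample sizes:
# errors = [clustering_error(P_n, Q_n) for n in range(30, 121)]
# mean_error = sum(errors) / len(errors)
```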
From the results obtained in Figure A8a, Figure A9a and Figure A10a, we observe a rapid increase in the running time of BestDatabaseClustering [22] as the number of generated blobs n increases linearly. This is due to the fact that BestDatabaseClustering needs to generate and evaluate approximately (n² − n)/2 candidate clusterings in order to find the optimal clustering for which the non-convex function g o o d n e s s ( D ) [20] is maximum. In fact, g o o d n e s s ( D ) suffers from the existence of local maxima, which requires exploring all the local candidate solutions in order to find the global maximum. On the other hand, using the proposed convex loss function L ( θ ) and the quasi-convex convergence test function L ( θ ) allows us to stop the clustering process at L ( arg min θ L ( θ ) ) . Consequently, this avoids generating unnecessary candidate clusterings and hence reduces the CPU overhead. Since our algorithm is independent of the learning rate η , its running time is the same in Figure A8a, Figure A9a and Figure A10a. In contrast, the running time of GDMDBClustering [25] increases for smaller learning rates (e.g., Figure A10) and decreases for larger ones (e.g., Figure A9), but this comes at the cost of an increased clustering error.
Next, by examining the three clustering error graphs in Figure A8b, Figure A9b and Figure A10b, we observe that BestDatabaseClustering [22] has the largest clustering error among the three algorithms, with an average clustering error E ( P , Q ) ¯ = 0.936 . In fact, on average, BestDatabaseClustering [22] tends to group all the current n blobs ( n = 30 , … , 120 ) into one single cluster. On the other hand, our proposed algorithm and GDMDBClustering [25] produce clusterings that are close to the ground-truth clustering predetermined by the scikit-learn generator [50]. In fact, the average clustering error of our algorithm is E ( P , Q ) ¯ = 0.285 . For GDMDBClustering [25], we get E ( P , Q ) ¯ = 0.285 when the learning rate is η = 0.0005 or η = 0.001 , and the error increases to E ( P , Q ) ¯ = 0.29 when η = 0.002 . The average running times and clustering errors of our algorithm, GDMDBClustering [25] and BestDatabaseClustering [22] are summarized in Table A4.
Our algorithm and GDMDBClustering [25] terminate once the global minimum of the convergence test function L ( θ ) is reached. Consequently, their running times are, most of the time, shorter than that of BestDatabaseClustering [22]. Overall, the running time of GDMDBClustering [25] stays relatively steady with respect to n; however, GDMDBClustering depends strongly on the learning step size η and its decay rate. On the other hand, our algorithm is learning-rate-free and needs at most (in the worst case) (n² − n)/2 iterations to converge. Consequently, our proposed algorithm is faster than both BestDatabaseClustering [22] and GDMDBClustering [25].
To verify that the superiority of the proposed clustering model in terms of running time and clustering accuracy is statistically significant, we applied the Friedman test [52] (under a significance level α = 0.05 ) to the measurements (the execution times and clustering errors depicted in Figure A8, Figure A9 and Figure A10) obtained by our algorithm, BestDatabaseClustering [22] and GDMDBClustering [25] (with three different values of the learning rate η ), considering all the random samples in Table 7.
After conducting the Friedman test [52], we obtained the results shown in Table A5, Table A6 and Table A7, namely the average running time, the average clustering error E ( P , Q ) ¯ (22), the standard deviation (SD), the variance (Var), the test statistic (stat) and its p-value for all the tested clustering algorithms, considering all 91 random samples generated via scikit-learn [50].
We notice that all the results in Table A5, Table A6 and Table A7 show p-values below the significance level α = 0.05 . Consequently, the test rejects the null hypothesis that the compared clustering models perform similarly. In fact, the proposed clustering algorithm significantly outperforms the other compared models, as it has the shortest average running time (6.367 milliseconds) and the lowest average clustering error ( E ( P , Q ) ¯ = 0.285 ) among all the compared models.
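For reference, the Friedman test itself can be run with scipy.stats.friedmanchisquare, as in the sketch below. The arrays here are random placeholders standing in for the 91 per-sample measurements behind Figure A8, Figure A9 and Figure A10; they are not the paper's actual data.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Placeholder per-sample measurements (one value per random sample) for the
# three compared algorithms, e.g., running times or clustering errors.
proposed = np.random.rand(91)
best_database_clustering = np.random.rand(91)
gd_mdb_clustering = np.random.rand(91)

stat, p_value = friedmanchisquare(proposed, best_database_clustering, gd_mdb_clustering)
alpha = 0.05
if p_value < alpha:
    print(f"stat = {stat:.3f}, p = {p_value:.2e}: reject H0 (performances differ)")
else:
    print(f"stat = {stat:.3f}, p = {p_value:.2e}: fail to reject H0")
```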

4.5. Clustering Comparison and Assessment

In the third part of our experiments, we are interested in using some information retrieval measures to compare the clusterings produced by our algorithm and some other clustering algorithms with the ground-truth data.
Let D = { D 1 , D 2 , … , D n } be n transactional databases. Let P = { P 1 , P 2 , … , P k } be a k-class clustering of D produced by any given clustering algorithm, and let Q = { Q 1 , Q 2 , … , Q l } be the ground-truth clustering of the databases in D , such that ⋃ i = 1 k { P i } = ⋃ i = 1 l { Q i } = D and ⋂ i = 1 k { P i } = ⋂ i = 1 l { Q i } = ∅ . Let us define P a i r s P and P a i r s Q as the sets of database pairs obtained from each cluster of the corresponding clustering. That is, P a i r s P = ⋃ P t ∈ P ⋃ D r , D s ∈ P t ; r < s { ( D r , D s ) } and P a i r s Q = ⋃ Q t ∈ Q ⋃ D r , D s ∈ Q t ; r < s { ( D r , D s ) } . To compare the clustering P with Q , a few methods [53,54,55] can be used. In this paper, we use pair counting [56,57,58,59] to calculate some information retrieval measures [60,61], including precision, recall, F-measure (i.e., the harmonic mean of recall and precision), the Rand index [62] and the Jaccard index [63], over pairs of databases being clustered together in P and/or Q . This allows us to assess whether the database pairs predicted from P also cluster together in Q , i.e., whether the discovered database pairs in P a i r s P are correct with respect to the underlying true pairs in P a i r s Q from the ground-truth clustering Q .
In Table A9, we show the categories of database pairs, which represent the working set of all the pair counting measures cited in Table A10. Precisely, a represents the number of pairs that exist in both clusterings Q and P , b is the number of pairs present only in clustering Q , c is the number of pairs present only in clustering P , and d represents the number of pairs that exist in neither clustering. By counting the pairs in each category, we obtain an indicator of the agreement and disagreement between the two clusterings being compared. The following example illustrates how to compute the measures defined in Table A10 for a given clustering P = { { D 1 , D 2 , D 3 } , { D 4 , D 5 , D 6 , D 7 } } and the ground-truth partitioning Q = { { D 1 , D 2 , D 3 } , { D 4 } , { D 5 , D 6 , D 7 } } of seven transaction databases D = ⋃ i = 1 7 { D i } . First, let us calculate the following pairing categories:
P a i r s D = ⋃ r = 1 6 ⋃ s = r + 1 7 { ( D r , D s ) } , P a i r s Q = { ( D 6 , D 7 ) , ( D 5 , D 7 ) , ( D 5 , D 6 ) , ( D 2 , D 3 ) , ( D 1 , D 3 ) , ( D 1 , D 2 ) } , P a i r s P = { ( D 6 , D 7 ) , ( D 5 , D 7 ) , ( D 5 , D 6 ) , ( D 4 , D 7 ) , ( D 4 , D 6 ) , ( D 4 , D 5 ) , ( D 2 , D 3 ) , ( D 1 , D 3 ) , ( D 1 , D 2 ) } .
Then, a = | P a i r s Q P a i r s P | = 6 , b = | P a i r s Q \ P a i r s P | = 0 , c = | P a i r s P \ P a i r s Q | = 3 , d = | P a i r s D \ ( P a i r s Q P a i r s P ) | = 12 . Therefore, we get the following measures: F - m e a s u r e = 0.8 , p r e c i s i o n = 0.66 , r e c a l l = 1.0 , R a n d = 0.857 , J a c c a r d = 0.66 . We note that the higher the values of the evaluation measures given in Table A10, the better the matching of the clustering P to its corresponding ground-truth clustering Q .
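The worked example above can be checked with the following sketch, which builds P a i r s P and P a i r s Q from the two clusterings, counts the categories a, b, c, d of Table A9 and evaluates the measures of Table A10. The function name pair_counting_measures is ours, and the sketch assumes that at least one pair is predicted in P and at least one pair exists in Q .

```python
from itertools import combinations

def pair_counting_measures(P, Q, D):
    """Compute precision, recall, F-measure, Rand and Jaccard (Table A10)
    from the pair categories a, b, c, d of Table A9."""
    pairs = lambda clustering: {frozenset(p)
                                for cluster in clustering
                                for p in combinations(sorted(cluster), 2)}
    pairs_p, pairs_q = pairs(P), pairs(Q)
    pairs_d = {frozenset(p) for p in combinations(sorted(D), 2)}
    a = len(pairs_q & pairs_p)               # pairs in both clusterings
    b = len(pairs_q - pairs_p)               # pairs only in Q
    c = len(pairs_p - pairs_q)               # pairs only in P
    d = len(pairs_d - (pairs_q | pairs_p))   # pairs in neither clustering
    return {"precision": a / (a + c),
            "recall": a / (a + b),
            "F-measure": 2 * a / (2 * a + b + c),
            "Rand": (a + d) / (a + b + c + d),
            "Jaccard": a / (a + b + c)}

D = {f"D{i}" for i in range(1, 8)}
P = [{"D1", "D2", "D3"}, {"D4", "D5", "D6", "D7"}]
Q = [{"D1", "D2", "D3"}, {"D4"}, {"D5", "D6", "D7"}]
print(pair_counting_measures(P, Q, D))
# a=6, b=0, c=3, d=12 -> precision~0.66, recall=1.0, F-measure=0.8, Rand~0.857, Jaccard~0.66
```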
In Table A8 and Table A11, we report the F-measure [60,61], precision [60,61], recall [60,61], Rand [62] and Jaccard [63] reached by the clustering algorithms in [21,22,23], and our proposed algorithm on the datasets shown in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7. From Table A8 and Table A11, we notice that our algorithm achieves the best scores against the compared clustering algorithms, considering all the experiments in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7.

5. Conclusions

An improved similarity-based clustering algorithm for multi-database mining was proposed in this paper. Unlike the previous works, our algorithm minimizes a convex clustering quality measure in a smaller, upper-bounded number of iterations. In addition, we proposed a preprocessing layer prior to clustering, in which the pairwise similarities between the multiple databases are first adjusted to reduce their fuzziness; this helps the clustering process discriminate more reliably between the different database clusters. To assess the performance of our algorithm, we conducted several experiments on real and synthetic datasets. Compared with the existing clustering algorithms for multi-database mining, our algorithm achieved the best performance in terms of accuracy and running time.

In this paper, we used the frequent itemsets mined from each transaction database as feature sets to compute the pairwise similarities between the multiple databases. However, when the sizes of these input vectors become large, building the similarity matrix increases the CPU overhead drastically. Moreover, the existence of some noisy frequent itemsets (FIs) may largely influence how databases are clustered together. In future work, we will investigate the impact of compressing the FIs into a latent variable represented in a lower-dimensional space with discriminative features. Practically, reconstituting the input vectors from the embedding space using deep auto-encoders and non-linear dimensionality reduction techniques, such as T-SNE (t-distributed stochastic neighbor embedding) and UMAP (uniform manifold approximation and projection), would remove the noisy features present in the input data while keeping only the meaningful discriminative ones. Consequently, this may help improve the accuracy and running time of the clustering algorithm.

Additionally, we are interested in exploring new ways to reduce the computational time needed to calculate the similarity matrix via locality sensitive hashing (LSH) techniques, such as BagMinHash for weighted sets. These methods encode the feature-set vectors into hash-code signatures in order to efficiently estimate the Jaccard similarity between the local transactional databases. Last but not least, in order to design a parallel version of the proposed algorithm, we will study and explore some high-performance computing tools, such as MapReduce and Spark, as an attempt to further improve the clustering performance for multi-database mining.

Author Contributions

Conceptualization, S.M.; methodology, S.M.; software, S.M.; validation, S.M.; formal analysis, S.M.; investigation, S.M.; resources, S.M.; data curation, S.M.; writing—original draft preparation, S.M.; writing—review and editing, S.M.; visualization, S.M., W.D.; supervision, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All datasets are available at http://fimi.ua.ac.be/data/ (accessed on 25 April 2021) and https://archive.ics.uci.edu/ml/datasets (accessed on 25 April 2021).

Acknowledgments

We would like to thank the anonymous reviewers for their time and their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
FIs: Frequent Itemsets
FIM: Frequent Itemset Mining
MDB: Multiple Databases
MDM: Multi-database Mining
CD: Coordinate Descent
CL: Competitive Learning
BMU: Best Matching Unit
T-SNE: t-Distributed Stochastic Neighbor Embedding
UMAP: Uniform Manifold Approximation and Projection
LSH: Locality Sensitive Hashing

Appendix A

Table A1. Description of the datasets used in our experiments, with their database partitions, the number of frequent itemsets (FIs) mined from each database partition D i , and the number of FIs mined from each cluster C j under a threshold α .
Mushroom [48] (2 classes), 8124 rows:
- Partitions: | D 1 | = 3916 ( C 1 ), | D 2 | = 1402 ( C 2 ), | D 3 | = 1402 ( C 2 ), | D 4 | = 1404 ( C 2 ).
- | F I S ( D i , 0.5 ) | per partition: D 1 : 375, D 2 : 2063, D 3 : 32,911, D 4 : 807.
- | F I S ( D , 0.5 ) | from the whole dataset: 151.
- Ground-truth clustering: C 1 = { D 1 } , C 2 = { D 4 , D 3 , D 2 } .
- | F I S ( C j , 0.5 ) | per cluster: C 1 : 375, C 2 : 1441.

Zoo [48] (7 classes), 101 rows:
- Partitions: | D 1 | = 20 ( C 1 ), | D 2 | = 21 ( C 1 ), | D 3 | = 10 ( C 2 ), | D 4 | = 10 ( C 2 ), | D 5 | = 5 ( C 3 ), | D 6 | = 6 ( C 4 ), | D 7 | = 7 ( C 4 ), | D 8 | = 2 ( C 5 ), | D 9 | = 2 ( C 5 ), | D 10 | = 4 ( C 6 ), | D 11 | = 4 ( C 6 ), | D 12 | = 10 ( C 7 ).
- | F I S ( D i , 0.5 ) | per partition: D 1 : 24,383, D 2 : 30,975, D 3 : 30,719, D 4 : 32,767, D 5 : 20,479, D 6 : 65,535, D 7 : 65,535, D 8 : 114,687, D 9 : 98,303, D 10 : 53,247, D 11 : 57,343, D 12 : 28,671.
- | F I S ( D , 0.5 ) | from the whole dataset: 0.
- Ground-truth clustering: C 1 = { D 2 , D 1 } , C 2 = { D 4 , D 3 } , C 3 = { D 5 } , C 4 = { D 7 , D 6 } , C 5 = { D 9 , D 8 } , C 6 = { D 11 , D 10 } , C 7 = { D 12 } .
- | F I S ( C j , 0.5 ) | per cluster: C 1 : 25,087, C 2 : 28,671, C 3 : 2479, C 4 : 49,151, C 5 : 57,343, C 6 : 45,055, C 7 : 28,671.

Iris [48] (3 classes), 150 rows:
- Partitions: | D 1 | = … = | D 6 | = 25; D 1 , D 2 ( C 1 ), D 3 , D 4 ( C 2 ), D 5 , D 6 ( C 3 ).
- | F I S ( D i , 0.2 ) | per partition: D 1 : 5, D 2 : 6, D 3 : 2, D 4 : 2, D 5 : 2, D 6 : 5.
- | F I S ( D , 0.2 ) | from the whole dataset: 0.
- Ground-truth clustering: C 1 = { D 2 , D 1 } , C 2 = { D 4 , D 3 } , C 3 = { D 6 , D 5 } .
- | F I S ( C j , 0.2 ) | per cluster: C 1 : 3, C 2 : 1, C 3 : 2.

T10I4D100K [49] (unknown classes), 100,000 rows:
- Partitions: | D i | = 10,000 rows, i = 1 , … , 10 .
- | F I S ( D i , 0.03 ) | per partition: D 1 : 58, D 2 : 58, D 3 : 62, D 4 : 57, D 5 : 62, D 6 : 63, D 7 : 63, D 8 : 59, D 9 : 61, D 10 : 62.
- | F I S ( D , 0.03 ) | from the whole dataset: 50.
- Ground-truth clustering: seven clusters found via the silhouette coefficient [43]: C 1 = { D 1 } , C 2 = { D 2 } , C 3 = { D 3 } , C 4 = { D 5 , D 4 } , C 5 = { D 6 } , C 6 = { D 7 } , C 7 = { D 10 , D 9 , D 8 } .
- | F I S ( C j , 0.03 ) | per cluster: C 1 : 58, C 2 : 58, C 3 : 62, C 4 : 59, C 5 : 63, C 6 : 59, C 7 : 61.

Figure A1 [20] (unknown classes), 24 rows:
- Partitions: | D 1 | = 3, | D 2 | = 3, | D 3 | = 3, | D 4 | = 4, | D 5 | = 4, | D 6 | = 3, | D 7 | = 4.
- | F I S ( D i , 0.42 ) | per partition: D 1 : 3, D 2 : 3, D 3 : 5, D 4 : 7, D 5 : 7, D 6 : 5, D 7 : 3.
- | F I S ( D , 0.42 ) | from the whole dataset: 0.
- Ground-truth clustering: three clusters found via the silhouette coefficient [43]: C 1 = { D 3 , D 2 , D 1 } , C 2 = { D 4 } , C 3 = { D 7 , D 6 , D 5 } .
- | F I S ( C j , 0.42 ) | per cluster: C 1 : 3, C 2 : 7, C 3 : 3.
Table A2. Clustering results illustrated in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7 after using the clustering quality measures goodness 3 ( D ) [21], goodness 2 ( D ) [23], the silhouette coefficient S C ( D ) [43], goodness ( D ) [20] and our proposed objective function L ( θ ) , where δ is the level of similarity at which each clustering evaluation/loss function reaches its optimal value.
Figure A1, 7 × 7 [20]:
- Silhouette coefficient [43]: max S C ( D ) = 0.46 at δ = 0.444.
- Proposed objective: L ( arg min θ L ( θ ) ) = 2.004 at δ = 0.444; clusters C 1 = { D 3 , D 2 , D 1 } , C 2 = { D 4 } , C 3 = { D 7 , D 6 , D 5 } .
- goodness ( D ) [20]: max = 15.407 at δ = 0.444; clusters C 1 = { D 3 , D 2 , D 1 } , C 2 = { D 4 } , C 3 = { D 7 , D 6 , D 5 } .
- goodness 2 ( D ) [23]: min = 0.259 at δ = 0.065; clusters C 1 = { D 7 , … , D 1 } .
- goodness 3 ( D ) [21]: max = 0.728 at δ = 0.086; clusters C 1 = { D 3 , D 2 , D 1 } , C 2 = { D 7 , … , D 4 } .

Figure A2, 12 × 12, Zoo [48]:
- Silhouette coefficient [43]: max S C ( D ) = 0.41 at δ = 0.559.
- Proposed objective: L ( arg min θ L ( θ ) ) = 7.71 at δ = 0.559; clusters C 1 = { D 2 , D 1 } , C 2 = { D 4 , D 3 } , C 3 = { D 5 } , C 4 = { D 7 , D 6 } , C 5 = { D 9 , D 8 } , C 6 = { D 11 , D 10 } , C 7 = { D 12 } .
- goodness ( D ) [20]: max = 32.98 at δ = 0.559; clusters C 1 = { D 2 , D 1 } , C 2 = { D 4 , D 3 } , C 3 = { D 5 } , C 4 = { D 7 , D 6 } , C 5 = { D 9 , D 8 } , C 6 = { D 11 , D 10 } , C 7 = { D 12 } .
- goodness 2 ( D ) [23]: min = 0.57 at δ = 0.348; clusters C 1 = { D 12 , … , D 1 } .
- goodness 3 ( D ) [21]: max = 0.42 at δ = 0.348; clusters C 1 = { D 12 , … , D 1 } .

Figure A3, 4 × 4, Mushroom [48]:
- Silhouette coefficient [43]: max S C ( D ) = 0.08 at δ = 0.41.
- Proposed objective: L ( arg min θ L ( θ ) ) = 0.43 at δ = 0.41; clusters C 1 = { D 1 } , C 2 = { D 4 , D 3 , D 2 } .
- goodness ( D ) [20]: max = 1.672 at δ = 0.365; clusters C 1 = { D 4 , … , D 1 } .
- goodness 2 ( D ) [23]: min = 0.55 at δ = 0.365; clusters C 1 = { D 4 , … , D 1 } .
- goodness 3 ( D ) [21]: max = 0.68 at δ = 0.41; clusters C 1 = { D 1 } , C 2 = { D 4 , D 3 , D 2 } .

Figure A4, 6 × 6, Iris [48]:
- Silhouette coefficient [43]: max S C ( D ) = 0.304 at δ = 0.3.
- Proposed objective: L ( arg min θ L ( θ ) ) = 1.10 at δ = 0.3; clusters C 1 = { D 2 , D 1 } , C 2 = { D 4 , D 3 } , C 3 = { D 6 , D 5 } .
- goodness ( D ) [20]: max = 9.64 at δ = 0.3; clusters C 1 = { D 2 , D 1 } , C 2 = { D 4 , D 3 } , C 3 = { D 6 , D 5 } .
- goodness 2 ( D ) [23]: min = 0.55 at δ = 0.3; clusters C 1 = { D 2 , D 1 } , C 2 = { D 4 , D 3 } , C 3 = { D 6 , D 5 } .
- goodness 3 ( D ) [21]: max = 0.44 at δ = 0.3; clusters C 1 = { D 2 , D 1 } , C 2 = { D 4 , D 3 } , C 3 = { D 6 , D 5 } .

Figure A5, 6 × 6, Zoo & Mushroom [48]:
- Silhouette coefficient [43]: max S C ( D ) = 0.63 at δ = 0.384.
- Proposed objective: L ( arg min θ L ( θ ) ) = 0.5 at δ = 0.384; clusters C 1 = { D 2 , D 1 } , C 2 = { D 6 , … , D 3 } .
- goodness ( D ) [20]: max = 9.96 at δ = 0.384; clusters C 1 = { D 2 , D 1 } , C 2 = { D 6 , … , D 3 } .
- goodness 2 ( D ) [23]: min = 0.40 at δ = 0.384; clusters C 1 = { D 2 , D 1 } , C 2 = { D 6 , … , D 3 } .
- goodness 3 ( D ) [21]: max = 0.85 at δ = 0.384; clusters C 1 = { D 2 , D 1 } , C 2 = { D 6 , … , D 3 } .

Figure A6, 4 × 4 [39]:
- Silhouette coefficient [43]: max S C ( D ) = 0.34 at δ = 0.429.
- Proposed objective: L ( arg min θ L ( θ ) ) = 1.12 at δ = 0.429; clusters C 1 = { D 3 , D 2 , D 1 } , C 2 = { D 4 } .
- goodness ( D ) [20]: max = 2.708 at δ = 0.25; clusters C 1 = { D 4 , … , D 1 } .
- goodness 2 ( D ) [23]: min = 0.38 at δ = 0.25; clusters C 1 = { D 4 , … , D 1 } .
- goodness 3 ( D ) [21]: max = 0.81 at δ = 0.429; clusters C 1 = { D 3 , D 2 , D 1 } , C 2 = { D 4 } .

Figure A7, 10 × 10, T10I4D100K [49]:
- Silhouette coefficient [43]: max S C ( D ) = 0.115 at δ = 0.846.
- Proposed objective: L ( arg min θ L ( θ ) ) = 0.71 at δ = 0.846; clusters C 1 = { D 1 } , C 2 = { D 2 } , C 3 = { D 3 } , C 4 = { D 5 , D 4 } , C 5 = { D 6 } , C 6 = { D 7 } , C 7 = { D 10 , D 9 , D 8 } .
- goodness ( D ) [20]: max = 35.275 at δ = 0.737; clusters C 1 = { D 10 , … , D 1 } .
- goodness 2 ( D ) [23]: min = 0.193 at δ = 0.737; clusters C 1 = { D 10 , … , D 1 } .
- goodness 3 ( D ) [21]: max = 0.806 at δ = 0.737; clusters C 1 = { D 10 , … , D 1 } .
Figure A1. (a): represents a similarity matrix between 7 databases from [25]. We note that (a) is built by calling s i m (3) on the frequent itemsets (FIs) mined from D p ( p = 1 , , 7 ) under a threshold α = 0.42 . (b): represents the graph plots corresponding to goodness 3 ( D ) × 10 [21], goodness 2 ( D ) × 10 [23], goodness ( D ) [20], the silhouette coefficient S C × 10 [43], our convergence test function L ( θ ) , our proposed loss function L ( θ ) × 5 and the number of generated clusters.
Figure A2. (a): represents a similarity matrix between 12 databases partitioned from Zoo dataset [48]. We note that (a) is built by calling s i m (3) on the frequent itemsets (FIs) mined from D p ( p = 1 , , 12 ) under a threshold α = 0.5 . (b): represents the graph plots corresponding to goodness 3 ( D ) × 10 [21], goodness 2 ( D ) × 10 [23], goodness ( D ) [20], the silhouette coefficient S C × 10 [43], our convergence test function L ( θ ) ÷ 10 , our proposed loss function L ( θ ) × 2 and the number of generated clusters.
Figure A3. (a): represents a similarity matrix between 4 databases partitioned from Mushroom dataset [48] without applying the fuzziness reduction model in Figure 2. We note that (a) is built by calling s i m (3) on the frequent itemsets (FIs) mined from D p ( p = 1 , , 4 ) under a threshold α = 0.5 . (b): represents the graph plots corresponding to goodness 3 ( D ) × 10 [21], goodness 2 ( D ) × 10 [23], goodness ( D ) [20], the silhouette coefficient S C × 10 [43], our convergence test function L ( θ ) , our proposed loss function L ( θ ) × 10 and the number of generated clusters.
Figure A4. (a): represents a similarity matrix between 6 databases partitioned from Iris dataset [48]. We note that (a) is built by calling s i m (3) on the frequent itemsets (FIs) mined from D p ( p = 1 , , 6 ) under a threshold α = 0.2 . (b): represents the graph plots corresponding to goodness 3 ( D ) × 10 [21], goodness 2 ( D ) × 10 [23], goodness ( D ) [20], the silhouette coefficient S C × 10 [43], our convergence test function L ( θ ) , our proposed loss function L ( θ ) × 10 and the number of generated clusters.
Figure A5. (a): represents a similarity matrix between 6 databases including 4 databases partitioned from the real dataset Mushroom [48] and 2 databases partitioned from the real dataset Zoo [48]. We note that (a) is built by calling s i m (3) on the frequent itemsets (FIs) mined from D p ( p = 1 , , 6 ) under a threshold α = 0.5 . (b): represents the graph plots corresponding to goodness 3 ( D ) × 10 [21], goodness 2 ( D ) × 10 [23], goodness ( D ) [20], the silhouette coefficient S C × 10 [43], our convergence test function L ( θ ) , our proposed loss function L ( θ ) × 10 and the number of generated clusters.
Figure A6. (a): represents a similarity matrix between 4 databases from [39]. (b): represents the graph plots corresponding to goodness 3 ( D ) × 10 [21], goodness 2 ( D ) × 10 [23], goodness ( D ) [20], the silhouette coefficient S C × 10 [43], our convergence test function L ( θ ) , our proposed loss function L ( θ ) × 10 and the number of generated clusters.
Figure A7. (a): represents a similarity matrix between 10 databases partitioned from T10I4D100K [49]. We note that (a) is built by calling s i m (3) on the frequent itemsets (FIs) mined from D p ( p = 1 , , 10 ) under a threshold α = 0.03 . (b): represents the graph plots corresponding to goodness 3 ( D ) × 10 [21], goodness 2 ( D ) × 10 [23], goodness ( D ) [20], the silhouette coefficient S C × 10 [43], our convergence test function L ( θ ) ÷ 10 , our proposed loss function L ( θ ) × 10 and the number of generated clusters.
Table A3. The similarity levels δ o p t at which the clustering evaluation measures goodness 3 ( D ) [21], goodness 2 ( D ) [23], the silhouette coefficient S C ( D ) [43], goodness ( D ) [20] and our proposed objective function L ( θ ) attain their optimal values in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7. The fraction | { δ 1 , … , δ stop } | / | { δ 1 , … , δ m } | represents the number of similarity levels required to test the convergence and terminate, divided by the total number of similarity levels m.

Dataset D | Silhouette coefficient [43]: δ opt (stop/m) | Proposed loss L ( θ ): δ opt (stop/m) | goodness ( D ) [20]: δ opt (stop/m) | goodness 2 ( D ) [23]: δ opt (stop/m) | goodness 3 ( D ) [21]: δ opt (stop/m)
Figure A1, 7 × 7 [20] | 0.444 (10/10) | 0.444 (5/10) | 0.444 (10/10) | 0.065 (10/10) | 0.086 (10/10)
Figure A2, 12 × 12, Zoo [48] | 0.559 (48/48) | 0.559 (5/48) | 0.559 (48/48) | 0.348 (48/48) | 0.348 (48/48)
Figure A3, 4 × 4, Mushroom [48] | 0.41 (4/4) | 0.41 (2/4) | 0.365 (4/4) | 0.365 (4/4) | 0.41 (4/4)
Figure A4, 6 × 6, Iris [48] | 0.3 (6/6) | 0.3 (3/6) | 0.3 (6/6) | 0.3 (6/6) | 0.3 (6/6)
Figure A5, 6 × 6, Zoo & Mushroom [48] | 0.384 (8/8) | 0.384 (7/8) | 0.384 (8/8) | 0.384 (8/8) | 0.384 (8/8)
Figure A6, 4 × 4 [39] | 0.429 (6/6) | 0.429 (3/6) | 0.25 (6/6) | 0.25 (6/6) | 0.429 (6/6)
Figure A7, 10 × 10, T10I4D100K [49] | 0.846 (31/31) | 0.846 (4/31) | 0.737 (31/31) | 0.737 (31/31) | 0.737 (31/31)
Table A4. Summary of the average running times and the average clustering errors E ( P , Q ) ¯ (22) for the proposed algorithm, BestDatabaseClustering [22] and GDMDBClustering [25] (with three different values for the learning rate η ) on the random samples described in Table 7.
Experiment | Proposed Algo: avg. running time / avg. clustering error | BestDatabaseClustering [22]: avg. running time / avg. clustering error | GDMDBClustering [25]: avg. running time / avg. clustering error
Figure A8 (GDMDBClustering with η = 0.001) | 6.367 / 0.285 | 47.208 / 0.936 | 14.825 / 0.285
Figure A9 (GDMDBClustering with η = 0.002) | 6.367 / 0.285 | 47.208 / 0.936 | 7.305 / 0.290
Figure A10 (GDMDBClustering with η = 0.0005) | 6.367 / 0.285 | 47.208 / 0.936 | 28.479 / 0.285
(Average running times in milliseconds; average clustering errors E ( P , Q ) ¯ as defined in (22).)
Table A5. Statistical Friedman test results [52] for the measurements obtained by all the compared clustering algorithms in Figure A8.
Algorithm | Running time: average / SD / Var | Clustering error: average / SD / Var
Proposed Algo | 6.367 / 3.018 / 9.107 | 0.285 / 0.080 / 0.006
BestDatabaseClustering [22] | 47.208 / 27.537 / 758.313 | 0.936 / 0.066 / 0.004
GDMDBClustering [25] ( η = 0.001 ) | 14.825 / 1.743 / 3.037 | 0.285 / 0.080 / 0.006
Friedman test statistic (p-value): running time 135.707 (3.40e−30); clustering error 150 (2.67e−33).
Table A6. Statistical Friedman test results [52] for the measurements obtained by all the compared clustering algorithms in Figure A9.
Algorithm | Running time: average / SD / Var | Clustering error: average / SD / Var
Proposed Algo | 6.367 / 3.018 / 9.107 | 0.285 / 0.080 / 0.006
BestDatabaseClustering [22] | 47.208 / 27.537 / 758.313 | 0.936 / 0.066 / 0.004
GDMDBClustering [25] ( η = 0.002 ) | 7.305 / 1.766 / 3.118 | 0.290 / 0.086 / 0.007
Friedman test statistic (p-value): running time 121.62 (3.88e−27); clustering error 131 (3.56e−29).
Table A7. Statistical Friedman test results [52] for the measurements obtained by all the compared clustering algorithms in Figure A10.
Algorithm | Running time: average / SD / Var | Clustering error: average / SD / Var
Proposed Algo | 6.367 / 3.018 / 9.107 | 0.285 / 0.080 / 0.006
BestDatabaseClustering [22] | 47.208 / 27.537 / 758.313 | 0.936 / 0.066 / 0.004
GDMDBClustering [25] ( η = 0.0005 ) | 28.479 / 4.655 / 21.669 | 0.285 / 0.080 / 0.006
Friedman test statistic (p-value): running time 118.90 (1.51e−26); clustering error 150 (2.67e−33).
Figure A8. (a): represents the running times corresponding to GDMDBClustering [25] (with a learning rate η = 0.001 ), BestDatabaseClustering [22] and the proposed algorithm obtained when executed on n = 30 , , 120 isotropic Gaussian blobs generated using scikit-learn generator [50]. (b): represents the clustering error graphs (21) due to GDMDBClustering [25], BestDatabaseClustering [22] and the proposed algorithm.
Figure A9. (a): represents the running times corresponding to GDMDBClustering [25] (with a learning rate η = 0.002 ), BestDatabaseClustering [22] and the proposed algorithm obtained when executed on n = 30 , , 120 isotropic Gaussian blobs generated using scikit-learn generator [50]. (b): represents the clustering error graphs (21) due to GDMDBClustering [25], BestDatabaseClustering [22] and the proposed algorithm.
Figure A10. (a): represents the running times corresponding to GDMDBClustering [25] (with a learning rate η = 0.0005 ), BestDatabaseClustering [22] and the proposed algorithm obtained when executed on n = 30 , , 120 isotropic Gaussian blobs generated using scikit-learn generator [50]. (b): represents the clustering error graphs (21) due to GDMDBClustering [25], BestDatabaseClustering [22] and the proposed algorithm.
Table A8. F-measure [60,61], precision [60,61] and recall [60,61] reached by the compared clustering algorithms in [21,22,23], and our proposed algorithm on the datasets shown in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7. Notice that our algorithm gets the best scores for all the datasets.
Dataset | F-measure [60,61]: Proposed / Algo [22] / Algo [23] / Algo [21] | Precision [60,61]: Proposed / Algo [22] / Algo [23] / Algo [21] | Recall [60,61]: Proposed / Algo [22] / Algo [23] / Algo [21]
Figure A1, 7 × 7 [20] | 1 / 1 / 0.44 / 0.8 | 1 / 1 / 0.28 / 0.66 | 1 / 1 / 1 / 1
Figure A2, 12 × 12, Zoo [48] | 1 / 1 / 0.14 / 0.14 | 1 / 1 / 0.075 / 0.075 | 1 / 1 / 1 / 1
Figure A3, 4 × 4, Mushroom [48] | 1 / 0.66 / 0.66 / 1 | 1 / 0.5 / 0.5 / 1 | 1 / 1 / 1 / 1
Figure A4, 6 × 6, Iris [48] | 1 / 1 / 1 / 1 | 1 / 1 / 1 / 1 | 1 / 1 / 1 / 1
Figure A5, 6 × 6, Zoo & Mushroom [48] | 1 / 1 / 1 / 1 | 1 / 1 / 1 / 1 | 1 / 1 / 1 / 1
Figure A6, 4 × 4 [39] | 1 / 0.66 / 0.66 / 1 | 1 / 0.5 / 0.5 / 1 | 1 / 1 / 1 / 1
Figure A7, 10 × 10, T10I4D100K [49] | 1 / 0.16 / 0.16 / 0.16 | 1 / 0.088 / 0.088 / 0.088 | 1 / 1 / 1 / 1
Table A9. Contingency matrix showing the categories in pairing clustered databases.
(Rows: actual clusters of Q ; columns: predicted clusters of P .)
Clustering | Pairs in P | Pairs not in P
Pairs in Q | a := | P a i r s Q ∩ P a i r s P | (true positive) | b := | P a i r s Q \ P a i r s P | (false negative)
Pairs not in Q | c := | P a i r s P \ P a i r s Q | (false positive) | d := pairs in neither clustering (true negative)
Table A10. Pair counting measures used for clustering assessment and comparison.
Precision [60,61] = a / ( a + c ); Recall [60,61] = a / ( a + b ); F-measure [60,61] = 2 a / ( 2 a + b + c ); Rand [62] = ( a + d ) / ( a + b + c + d ); Jaccard [63] = a / ( a + b + c ).
Table A11. Rand index [62] and Jaccard index [63] reached by the clustering algorithms in [21,22,23] and our proposed algorithm on the datasets shown in Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, and Figure A7. Notice that our algorithm gets the best scores for all the datasets.
Dataset | Rand [62]: Proposed / Algo [22] / Algo [23] / Algo [21] | Jaccard [63]: Proposed / Algo [22] / Algo [23] / Algo [21]
Figure A1, 7 × 7 [20] | 1 / 1 / 0.28 / 0.85 | 1 / 1 / 0.28 / 0.66
Figure A2, 12 × 12, Zoo [48] | 1 / 1 / 0.075 / 0.075 | 1 / 1 / 0.075 / 0.075
Figure A3, 4 × 4, Mushroom [48] | 1 / 0.5 / 0.5 / 1 | 1 / 0.5 / 0.5 / 1
Figure A4, 6 × 6, Iris [48] | 1 / 1 / 1 / 1 | 1 / 1 / 1 / 1
Figure A5, 6 × 6, Zoo & Mushroom [48] | 1 / 1 / 1 / 1 | 1 / 1 / 1 / 1
Figure A6, 4 × 4 [39] | 1 / 0.5 / 0.5 / 1 | 1 / 0.5 / 0.5 / 1
Figure A7, 10 × 10, T10I4D100K [49] | 1 / 0.088 / 0.088 / 0.088 | 1 / 0.088 / 0.088 / 0.088

References

  1. Han, J.; Pei, J.; Yin, Y.; Mao, R. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Min. Knowl. Discov. 2004, 8, 53–87. [Google Scholar] [CrossRef]
  2. Ng, A.Y.; Jordan, M.I.; Weiss, Y. On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2001, 849–856. [Google Scholar] [CrossRef]
  3. Johnson, S.C. Hierarchical clustering schemes. Psychometrika 1967, 32, 241–254. [Google Scholar] [CrossRef] [PubMed]
  4. MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, 27 December 1965–7 January 1966; Volume 1, pp. 281–297. [Google Scholar]
  5. Zhang, Y.J.; Liu, Z.Q. Self-splitting competitive learning: A new on-line clustering paradigm. IEEE Trans. Neural Netw. 2002, 13, 369–380. [Google Scholar] [CrossRef]
  6. Yair, E.; Zeger, K.; Gersho, A. Competitive learning and soft competition for vector quantizer design. IEEE Trans. Signal Process. 1992, 40, 294–309. [Google Scholar] [CrossRef]
  7. Hofmann, T.; Buhmann, J.M. Competitive learning algorithms for robust vector quantization. IEEE Trans. Signal Process. 1998, 46, 1665–1675. [Google Scholar] [CrossRef] [Green Version]
  8. Kohonen, T. Self-Organizing Maps; Springer Science & Business Media: Berlin/Heidelberg, Germany; New York, NY, USA, 2012; Volume 30. [Google Scholar]
  9. Pal, N.R.; Bezdek, J.C.; Tsao, E.K. Generalized clustering networks and Kohonen’s self-organizing scheme. IEEE Trans. Neural Netw. 1993, 4, 549–557. [Google Scholar] [CrossRef]
  10. Mao, J.; Jain, A.K. A self-organizing network for hyperellipsoidal clustering (HEC). Trans. Neural Netw. 1996, 7, 16–29. [Google Scholar]
  11. Anderberg, M.R. Cluster Analysis for Applications: Probability and Mathematical Statistics: A Series of Monographs and Textbooks; Academic Press: Cambridge, MA, USA, 2014; Volume 19. [Google Scholar]
  12. Aggarwal, C.C.; Reddy, C.K. Data clustering. Algorithms and Application; CRC Press: Boca Raton, FL, USA, 2014. [Google Scholar]
  13. Wang, C.D.; Lai, J.H.; Philip, S.Y. NEIWalk: Community discovery in dynamic content-based networks. IEEE Trans. Knowl. Data Eng. 2013, 26, 1734–1748. [Google Scholar] [CrossRef]
  14. Wang, Z.; Zhang, D.; Zhou, X.; Yang, D.; Yu, Z.; Yu, Z. Discovering and profiling overlapping communities in location-based social networks. IEEE Trans. Syst. Man Cybern. Syst. 2013, 44, 499–509. [Google Scholar] [CrossRef] [Green Version]
  15. Huang, D.; Lai, J.H.; Wang, C.D.; Yuen, P.C. Ensembling over-segmentations: From weak evidence to strong segmentation. Neurocomputing 2016, 207, 416–427. [Google Scholar] [CrossRef]
  16. Shi, J.; Malik, J. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 888–905. [Google Scholar]
  17. Zhao, Q.; Wang, C.; Wang, P.; Zhou, M.; Jiang, C. A novel method on information recommendation via hybrid similarity. IEEE Trans. Syst. Man Cybern. Syst. 2016, 48, 448–459. [Google Scholar] [CrossRef]
  18. Symeonidis, P. ClustHOSVD: Item recommendation by combining semantically enhanced tag clustering with tensor HOSVD. IEEE Trans. Syst. Man Cybern. Syst. 2015, 46, 1240–1251. [Google Scholar] [CrossRef]
  19. Rafailidis, D.; Daras, P. The TFC model: Tensor factorization and tag clustering for item recommendation in social tagging systems. IEEE Trans. Syst. Man Cybern. Syst. 2012, 43, 673–688. [Google Scholar] [CrossRef]
  20. Adhikari, A.; Adhikari, J. Clustering Multiple Databases Induced by Local Patterns. In Advances in Knowledge Discovery in Batabases; Springer: Cham, Switzerland, 2015; pp. 305–332. [Google Scholar]
  21. Liu, Y.; Yuan, D.; Cuan, Y. Completely Clustering for Multi-databases Mining. J. Comput. Inf. Syst. 2013, 9, 6595–6602. [Google Scholar]
  22. Miloudi, S.; Hebri, S.A.R.; Khiat, S. Contribution to Improve Database Classification Algorithms for Multi-Database Mining. J. Inf. Proces. Syst. 2018, 14, 709–726. [Google Scholar]
  23. Tang, H.; Mei, Z. A Simple Methodology for Database Clustering. In Proceedings of the 5th International Conference on Computer Engineering and Networks, SISSA Medialab, Shanghai, China, 12–13 September 2015; Volume 259, p. 19. [Google Scholar]
  24. Wang, R.; Ji, W.; Liu, M.; Wang, X.; Weng, J.; Deng, S.; Gao, S.; Yuan, C.A. Review on mining data from multiple data sources. Pattern Recognit. Lett. 2018, 109, 120–128. [Google Scholar] [CrossRef]
  25. Miloudi, S.; Wang, Y.; Ding, W. A Gradient-Based Clustering for Multi-Database Mining. IEEE Access 2021, 9, 11144–11172. [Google Scholar] [CrossRef]
  26. Miloudi, S.; Wang, Y.; Ding, W. An Optimized Graph-based Clustering for Multi-database Mining. In Proceedings of the 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020; pp. 807–812. [Google Scholar] [CrossRef]
  27. Zhang, S.; Zaki, M.J. Mining Multiple Data Sources: Local Pattern Analysis. Data Min. Knowl. Discov. 2006, 12, 121–125. [Google Scholar] [CrossRef] [Green Version]
  28. Adhikari, A.; Rao, P.R. Synthesizing heavy association rules from different real data sources. Pattern Recognit. Lett. 2008, 29, 59–71. [Google Scholar] [CrossRef]
  29. Adhikari, A.; Adhikari, J. Advances in Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2015. [Google Scholar]
  30. Adhikari, A.; Jain, L.C.; Prasad, B. A State-of-the-Art Review of Knowledge Discovery in Multiple Databases. J. Intell. Syst. 2017, 26, 23–34. [Google Scholar] [CrossRef] [Green Version]
  31. Zhang, S.; Zhang, C.; Wu, X. Identifying Exceptional Patterns. Knowl. Discov. Multiple Datab. 2004, 185–195. [Google Scholar]
  32. Zhang, S.; Zhang, C.; Wu, X. Identifying High-vote Patterns. Knowl. Discov. Multiple Datab. 2004, 157–183. [Google Scholar]
  33. Ramkumar, T.; Srinivasan, R. Modified algorithms for synthesizing high-frequency rules from different data sources. Knowl. Inf. Syst. 2008, 17, 313–334. [Google Scholar] [CrossRef]
  34. Djenouri, Y.; Lin, J.C.W.; Nørvåg, K.; Ramampiaro, H. Highly efficient pattern mining based on transaction decomposition. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1646–1649. [Google Scholar]
  35. Savasere, A.; Omiecinski, E.R.; Navathe, S.B. An Efficient Algorithm for Mining Association Rules in Large Databases; Technical Report GIT-CC-95-04; Georgia Institute of Technology: Zurich, Switzerland, 1995. [Google Scholar]
  36. Zhang, S.; Wu, X. Large scale data mining based on data partitioning. Appl. Artif. Intel. 2001, 15, 129–139. [Google Scholar] [CrossRef]
  37. Zhang, C.; Liu, M.; Nie, W.; Zhang, S. Identifying Global Exceptional Patterns in Multi-database Mining. IEEE Intell. Inform. Bull. 2004, 3, 19–24. [Google Scholar]
  38. Zhang, S.; Zhang, C.; Yu, J.X. An efficient strategy for mining exceptions in multi-databases. Inf. Sci. 2004, 165, 1–20. [Google Scholar] [CrossRef]
  39. Wu, X.; Zhang, C.; Zhang, S. Database classification for multi-database mining. Inf. Syst. 2005, 30, 71–88. [Google Scholar] [CrossRef]
  40. Li, H.; Hu, X.; Zhang, Y. An Improved Database Classification Algorithm for Multi-database Mining. In Frontiers in Algorithmics; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2009; pp. 346–357. [Google Scholar]
  41. Na, S.; Xumin, L.; Yong, G. Research on k-means clustering algorithm: An improved k-means clustering algorithm. In Proceedings of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Jian, China, 2–4 April 2010; pp. 63–67. [Google Scholar]
  42. Selim, S.Z.; Ismail, M.A. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intel. 1984, 81–87. [Google Scholar] [CrossRef] [PubMed]
  43. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef] [Green Version]
  44. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 344. [Google Scholar]
  45. De Luca, A.; Termini, S. A Definition of a Nonprobabilistic Entropy in the Setting of Fuzzy Sets Theory. In Readings in Fuzzy Sets for Intelligent Systems; Dubois, D., Prade, H., Yager, R.R., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 1993; pp. 197–202. [Google Scholar] [CrossRef]
  46. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  47. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Data structures for disjoint sets. In Introduction to Algorithms; MIT Press: Cambridge, MA, USA, 2009; pp. 498–524. [Google Scholar]
  48. Center for Machine Learning and Intelligent Systems. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/datasets/ (accessed on 10 October 2020).
  49. IBM Almaden Quest Research Group. Frequent Itemset Mining Dataset Repository. Available online: http://fimi.ua.ac.be/data/. (accessed on 10 October 2020).
  50. Thirion, B.; Varoquaux, G.; Gramfort, A.; Michel, V.; Grisel, O.; Louppe, G.; Nothman, J. Scikit-Learn: Sklearn.datasets.make_blobs. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html (accessed on 10 October 2020).
  51. Gramfort, A.; Blondel, M.; Grisel, O.; Mueller, A.; Martin, E.; Patrini, G.; Chang, E. Scikit-Learn: Sklearn.preprocessing.MinMaxScaler. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html (accessed on 10 October 2020).
  52. Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92. [Google Scholar] [CrossRef]
  53. Meilǎ, M. Comparing clusterings: An axiomatic view. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 577–584. [Google Scholar]
  54. Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
  55. Günnemann, S.; Färber, I.; Müller, E.; Assent, I.; Seidl, T. External evaluation measures for subspace clustering. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Scotland, UK, 24–28 October 2011; pp. 1363–1372. [Google Scholar]
  56. Banerjee, A.; Krumpelman, C.; Ghosh, J.; Basu, S.; Mooney, R.J. Model-based overlapping clustering. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, Chicago, IL, USA, 21–24 August 2005; pp. 532–537. [Google Scholar]
  57. Pfitzner, D.; Leibbrandt, R.; Powers, D. Characterization and evaluation of similarity measures for pairs of clusterings. Knowl. Inf. Syst. 2009, 19, 361–394. [Google Scholar] [CrossRef]
  58. Achtert, E.; Goldhofer, S.; Kriegel, H.P.; Schubert, E.; Zimek, A. Evaluation of clusterings–metrics and visual support. In Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, Arlington, VA, USA, 1–5 April 2012; pp. 1285–1288. [Google Scholar]
  59. Shafiei, M.; Milios, E. Model-based overlapping co-clustering. In Proceedings of the SIAM Conference on Data Mining, Bethesda, MD, USA, 20–22 April 2006. [Google Scholar]
  60. Chinchor, N. MUC-4 evaluation metrics. In Proceedings of the of the Fourth Message Understanding Conference, McLean, VA, USA, 16–18 June 1992. [Google Scholar]
  61. Mei, Q.; Radev, D. Information retrieval. In The Oxford Handbook of Computational Linguistics, 2nd ed.; Oxford University Press: New York, NY, USA, 1979. [Google Scholar]
  62. Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850. [Google Scholar] [CrossRef]
  63. Jaccard, P. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bull. Soc. Vaudoise Sci. Nat. 1901, 37, 241–272. [Google Scholar]
Figure 1. (a): represents (in green) the graph of the piecewise linear activation function g ( · ) and (in red) its partial derivative. We note that z p , q = θ p , q × x p , q , and θ p , q is the weight associated with the similarity value x p , q = s i m ( D p , D q ) , sgn : R { 1 , 1 } is the signum function and ϵ is a small number ( 1 e 7 ) ensuring that g ( z p , q , ϵ ) is always above 0 and below 1. (b): represents the binary entropy function H : ( 0 , 1 ) ( 0 , 1 ] in blue and its partial derivative in orange.
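As a small illustration of the quantities plotted in Figure 1, the sketch below implements the binary entropy H and its derivative, together with one possible reading of the clipped piecewise linear activation g ( · ) ; the exact form of g used in the paper is defined in the main text and may differ from this sketch.

```python
import numpy as np

EPS = 1e-7  # keeps g strictly inside (0, 1) so H and its derivative stay finite

def g(z, eps=EPS):
    """Piecewise linear activation (our reading of Figure 1a:
    z clipped into the open interval (eps, 1 - eps))."""
    return np.clip(z, eps, 1.0 - eps)

def H(x):
    """Binary entropy H: (0, 1) -> (0, 1] plotted in Figure 1b."""
    return -x * np.log2(x) - (1.0 - x) * np.log2(1.0 - x)

def dH(x):
    """Derivative of the binary entropy, log2((1 - x) / x)."""
    return np.log2((1.0 - x) / x)
```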
Figure 2. Proposed fuzziness reduction model on the ( n 2 n ) / 2 pairwise similarities x p , q = s i m ( D p , D q ) , p = 0 , , n 2 , q = p + 1 , , n 1 . We note that the graphs corresponding to the activation function g ( · ) and the binary entropy function H ( · ) are plotted in Figure 1.
Figure 3. A simplified 3D plot of our proposed loss function L ( θ ) as defined in (14), where θ = [ θ 1 , θ 2 ] for visualization purposes. P 1 , P 2 , P 3 , P 4 , A , B , C are some selected 3D points at which L ( θ ) is evaluated. From P 1 all the way down to P 4 , we can clearly see that L ( θ ) decreases monotonically when the coordinate variables θ 1 and θ 2 increase their values. That is, ( θ 1 ( i ) , θ 2 ( i ) , θ 1 ( i 1 ) , θ 2 ( i 1 ) ) R 4 | θ 1 ( i ) θ 1 ( i 1 ) θ 2 ( i ) θ 2 ( i 1 ) , L ( θ 1 ( i ) , θ 2 ( i ) ) L ( θ 1 ( i 1 ) , θ 2 ( i 1 ) ) , where i is an integer representing the current iteration in our algorithm.
Figure 4. The coordinate descent-based clustering model depicted in eleven steps.
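The sketch below illustrates only the coordinate-selection step of Figure 4: a max-heap (emulated with Python's heapq on negated keys) always returns the largest remaining weight θ_{p,q}, so the database pair it indexes becomes the coordinate processed next. The clustering bookkeeping carried out across the eleven steps is omitted here.

```python
import heapq

def coordinate_order(theta):
    """Yield pairwise coordinates (p, q) in decreasing order of their weight theta[(p, q)].
    heapq is a min-heap, so keys are negated to emulate the max-heap of Figure 4."""
    heap = [(-w, pq) for pq, w in theta.items()]
    heapq.heapify(heap)
    while heap:                              # at most (n^2 - n)/2 iterations
        neg_w, (p, q) = heapq.heappop(heap)
        yield (p, q), -neg_w

# Usage with illustrative weights for 4 databases (6 pairs):
theta = {(0, 1): 0.63, (0, 2): 0.64, (0, 3): 0.59, (1, 2): 0.71, (1, 3): 0.71, (2, 3): 0.77}
for (p, q), w in coordinate_order(theta):
    print(f"process edge (D{p + 1}, D{q + 1}) with weight {w}")
    # ... update the candidate clustering and evaluate L(theta) here (omitted) ...
```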
Figure 5. (a) A 5 × 5 similarity matrix between five transactional databases before applying our fuzziness reduction model. (b) The plots of goodness(D) [20], the silhouette coefficient [43] and the number of clusters. (c) The optimal graph obtained at max goodness(D) [20].
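As a side note on panel (b), the silhouette coefficient [43] can be reproduced with scikit-learn from a similarity matrix by first converting it to a distance matrix (1 − sim); the matrix and labels below are placeholders rather than the actual values of Figure 5a.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Placeholder 5x5 similarity matrix (symmetric, ones on the diagonal) and a candidate clustering.
sim = np.array([
    [1.00, 0.80, 0.75, 0.10, 0.15],
    [0.80, 1.00, 0.70, 0.12, 0.11],
    [0.75, 0.70, 1.00, 0.09, 0.14],
    [0.10, 0.12, 0.09, 1.00, 0.85],
    [0.15, 0.11, 0.14, 0.85, 1.00],
])
labels = [0, 0, 0, 1, 1]             # candidate clustering {D1, D2, D3}, {D4, D5}

dist = 1.0 - sim                     # silhouette expects dissimilarities
np.fill_diagonal(dist, 0.0)
print(silhouette_score(dist, labels, metric="precomputed"))
```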
Figure 6. (a) A similarity matrix between four databases partitioned from the Mushroom dataset [48]. We note that (a) is built by calling sim (3) on the frequent itemsets (FIs) mined from D_p (p = 1, …, 4) under a threshold α = 0.5. (b) The plots of goodness(D) [20], the silhouette coefficient [43] and the number of clusters. (c) The optimal graph obtained at max goodness(D) [20].
Figure 7. (a) The 5 × 5 similarity matrix obtained after applying our fuzziness reduction model to Figure 5a. (b) The plots of goodness(D) [20], the silhouette coefficient [43] and the number of clusters. (c) The optimal graph obtained at max goodness(D) [20].
Figure 8. (a) The similarity matrix obtained after applying our fuzziness reduction model to Figure 6a. (b) The plots of goodness(D) [20], the silhouette coefficient [43] and the number of clusters. (c) The optimal graph obtained at max goodness(D) [20].
Table 1. Six transactional databases D_p, for p = 1, …, 6.

Transactional Database (D_p) | Transactions/Rows
D1 | (A, C), (A, B, C), (B, C), (A, B, C, D)
D2 | (A, B, C), (B, C), (A, B), (A, C), (A, B, D)
D3 | (B, C), (A, D), (B, C, D), (A, B, C)
D4 | (E, F, H), (F, H), (F, G, H, I, J)
D5 | (E, J), (F, H, J), (E, F, H, J), (F, H)
D6 | (E, I), (E, F, H), (F, H, I, J), (E, H, J)
Table 2. The frequent itemsets (FIs) discovered from each transactional database in Table 1 under a threshold α = 0.5.

Transactional Database (D_p) | Frequent Itemsets FIS(D_p, α)
D1 | (AC, 0.75), (AB, 0.5), (ABC, 0.5), (BC, 0.75), (C, 1.0), (B, 0.75), (A, 0.75)
D2 | (AB, 0.6), (C, 0.6), (B, 0.8), (A, 0.8)
D3 | (BC, 0.75), (D, 0.5), (C, 0.75), (B, 0.75), (A, 0.5)
D4 | (H, 1.0), (F, 1.0), (FH, 1.0)
D5 | (E, 0.5), (EJ, 0.5), (J, 0.75), (HJ, 0.5), (FHJ, 0.5), (FJ, 0.5), (H, 0.75), (FH, 0.75), (F, 0.75)
D6 | (I, 0.5), (J, 0.5), (HJ, 0.5), (F, 0.5), (FH, 0.5), (E, 0.75), (EH, 0.5), (H, 0.75)
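The frequent itemsets of Table 2 can be reproduced by brute-force support counting over the transactions of Table 1. The short sketch below does so for D1; the helper name mine_frequent_itemsets is ours and not from the paper, and an Apriori-style miner would be preferable for larger databases.

```python
from itertools import combinations

def mine_frequent_itemsets(transactions, alpha=0.5):
    """Return every itemset whose support (fraction of transactions containing it) is >= alpha."""
    items = sorted({item for t in transactions for item in t})
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            support = sum(set(itemset) <= set(t) for t in transactions) / n
            if support >= alpha:
                frequent["".join(itemset)] = support
    return frequent

# D1 from Table 1:
D1 = [("A", "C"), ("A", "B", "C"), ("B", "C"), ("A", "B", "C", "D")]
print(mine_frequent_itemsets(D1, alpha=0.5))
# {'A': 0.75, 'B': 0.75, 'C': 1.0, 'AB': 0.5, 'AC': 0.75, 'BC': 0.75, 'ABC': 0.5}
```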
Table 3. A summary of the clustering quality measures mentioned in this paper.

[20] (optimal value: max goodness(D)):
goodness(D) = B(D) + W(D) − f(D), where
B(D) = Σ_{C_t, C_v ∈ C; t < v} Σ_{D_p ∈ C_t, D_q ∈ C_v; p < q} (1 − sim(D_p, D_q)),
W(D) = Σ_{C_t ∈ C} Σ_{D_p, D_q ∈ C_t; p < q} sim(D_p, D_q) × 1{(D_p, D_q) ∈ E},
and f(D) is the number of clusters.

[23] (optimal value: min goodness_2(D)):
goodness_2(D) = sum-dist(D) / ((n² − n)/2) + coupling(D) / ((n² − n)/2) + (f(D) − 1) / (n − 1), where
sum-dist(D) = Σ_{C_t ∈ C} Σ_{D_p, D_q ∈ C_t; p < q} (1 − sim(D_p, D_q)) × 1{(D_p, D_q) ∈ E},
coupling(D) = Σ_{C_t, C_v ∈ C; t < v} Σ_{D_p ∈ C_t, D_q ∈ C_v; p < q} sim(D_p, D_q).

[21] (optimal value: max goodness_3(D)):
goodness_3(D) = intra-sim(D) + inter-dist(D) − f(D), where
intra-sim(D) = (1/f(D)) Σ_{C_t ∈ C} [ 1 if |C_t| = 1; ( Σ_{D_p, D_q ∈ C_t} sim(D_p, D_q) × 1{(D_p, D_q) ∈ E} ) / ((|C_t|² − |C_t|)/2) if |C_t| > 1 ],
inter-dist(D) = 0 if f(D) = 1; Σ_{C_t, C_v ∈ C} ( 2 × Σ_{D_p ∈ C_t, D_q ∈ C_v; p < q} (1 − sim(D_p, D_q)) ) / ( |C_t| × |C_v| × (f(D)² − f(D)) ) if f(D) > 1.

[43,44] (optimal value: max SC(D)):
SC(D) = (1/n) Σ_{p=0}^{n−1} s(D_p), where
s(D_p) = (b(D_p) − a(D_p)) / max{a(D_p), b(D_p)} if |C_p| > 1, and s(D_p) = 0 if |C_p| = 1,
a(D_p) = ( Σ_{D_p, D_q ∈ C_p; p < q} (1 − sim(D_p, D_q)) × 1{(D_p, D_q) ∈ E} ) / (|C_p| − 1),
b(D_p) = min_{C_q : D_p ∉ C_q} (1/|C_q|) Σ_{D_q ∈ C_q} (1 − sim(D_p, D_q)).
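To make the first row of Table 3 concrete, the sketch below evaluates goodness(D) = B(D) + W(D) − f(D) for a candidate partition of a similarity matrix. The matrix values are placeholders, and the edge set E is taken to be all intra-cluster pairs, which is a simplifying assumption rather than the paper's exact graph construction.

```python
from itertools import combinations

def goodness(sim, clusters):
    """goodness(D) = B(D) + W(D) - f(D), following the first row of Table 3.
    sim: symmetric matrix of pairwise similarities; clusters: list of lists of database indices."""
    # W(D): total similarity over intra-cluster pairs (E assumed to contain every such pair)
    W = sum(sim[p][q] for c in clusters for p, q in combinations(sorted(c), 2))
    # B(D): total dissimilarity (1 - sim) over inter-cluster pairs
    B = sum(1.0 - sim[p][q]
            for ct, cv in combinations(clusters, 2)
            for p in ct for q in cv)
    return B + W - len(clusters)  # f(D) = number of clusters

# Placeholder 4x4 similarity matrix with two obvious groups {D1, D2} and {D3, D4}:
sim = [[1.00, 0.90, 0.10, 0.20],
       [0.90, 1.00, 0.15, 0.10],
       [0.10, 0.15, 1.00, 0.80],
       [0.20, 0.10, 0.80, 1.00]]
print(goodness(sim, [[0, 1], [2, 3]]))      # B = 3.45, W = 1.70, f = 2 -> 3.15
print(goodness(sim, [[0], [1], [2], [3]]))  # B = 3.75, W = 0.00, f = 4 -> -0.25
```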
Table 4. Clustering the three databases D1, D2 and D3 under the similarity measure sim_i [20] against our proposed measure sim (3).

Output | Clustering 1 under sim_i [20] | Clustering 2 under sim (3)
Clusters | {D1}, {D2, D3} | {D1, D2}, {D3}
Intra-cluster similarity | 0.6 | 0.75
Inter-cluster distance | 1.6 | 1.75
goodness measure [20] | 0.2 | 0.5
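As a quick arithmetic check of Table 4 under goodness(D) = B(D) + W(D) − f(D): Clustering 1 gives 1.6 + 0.6 − 2 = 0.2, while Clustering 2 gives 1.75 + 0.75 − 2 = 0.5, which is why the partition obtained under sim (3) is preferred.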
Table 5. Itemsets synthesized from C_{2,3} = {D2, D3} discovered under sim_i [20] against the itemsets synthesized from C_{1,2} = {D1, D2} discovered under sim (3).

Synthesized Itemset I_k | supp(I_k, C_{2,3}) under sim_i [20] | supp(I_k, C_{1,2}) under sim (3)
A | 0.12 < α_{2,3} = 0.19 | 0.2 > α_{1,2} = 0.17
B | 0.12 < α_{2,3} = 0.19 | 0.2 > α_{1,2} = 0.17
C | 0.12 < α_{2,3} = 0.19 | 0.2 > α_{1,2} = 0.17
E | 0.9 > α_{2,3} = 0.19 | 0.54 > α_{1,2} = 0.17
Table 6. A summary of the results obtained in Figure 5, Figure 6, Figure 7 and Figure 8. We note that δ_opt is the optimal similarity level at which goodness(D) [20] attains its maximum value, and θ^T is the optimal weight vector learned after a number of epochs.

Similarity matrix of Figure 5a: fuzziness index (9) = 0.97; θ^T = [1, 1, …, 1] (without fuzziness reduction); max goodness(D) [20] = 4.19; δ_opt = 0.46; SC(D) [43,44] at δ_opt = −1; optimal clustering at δ_opt: {D1, D2, D3, D4, D5}.
Similarity matrix of Figure 6a: fuzziness index (9) = 0.95; θ^T = [1, 1, …, 1] (without fuzziness reduction); max goodness(D) [20] = 1.29; δ_opt = 0.313; SC(D) [43,44] at δ_opt = −1; optimal clustering at δ_opt: {D1, D2, D3, D4}.
Similarity matrix of Figure 7a: fuzziness index (9) = 0.74; θ^T = [1.30, 0.52, 0.71, 0.71, 0.52, 0.71, 0.71, 0.52, 0.52, 1.44], epochs = 300, η = 0.1; max goodness(D) [20] = 4.54; δ_opt = 0.95; SC(D) [43,44] at δ_opt = 0.73; optimal clustering at δ_opt: {D1, D2}, {D3}, {D4, D5}.
Similarity matrix of Figure 8a: fuzziness index (9) = 0.81; θ^T = [0.63, 0.638, 0.591, 0.712, 0.712, 0.77], epochs = 100, η = 0.1; max goodness(D) [20] = 1.27; δ_opt = 0.292; SC(D) [43,44] at δ_opt = 0.08; optimal clustering at δ_opt: {D4, D3, D2}, {D1}.
Table 7. A brief summary of the random blobs generated via scikit-learn [50].

Number of Random Blobs (n) | Number of Centers (n/2) | Number of Attributes (m)
30 | 15 | random.randint(2, 10)
60 | 30 | random.randint(2, 10)
120 | 60 | random.randint(2, 10)
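For reference, the random blobs summarized in Table 7 can be generated with scikit-learn's make_blobs following the table's settings (n samples, n/2 centers, and a number of attributes drawn with random.randint(2, 10)); the fixed seeds below are our own addition for reproducibility and are not part of the original experimental setup.

```python
import random
from sklearn.datasets import make_blobs

random.seed(0)  # seed added here only for reproducibility

for n in (30, 60, 120):                 # number of random blobs, as in Table 7
    m = random.randint(2, 10)           # number of attributes
    X, y = make_blobs(n_samples=n, centers=n // 2, n_features=m, random_state=0)
    print(n, n // 2, m, X.shape)        # e.g., 30 15 m (30, m)
```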
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
