Article

Determining the Quality of a Dataset in Clustering Terms

1 Faculty of Electrical Engineering and Computer Science, Lublin University of Technology, 20-618 Lublin, Poland
2 Faculty of Technology Fundamentals, Lublin University of Technology, 20-618 Lublin, Poland
3 Faculty of Management, Lublin University of Technology, 20-618 Lublin, Poland
4 Netrix S.A. Research and Development Center, 20-704 Lublin, Poland
5 Faculty of Computer Science, WSEI University, 20-209 Lublin, Poland
* Authors to whom correspondence should be addressed.
Appl. Sci. 2023, 13(5), 2942; https://doi.org/10.3390/app13052942
Submission received: 20 December 2022 / Revised: 9 February 2023 / Accepted: 23 February 2023 / Published: 24 February 2023
(This article belongs to the Special Issue Data Science, Statistics and Visualization)

Abstract

The purpose of the theoretical considerations and research conducted was to indicate the instruments with which the quality of a dataset can be verified for the segmentation of observations occurring in the dataset. The paper proposes a novel way to deal with mixed datasets containing categorical and continuous attributes in a customer segmentation task. The categorical variables were embedded using an innovative unsupervised model based on an autoencoder. The customers were then divided into groups using different clustering algorithms, based on similarity matrices. In addition to the classic k-means method and the more modern DBSCAN, three graph algorithms were used: the Louvain algorithm, the greedy algorithm and the label propagation algorithm. The research was conducted on two datasets: one containing retail customers and the other containing wholesale customers. The Calinski–Harabasz index, the Davies–Bouldin index, the NMI index, the Fowlkes–Mallows index and the silhouette score were used to assess the quality of the clustering. It was noted that the modularity parameter for graph methods was a good indicator of whether a given set could be meaningfully divided into groups.

1. Introduction

Customer segmentation is the process of dividing the entire market into smaller customer groups. It enables one to understand different customers’ needs and behaviours and to adapt the marketing approach or product recommendations to them [1,2,3,4]. Since in most cases we do not know in advance which group a given customer belongs to, we are dealing here mainly with unsupervised learning. The datasets whose clustering quality we determine have no predefined labels against which we could test the accuracy of a prediction. We therefore need to focus on methods that do not require the number of clusters to be fixed in advance, or on metrics that allow us to determine the optimal number of clusters. Because our methodology does not use supervised learning, it is easier to apply in practice, where most use cases are unsupervised. Clustering is one of the most common tasks in the field of unsupervised learning. It is a technique for grouping objects into clusters: objects within a group are as similar as possible to each other and have little or no similarity to objects in other groups. Clustering mainly deals with finding a structure or pattern in a collection of uncategorised data.
For the purposes given above, this paper focuses on unsupervised clustering with no fixed number of groups. For such clustering, hierarchical algorithms are widely used; they are intuitive and easy to implement, hence their popularity. To determine the number of clusters, one examines the obtained dendrogram.
One of our goals is to overcome the limitations of hierarchical clustering [5]: such methods do not work well on vast quantities of data, have difficulties handling clusters of different sizes, tend to break the data into large clusters, depend on the order of the data, and, most importantly, are sensitive to noise and outliers. Because of these drawbacks, we decided to refrain from using hierarchical clustering and focused instead on graph methods and the spatial DBSCAN algorithm, which was designed from the outset to deal with noise and outliers.
Our motivation was also to tackle the problem of mixed datasets, meaning datasets containing both numerical and categorical variables. For the latter, the computation of the similarity or distance between observations is impossible, because they are not in numerical forms. Basic methods of dealing with categorical variables include one-hot encoding or label encoding. However, label encoding is more suitable for ordered categorical data, and one-hot encoding has the disadvantages of generating high-dimensional data and disregarding eventual similarities between the levels of a variable.
We intend to cope with the problem of categorical values by creating meaningful embeddings for them, where by an embedding we mean a dense numerical vector. Currently, embeddings are mostly computed using neural networks. However, neural embeddings are typically learned in supervised tasks, which require a target variable to check whether the embeddings are correct. The idea behind removing the need for an explicit target variable is to use an autoencoder model, which computes a loss by comparing the reconstruction produced by the model with the original input data. The methodology that we present contains a modification of the recent AutoEmbedder model [6].
Many articles have been dedicated to the concept of customer segmentation. The majority use classical statistical clustering methods, with the most common one, k-means, at the forefront. However, there are also other methods such as conjoint analysis [7], random forests [8], principal component analysis (PCA) [9], logistic regression [10] or even artificial neural networks [11]. For an extensive overview of the methods, see [12,13].
Section 2 of the article describes the datasets examined, followed by a detailed presentation of the methods used. The applicability of the algorithms based on graph theory against the classical k-means method and the more modern DBSCAN algorithm is presented. Section 3 provides a detailed description of the results obtained. Section 4 includes a summary discussion, whereas Section 5 is dedicated to conclusions.

2. Materials and Methods

2.1. Materials

Two datasets were selected for evaluation. The first contained data on retail customers and was obtained from [14]. It is a set that lends itself well to clustering. In comparison, the second set, containing wholesale customer data from [15], is one in which clustering makes no substantive sense.

2.1.1. Set Containing Data on Retail Customers

The dataset consists of 2240 customers. These customers are described by attributes such as the customer’s ID, birth year, education level, marital status, annual household income, the numbers of children and teenagers in the household, the number of days since the last purchase, whether the customer has complained, the amount of money spent on wine, fruit, meat, fish, candy and gold in the last 2 years, the number of purchases made at a discount, the use of six different campaign offers, the numbers of purchases made through the website, using a catalogue or directly in the store, and the number of visits to the website in the last month. The majority of the dataset consists of married and highly educated customers.

2.1.2. Set Containing Data on Wholesale Customers

The set contains information on 701 business resellers, characterized by the reseller’s ID, business type, number of employees in the company, frequency of orders (defined by categories), the month of ordering, year of first ordering, year of most recent ordering, most frequently ordered product line, annual sales, type and amount of minimum payment, annual turnover, country name, city name, state or province name, the ratio of the sales value to the number of products ordered in the categories of bicycles, clothing, accessories, the ratio of the sales value to the number of products ordered in the various categories of promotions and without promotions, the average number of days to purchase a new product after it enters the market, the minimum number of days to purchase a new product after it enters the market, the maximum number of days to purchase a new product after it enters the market, the average delivery cost, the standard deviation of the delivery cost, the minimum delivery cost and the maximum delivery cost. Based on the City and StateProvinceName variables, information about the city’s population and the region’s GDP per capita was added to the set.

2.2. Methodology

This subsection gives a detailed description of the methods used in the remainder of the paper. Specifically, cluster analysis is outlined, and the classical k-means method, the DBSCAN method and three graph clustering algorithms (multilevel, greedy, label propagation) are described. Furthermore, methods for determining the optimal number of clusters and tools for assessing the quality of the resulting clustering are presented.

2.2.1. Cluster Analysis

Cluster analysis aims to divide a dataset
$$O = \{ x_i = (x_{i1}, \ldots, x_{in}) : i = 1, \ldots, N \},$$
composed of n-dimensional feature vectors $x_i$, into clusters (subsets, groups, classes) with respect to a certain criterion [16]. The division must meet two assumptions:
  • Homogeneity within the groups (elements forming the group should be as similar as possible);
  • Heterogeneity between clusters (elements from different clusters should be as dissimilar as possible).
We obtain a division of the set O consisting of N elements into k clusters and a matrix $B_{k \times N}$, whose elements $b_{ic}$ are the degrees of membership of the element $x_c$ in the group $G_i$. In classical cluster analysis, a hard split is used, in which each element $x_i$ belongs to exactly one group. With respect to the clustering approach, cluster analysis algorithms are most commonly divided into hierarchical and nonhierarchical.

2.2.2. k-Means Clustering

The k-means algorithm is one of the most widely used clustering algorithms [16]. It clusters the data by attempting to separate the observations into k groups to minimize the sum of squared distances between each point and its nearest cluster centre. Cluster centres are called centroids.
The k-means algorithm can be described in four steps:
1. Choose k initial centroids: $C = \{c_1, c_2, \ldots, c_k\}$.
2. For each $i \in \{1, \ldots, k\}$, form the cluster $D_i$ as the subset of all points that are closest to the centroid $c_i$.
3. For each $i \in \{1, \ldots, k\}$, set a new centroid $c_i$ as the mean of all points in cluster $D_i$:
$$c_i = \frac{1}{|D_i|} \sum_{x \in D_i} x.$$
4. Repeat steps 2 and 3 until the set C no longer changes.
The result of the method depends on the initialization of the centroids. The most commonly used distance measures are:
  • The Euclidean distance
    $$d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2};$$
  • The Manhattan distance
    $$d(x, y) = \sum_{i=1}^{k} |x_i - y_i|;$$
  • The Chebyshev distance
    $$d(x, y) = \max_{i = 1, \ldots, k} |x_i - y_i|;$$
  • The Minkowski distance
    $$d(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^p \right)^{1/p}.$$
Although the k-means algorithm is widely used, one must remember that it actually always partitions the data, regardless of its properties. One can use the k-means algorithm to partition uniform noise into k clusters in the sense of minimizing the Euclidean or other distances, without having a requirement for the clusters to be meaningful.
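A minimal sketch of this behaviour is given below, assuming Python with scikit-learn (the paper does not specify its implementation; data and parameter values are illustrative). Note that the algorithm happily partitions pure noise into k clusters, echoing the caveat above.

```python
# Minimal k-means sketch (scikit-learn assumed); data and parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))          # 300 observations, 8 numerical features (pure noise)

# n_init restarts the centroid initialization several times and keeps the best run,
# which mitigates the dependence on the initial centroids mentioned above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_                # cluster assignment of each observation
centroids = kmeans.cluster_centers_    # the k centroids c_i
inertia = kmeans.inertia_              # within-cluster sum of squared distances
print(labels[:10], inertia)
```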

2.2.3. DBSCAN Clustering

The DBSCAN (density-based spatial clustering of applications with noise) algorithm [17] was designed to extract clusters and noise in a spatial dataset. The algorithm has two parameters: Eps and MinPts. Eps defines the distance from a given point p such that each point q for which d(p, q) < Eps belongs to the neighbourhood of p, denoted by $N_{Eps}(p)$; MinPts determines the minimum number of points that must be in a cluster.
In a cluster, one can distinguish between two types of point: core points within the cluster and border points at the edge of the cluster. Two concepts are introduced regarding points in the cluster: directly density-reachable and density-reachable.
The point p is directly density-reachable from point q for given Eps and MinPts if
1. $p \in N_{Eps}(q)$;
2. $|N_{Eps}(q)| \geq MinPts$ (core point condition).
Direct density reachability is symmetric for pairs of core points, but not for mixed pairs of core and border points. The point p is density-reachable from point q for given Eps and MinPts if there is a chain of points $p_1, p_2, \ldots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$. This relation is transitive but not symmetric. Two border points of the same cluster may not be density-reachable from each other, so the concept of density-connected points is also introduced.
The point p is density-connected to the point q for given Eps and MinPts if there is a point o such that both p and q are density-reachable from o for the given Eps and MinPts.
Having introduced the above definitions, we can move on to a description of the DBSCAN algorithm itself. To find a cluster, DBSCAN starts from an arbitrary point p and retrieves all points density-reachable from it for the given Eps and MinPts. If p is a core point, this procedure yields a whole cluster. If p is a border point, then no point is density-reachable from p and the algorithm proceeds to the next point in the dataset.
Since global values of the Eps and MinPts parameters are used, the algorithm can merge two clusters into one if they are “close enough” to each other. Let the distance between sets $S_1$ and $S_2$ be defined as $d(S_1, S_2) = \min\{ d(p, q) : p \in S_1, q \in S_2 \}$. Then, two sets of points having at least the density of the thinnest cluster will only be separated if the distance between them is greater than Eps. Consequently, it may be necessary to apply the DBSCAN algorithm recursively to detected clusters with a higher MinPts value. However, this is not a disadvantage, as the recursion leads to an elegant and efficient basic algorithm, and recursive clustering is only necessary under a condition that is easy to verify.
The selection of Eps and MinPts can be conducted by various methods, including checking the values of metrics available for unsupervised clustering, such as the elbow criterion or the silhouette criterion. The advantage of the DBSCAN algorithm is its applicability to spatial data and the fact that the number of clusters does not have to be determined in advance.
A more recent improvement of this algorithm is HDBSCAN [18], which allows clusters of varying density. The algorithm works similarly to the original DBSCAN method but does not have an eps parameter responsible for the cluster radius. Only the min_cluster_size parameter, defining the minimum number of points in a cluster, is responsible for the size of the clusters in this method. Both algorithms usually work well on data with noise, although the newer HDBSCAN algorithm is usually more efficient, and allowing different cluster densities often produces a better solution. Since we obtained sufficient results with the DBSCAN method in the experimental part, we concluded that it was not necessary to compare it with the improved version of the algorithm.
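A minimal sketch of DBSCAN is given below, assuming Python with scikit-learn (not stated in the paper) and a precomputed distance matrix derived from a similarity matrix as one minus the similarity, as done later in the paper; the eps and min_samples values are illustrative.

```python
# DBSCAN sketch on a precomputed distance matrix (scikit-learn assumed).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                 # illustrative encoded observations

sim = (cosine_similarity(X) + 1.0) / 2.0      # cosine similarity rescaled to [0, 1]
dist = 1.0 - sim                              # turn similarity into a distance

db = DBSCAN(eps=0.3, min_samples=5, metric="precomputed").fit(dist)
labels = db.labels_                           # -1 marks noise points
print("clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points:", int(np.sum(labels == -1)))
```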

2.2.4. Optimal Cluster Number Selection Methods

The clustering algorithms belong to the unsupervised learning group, which means that it is impossible to compare the results with the original classes to verify the clustering accuracy. Therefore, for this task, a number of coefficients exist to assess the clustering quality based on differences between observations in the same groups and between observations from different groups. At the outset, two commonly used methods are presented: the elbow method and the silhouette method.
The elbow method is based on measuring the within-cluster sum of squared errors (WSS) for a varying number of clusters k and then selecting the value of k at which the decrease in WSS begins to level off. The WSS is defined as:
$$WSS = \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2,$$
with $x_i$ representing the ith row of a dataset having n rows, and $\bar{x}$ representing the average value in the cluster.
The idea behind the elbow method is that the explained variability changes rapidly for a small number of clusters and then slows down, resulting in an elbow in the curve [19]. The point at which the so-called elbow is formed indicates the number of clusters that should be optimal in a given clustering algorithm.
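A minimal sketch of the elbow criterion, assuming Python with scikit-learn (the inertia_ attribute of KMeans is the WSS defined above; the range of k is illustrative):

```python
# Elbow method sketch: compute WSS (KMeans inertia) for a range of k values
# and look for the point where the curve starts to flatten.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

wss = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = km.inertia_          # within-cluster sum of squared distances

for k, v in wss.items():
    print(f"k={k:2d}  WSS={v:10.2f}")
# The "elbow" is the k after which WSS decreases only marginally.
```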
The silhouette coefficient is a slightly stricter method than the one described previously, but it is better at identifying relationships within and between groups. For each observation i in the analysed data, the coefficient is defined as [20]
$$S(i) = \frac{b(i) - a(i)}{\max\{ a(i), b(i) \}},$$
where b(i) is the smallest mean distance from observation i to all points in any other cluster, and a(i) is the mean distance from i to all other points in the same group. For example, assuming we have three clusters A, B and C and observation i belongs to cluster C, then b(i) is calculated by taking the mean distance from i to each point in cluster A and the mean distance from i to each point in cluster B, and choosing the smaller of the two. The silhouette coefficient for a dataset is the mean of the silhouette coefficients of the individual points.
This measure indicates whether individual points are correctly assigned to their clusters. When using this coefficient, we can use the following interpretations:
  • S(i) close to 0 means that the point is between two clusters.
  • If S(i) is close to −1, then the studied point should be in a different cluster.
  • If S(i) is close to 1, then the point belongs to the “correct” cluster.
The described methods were supplemented by two less classic ones.
The Calinski–Harabasz index is based on the assumption that good clusters are very compact and sufficiently distant from each other. The index is calculated by dividing the between-cluster dispersion (based on the distances between the cluster centres) by the within-cluster dispersion (based on the distances of individual objects to their cluster centres), each normalized by its degrees of freedom. The higher the index value, the better the division. The formula for the Calinski–Harabasz index is defined as [21]:
$$CH(k) = \frac{BCSM}{k - 1} \times \frac{n - k}{WCSM},$$
where k is the number of clusters formed from a set of n observations, BCSM (between-cluster scatter matrix) determines the separation between groups and WCSM (within-cluster scatter matrix) determines the compactness within clusters. A higher value of the CH index means the clusters are dense and well separated, although there is no “acceptable” cut-off value. One should choose the solution that gives a peak, or at least an abrupt elbow, on the line plot of CH indices; if the line is smooth (horizontal, ascending or descending), there is no reason to prefer one solution over the others.
Like the silhouette and Calinski–Harabasz methods, the Davies–Bouldin method focuses on both the separation and the compactness of the clusters. This follows from the maximum in the formula below, which for each cluster selects the worst case: the pairing for which points are, on average, furthest from their cluster centres while the centres themselves are closest to each other. In contrast to the silhouette and Calinski–Harabasz indices, a decrease in the DB index means an improvement in the clustering. The Davies–Bouldin index is defined as [22]:
$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)},$$
where k is the number of clusters, $\sigma_i$ is the mean distance of all points in the ith cluster from the cluster centre $c_i$, and $d(c_i, c_j)$ is the distance between the ith and jth cluster centroids.
The application of this method is similar to that of the CH index.
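All three internal indices described above are available, for example, in scikit-learn (an assumption on our side; the paper does not name its tooling). A minimal sketch comparing candidate numbers of clusters:

```python
# Comparing candidate numbers of clusters with the silhouette,
# Calinski-Harabasz and Davies-Bouldin indices (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil = silhouette_score(X, labels)            # higher is better, in [-1, 1]
    ch = calinski_harabasz_score(X, labels)      # higher is better
    db = davies_bouldin_score(X, labels)         # lower is better
    print(f"k={k}  silhouette={sil:.3f}  CH={ch:.1f}  DB={db:.3f}")
```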

2.2.5. Graph Model

A graph is a mathematical structure $G = (V, E)$ consisting of a nonempty set V of objects called vertices and a family E of two-element sets of vertices called edges. The edges represent the presence of some relation between two vertices. For our purpose, we can assume that clients are vertices and that two clients are joined by an edge if their features or shopping habits are similar. The graph of clients can be constructed in different ways, depending on the structure of the dataset. If only the information on which products were bought by which clients is available, then we first construct an auxiliary graph of clients and products, in which each client is joined to the products they bought. On this basis, we construct the main graph, in which two clients are joined by an edge if they bought the same product.
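For the purchase-only case, this construction can be sketched as a bipartite projection, assuming Python with networkx (not stated in the paper; client and product names are illustrative):

```python
# Sketch of the client graph construction from purchase data (networkx assumed).
# Clients are joined to the products they bought; the projection then joins
# two clients with an edge if they bought at least one common product.
import networkx as nx
from networkx.algorithms import bipartite

purchases = [("client_1", "product_A"), ("client_1", "product_B"),
             ("client_2", "product_B"), ("client_3", "product_C")]

B = nx.Graph()
B.add_nodes_from({c for c, _ in purchases}, bipartite=0)   # client nodes
B.add_nodes_from({p for _, p in purchases}, bipartite=1)   # product nodes
B.add_edges_from(purchases)

clients = {c for c, _ in purchases}
G = bipartite.projected_graph(B, clients)   # client-client graph
print(list(G.edges()))                       # e.g. [('client_1', 'client_2')]
```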
When looking for clusters, we are looking for similar clients, so we must first define what it means for two clients to be similar. A similarity $sim(x, y)$ is a function on the set of clients fulfilling the following conditions: $sim(x, y) = 1$ (or the maximum similarity) if and only if $x = y$, and $sim(x, y) = sim(y, x)$ for all pairs $(x, y)$. There is no single universal similarity function (see, e.g., [23]). For the purpose of this research, the cosine similarity was applied [24].
Let $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ be the vectors one wishes to compare. The similarity function expressed as a cosine measure has the form:
$$sim(x, y) = \frac{x \cdot y}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}},$$
where $x \cdot y$ is the scalar product of the vectors x and y.
The measure calculates the cosine of the angle between x and y. A cosine value of 0 means that the two vectors are at an angle of 90° to each other (orthogonal) and are not similar. The closer the cosine value is to 1, the smaller the angle and the greater the similarity of the vectors.
Using the similarity function, a so-called similarity matrix can be constructed, in which each entry is the value of the similarity function for the corresponding pair of customers. Based on that matrix, a customer graph can be constructed, in which an edge is placed between two customers if their similarity is sufficiently large. This cut-off threshold is a metaparameter of the construction; it may vary and depends on the dataset. It is convenient for the output graph to be connected, which means that there is a connection (not necessarily direct) between each pair of vertices.
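A minimal sketch of this construction, assuming Python with scikit-learn and networkx (not stated in the paper); the cut-off value and the data are illustrative:

```python
# Build a cosine-similarity matrix, rescale it to [0, 1] and construct a customer
# graph with edges only between sufficiently similar customers (networkx assumed).
import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                 # encoded customers (e.g. after PCA)

sim = (cosine_similarity(X) + 1.0) / 2.0     # rescaled cosine similarity
cutoff = 0.75                                # metaparameter of the construction

G = nx.Graph()
G.add_nodes_from(range(len(X)))
rows, cols = np.where(np.triu(sim, k=1) >= cutoff)   # upper triangle only, no self-loops
G.add_edges_from(zip(rows.tolist(), cols.tolist()))

print("edges:", G.number_of_edges(), "connected:", nx.is_connected(G))
```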

2.2.6. Communities in Graphs

To search for communities in a graph is to divide the graph $G = (V, E)$ into disjoint subgraphs characterized by a large number of edges inside each community and a small number of edges between communities. A community structure $A = \{A_1, \ldots, A_R\}$ on the vertex set V is a division of V into R subsets such that $\bigcup_{i=1}^{R} A_i = V$ and $A_i \cap A_j = \emptyset$ for $i \neq j$, $i, j \in \{1, 2, \ldots, R\}$. The subgraphs $H_i$ induced by the vertex sets $V(H_i) = A_i$ are the communities in the graph.
There are many algorithms that create the aforementioned divisions. One of the parameters often used for this purpose is modularity [25].
Modularity is a measure that takes into account the degree distribution of vertices in a network and is used to measure the strength of a network’s division into modules. Networks with a high modularity have dense connections between vertices in groups and sparse connections between vertices of different modules.
Modularity is the fraction of edges falling within the given groups minus the expected fraction if the edges were distributed at random. The value of this function lies in the range [−1, 1]. It is positive when the number of edges within the groups exceeds the number expected under random placement. A modularity greater than 0.3 is usually taken to mean that there is a significant community structure. The value of 0.3 is not a strict rule, but a useful and easy-to-check heuristic.
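A minimal sketch of checking this heuristic for a given partition, assuming Python with networkx (the graph and the two-group split are illustrative, not the customer graph from the study):

```python
# Checking the modularity of a given partition with networkx (assumed library).
import networkx as nx
from networkx.algorithms.community import modularity

G = nx.karate_club_graph()                             # a small example graph
partition = [set(range(0, 17)), set(range(17, 34))]    # an illustrative 2-group split

Q = modularity(G, partition)
print(f"modularity Q = {Q:.3f}")
# Following the heuristic above, Q > 0.3 suggests a meaningful community structure.
```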
There are a number of clustering algorithms. We provide a short description of those used in this research. We refer the reader to [26] for more information.

2.2.7. Multilevel (Louvain) Algorithm [27]

1. Initially, each node is assigned to its own cluster, so the number of clusters is equal to the number of nodes. For each neighbour j of node i, it is checked whether the total modularity increases when node i is moved from its cluster to the cluster of node j. Node i is moved to the cluster of the node j for which the modularity increase is the largest (and positive). If none of the observed increments is positive, node i remains in its cluster. This step is repeated sequentially for all nodes; this is called a single iteration. Iterations are repeated until no further modularity improvement can be achieved. Note that a node can be, and usually is, visited more than once.
2. The second step is to rebuild the network by grouping the nodes labelled as belonging to the same community in step 1 (merging individual nodes). The edge weight between two new communities is the sum of the edge weights between their constituent nodes; edges between nodes of the same community become loops, so the diagonal elements of the new adjacency matrix are nonzero.
One execution of steps 1 and 2 is called a transition. After the first transition, the algorithm attempts a second transition, and so on; the number of communities decreases after each transition. This continues until no changes in the communities are observed and maximum modularity is reached.

2.2.8. Greedy Algorithm [28]

The fast greedy algorithm optimizes the modularity factor. Initially, the vertices are not grouped and each vertex forms its own community. Calculating the modularity improvement for the entire community structure using the adjacency matrix, as in the multilevel algorithm, makes finding the pair with the largest modularity increase time-consuming. Instead, the expected modularity improvement is calculated for each pair of vertices, and the pairs that achieve the greatest modularity improvement are merged into a new group. This procedure is repeated as long as new mergers result in an increase in modularity.

2.2.9. Label Propagation Algorithm [29]

At the beginning of a given step, each node has a label indicating the community to which it currently belongs. Community membership changes based on the labels of neighbouring nodes: a node adopts the label carried by the largest number of its neighbours. Initially, each node is initialized with a unique label, and the labels then spread through the network, so that densely connected groups quickly reach a common label. The main step is as follows.
For each $x \in X$ from the sequence, let
$$C_x(t) = f\big( C_{x_{i_1}}(t), \ldots, C_{x_{i_m}}(t), C_{x_{i_{m+1}}}(t-1), \ldots, C_{x_{i_k}}(t-1) \big),$$
where $x_{i_1}, \ldots, x_{i_k}$ are the neighbours of x, of which the first m have already obtained a label in the current step, while the others still carry labels from the previous step. The function f returns the label occurring with the highest frequency among the neighbours. If several labels share the highest frequency, $C_x(t)$ is chosen randomly among them.
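The three algorithms described in Sections 2.2.7–2.2.9 are available, for example, in networkx (an assumption on our side; the Louvain routine requires a recent networkx version). A minimal sketch comparing their partitions and modularities on an illustrative graph:

```python
# Running the Louvain, greedy modularity and label propagation algorithms
# on the same graph and comparing the modularity of the resulting partitions
# (networkx assumed; louvain_communities requires networkx >= 2.8).
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()   # placeholder for the thresholded customer graph

louvain = community.louvain_communities(G, seed=0)
greedy = community.greedy_modularity_communities(G)
labelprop = list(community.label_propagation_communities(G))

for name, parts in [("Louvain", louvain), ("greedy", greedy), ("label propagation", labelprop)]:
    q = community.modularity(G, parts)
    print(f"{name:18s} groups={len(parts):2d}  modularity={q:.3f}")
```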

2.2.10. Community Comparison

In general, to perform proper clustering of a particular graph, partitioning is carried out using various algorithms. One of the parameters for evaluating partitioning is the aforementioned modularity.
To compare divisions with each other, the normalized mutual information (NMI) parameter [30] can be used.
Let $A = \{A_1, \ldots, A_R\}$ and $B = \{B_1, \ldots, B_S\}$ denote two divisions of V into communities. The overlap of A and B can be expressed as a contingency matrix C of size $R \times S$, in which $C_{ij}$ denotes the number of common vertices of $A_i$ and $B_j$ ($C_{ij} = |A_i \cap B_j|$).
The normalized mutual information parameter is defined as:
$$\mathrm{NMI}(A, B) = \frac{-2 \sum_{i=1}^{R} \sum_{j=1}^{S} C_{ij} \log \left( \frac{C_{ij}\, n}{C_{i.}\, C_{.j}} \right)}{\sum_{i=1}^{R} C_{i.} \log \left( \frac{C_{i.}}{n} \right) + \sum_{j=1}^{S} C_{.j} \log \left( \frac{C_{.j}}{n} \right)},$$
where $C_{i.} = \sum_{j=1}^{S} C_{ij}$ and $C_{.j} = \sum_{i=1}^{R} C_{ij}$.
The NMI parameter takes values in the interval [0, 1]. If the divisions A and B are identical, then NMI(A, B) = 1; if they have no similarity, NMI(A, B) = 0.
Thus, if several algorithms produce the same or similar partitions into clusters, the partition can be considered appropriate.
The Fowlkes–Mallows index (FMI) can also be used to compare the similarity of two clusterings [31]. It is defined as the geometric mean of precision and recall:
$$FMI = \frac{TP}{\sqrt{(TP + FP)(TP + FN)}},$$
where TP is the number of pairs of points that belong to the same group in both clusterings, FP is the number of pairs of points that belong to the same group in the first clustering but not in the second, and FN is the number of pairs of points that belong to the same group in the second clustering but not in the first.
The index takes values between 0 and 1. The higher the value, the more similar the two clusterings.
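Both comparison measures are implemented, for example, in scikit-learn (an assumption on our side; label vectors below are illustrative):

```python
# Comparing two clusterings with NMI and the Fowlkes-Mallows index (scikit-learn assumed).
from sklearn.metrics import normalized_mutual_info_score, fowlkes_mallows_score

labels_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # partition produced by one algorithm
labels_b = [0, 0, 1, 1, 1, 1, 2, 2, 2]   # partition produced by another algorithm

nmi = normalized_mutual_info_score(labels_a, labels_b)
fmi = fowlkes_mallows_score(labels_a, labels_b)
print(f"NMI = {nmi:.3f}, FMI = {fmi:.3f}")   # both equal 1 only for identical partitions
```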
Table 1 lists the main notations that appear in Section 2.

2.2.11. Proposed Methodology

Clustering comparisons were made on two datasets: resellers and customers. First, embeddings for categorical variables were generated with the use of an autoencoder. An autoencoder is a neural network designed to encode input data in the form of the so-called hidden representation and then decode them in such a way that the reconstructed data are as similar to the original ones as possible. It is worth noting that most embedding generation methods require knowledge of the target variable. Using an autoencoder removes that need, since the loss function is calculated by comparing the reconstruction with the input data. A diagram of how an autoencoder combined with an embedding layer, known as AutoEmbedder, works is shown in Figure 1. A similar usage of autoencoders can be seen in [6,32]. In the case of continuous variables, a transformation to a categorical variable was performed before their inclusion in the model by determining classes (ranges) of variable values. Embeddings were then generated for the categories using the neural network’s embedding layer. The embeddings were the input data for the autoencoder.
AutoEmbedder was trained on the training set. Once it had been trained, the embedding layer and the encoder were used to generate the embeddings. A principal component analysis (PCA) [33] was then performed on the coded data, retaining enough components to keep 95% of the variance explained. After that procedure, each observation had the form of a vector of real numbers. For the observations coded as described above, with an appropriate similarity measure selected, we generated a similarity matrix. We then proceeded to the clustering algorithms, including graph algorithms. The methodology is graphically illustrated in Figure 2.
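One possible realization of this idea is sketched below. It is only an illustration, not the exact AutoEmbedder architecture of [6]: Keras is assumed, a single categorical variable is used, the layer sizes and the number of category levels are invented, and the decoder reconstructs the original category index rather than any particular representation used in the study.

```python
# Sketch of an autoencoder with an embedding layer for one categorical variable
# (Keras assumed; all sizes are illustrative, not those used in the study).
import numpy as np
from tensorflow.keras import layers, Model

n_levels = 10          # number of levels of the categorical variable (assumed)
emb_dim = 4            # embedding size (assumed)
code_dim = 2           # size of the hidden representation (assumed)

inp = layers.Input(shape=(1,), dtype="int32")
x = layers.Flatten()(layers.Embedding(n_levels + 1, emb_dim)(inp))   # embedding layer
encoded = layers.Dense(code_dim, activation="relu")(x)               # encoder
decoded = layers.Dense(emb_dim, activation="relu")(encoded)          # decoder
out = layers.Dense(n_levels + 1, activation="softmax")(decoded)      # reconstruct category

autoembedder = Model(inp, out)
encoder = Model(inp, encoded)  # embedding layer + encoder, reused after training
autoembedder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

cats = np.random.randint(0, n_levels + 1, size=(1000, 1))   # illustrative category indices
autoembedder.fit(cats, cats, epochs=3, batch_size=32, verbose=0)

codes = encoder.predict(cats, verbose=0)   # dense numerical vectors, later fed to PCA
print(codes.shape)
```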
As a similarity metric, a cosine similarity scaled to a range from 0 to 1 was chosen.
The following algorithms were used to find groups of similar customers: k-means, DBSCAN, Louvain community, greedy modularity and label propagation. The last three are graph methods. For their implementation, a graph had to be created to represent the relationships between the customers in the set. When building the graph representing the relationships between the most similar customers, it was assumed that the graph should be connected. The graph was constructed based on the similarity matrix, and the creation of edges between similar clients was controlled by a cut-off parameter: an edge was created between two clients if the similarity between them was greater than or equal to the set cut-off parameter. The cut-off parameters we tested were 0.25, 0.5 and 0.75.
The resulting clusters were compared using metrics available for unsupervised clustering. The Calinski–Harabasz and Davies–Bouldin indices were calculated for the clusterings, which helped to select the optimal clustering. To compare how similar two clusterings were, the NMI parameter (normalized mutual information) and the Fowlkes–Mallows index could be used. Moreover, for graph methods, the modularity could be compared. When determining the optimal number of clusters in the k-means algorithm, we used the results of the elbow and silhouette methods and the Calinski–Harabasz and Davies–Bouldin index values. In DBSCAN clustering, the parameters were selected based on the silhouette criterion.

2.2.12. Computational Complexity

Firstly, AutoEmbedder was trained on the training dataset. Then, to encode categorical variables, only the embedding and encoder parts of AutoEmbedder were used.
If there were p categorical features, and each of them had $k_i$ levels ($i = 1, 2, \ldots, p$), then the embedding layer generated $k_i + 1$ values for each feature, resulting in a vector of size $\sum_{i=1}^{p}(k_i + 1)$ for each of the n samples in the dataset. The encoder consisted of dense layers with a ReLU activation function. If there were l layers with $l_i$ ($i = 1, 2, \ldots, l$) nodes each, the complexity of calculating the prediction for one sample in each dense layer was $O(l_i \times l_{i-1})$, where $l_0$ is the size of the input—in the presented case $\sum_{i=1}^{p}(k_i + 1)$, because of the embedding layer placed before the encoder.
Standard PCA involves $O(\min(p, n)^3)$ operations, where n is the number of samples and p is the number of features of each sample [34]. In our case, since the number of samples was much higher than the number of features, the time to complete the PCA algorithm was $O(n^3)$. The complexity of computing the similarity matrix was $O(n^2)$. The clustering algorithms used had the following complexities:
  • The Louvain algorithm is believed to run in $O(m)$ time, where m is the number of edges in the graph [35];
  • In the greedy modularity algorithm, in each step the pair of nodes/communities whose merger increases modularity the most becomes part of the same community, and the modularity is then calculated for the full network. This step requires $O(m)$ calculations. After merging communities, the adjacency matrix of the network is updated, in the worst case in $O(n)$ time. The merging step is repeated $n - 1$ times, hence the overall complexity is $O((m + n) n)$, or $O(n^2)$ for a sparse graph [36];
  • The label propagation algorithm has $O(n)$ computational complexity [37];
  • The DBSCAN algorithm scans the whole dataset only once and needs the distance between every pair of objects in the dataset, hence the worst-case computational complexity of the whole algorithm is $O(n^2)$ [38]. In our case, the distances did not need to be computed separately, because we took the distance as one minus the value from the similarity matrix.
From the above analysis, it can be seen that the worst computational complexity occurs for the PCA algorithm, where it amounted to $O(n^3)$. Therefore, we could conclude that the worst-case complexity of the proposed methodology was $O(n^3)$.

3. Results

3.1. Results for Retail Customers

AutoEmbedder did not generate any errors when coding the set. After using AutoEmbedder, the set contained 96 columns. Subsequently, after the application of the PCA, the number of columns decreased to 8. Similarities between customers were determined using the cosine similarity. The distribution of rescaled similarities is included in Figure 3.
For a cut-off parameter of 0.75 by Louvain’s method, three groups were obtained. There were 565 customers in group 0, 929 customers in group number 1 and 746 customers in group number 2. The greedy modularity and label propagation methods divided the set of customers into two groups. There were 1285 customers in group 0 and 955 customers in group 1 for the greedy modularity and label propagation methods. The resulting divisions using the two methods were similar. The value of the Fowlkes–Mallows index comparing the two divisions was about 0.98, while the NMI parameter was about 0.94. After changing the cut-off parameter to 0.5, Louvain’s method also returned three communities, but the resulting groups were no longer as evenly distributed as with a cut-off parameter of 0.75. Using the greedy modularity method, two groups were obtained. The first group included 1200 customers, while the second group included 1040 customers. The label propagation method detected only one community in the graph consisting of all customers. With a cut-off parameter of 0.25, Louvain’s method divided the set of customers into two groups. There were 1048 customers in group number 0 and 1192 customers in group 1. Two groups were also formed using the greedy modularity method. There were 1131 customers in group 0 and 1109 customers in group number 1. The label propagation method gathered all customers into one community. For the graph methods, we compared the divisions using the modularity parameter. The highest value of the modularity parameter was characterized by the division obtained by Louvain’s method (0.566) with a cut-off parameter of 0.75.
Using the DBSCAN method, we obtained three groups. There were 950 customers in group number 0, 831 customers in group 1 and 16 customers in group 2. Overall, 443 customers were not classified.
We compared the graph methods and the DBSCAN method with ordinary clustering methods (k-means method). The k-means method divided customers into two groups.
Using the Calinski–Harabasz and Davies–Bouldin indices, we compared the quality of the obtained clusters in each method. The best values for the Calinski–Harabasz and Davies–Bouldin indices were obtained for the greedy modularity method with a cut-off parameter of 0.5. For the k-means method, we obtained similar index values.
Using the NMI and Fowlkes–Mallows parameters, the similarity degree of the clusters obtained by the different methods was tested. The divisions obtained by different methods varied. The full results of the NMI and Fowlkes–Mallows indices are presented in Table 2 and Table 3, respectively.
The Calinski–Harabasz and Davies–Bouldin index values for the various methods and cut-off parameters are presented in Table 4.
Table 5 presents the modularity parameter values for the different graph methods and cut-off parameters. The highest value corresponded to the Louvain method with a cut-off parameter of 0.75.
The considered dataset lent itself to clustering, in most cases into three groups. Unsurprisingly, the greatest modularity was obtained with a cut-off parameter of 0.75. This level indicated a high similarity between vertices, so edges existed only between the most similar customers. As the cut-off parameter decreased, more and more edges appeared in the graph, which could have a negative effect on the quality of the clustering.
The summary of all obtained groups and their sizes in each method is presented in Table 6. Please take note that group numbers in one division do not correspond to group numbers in other divisions.

3.2. Results for Resellers

The clustering algorithms were also tested on the dataset of resellers, using AutoEmbedder for encoding. AutoEmbedder did not generate any errors when coding the set. After using AutoEmbedder, the set consisted of 132 columns; the number of columns was then reduced to 43 by the PCA. The similarities between customers were determined using the cosine similarity. The distribution of rescaled similarities is shown in Figure 4.
The first cut-off point tested was 0.75. For this value, the obtained graph was disconnected, and therefore no clustering was performed on it (the number of groups would exceed 400, out of 701 observations in total). For a cut-off parameter of 0.5, Louvain’s method returned three groups. There were 151 resellers in group 0, 215 resellers in group 1 and 335 resellers in group 2. Using the greedy modularity method, we obtained two groups with the following numbers: group 0—351 resellers, group 1—350 resellers. The label propagation method also detected two communities in the graph; 350 customers were clustered in group 0 and 351 customers in group 1. The value of the Fowlkes–Mallows index indicated that the clusterings obtained by the greedy modularity and label propagation methods were identical, and that the Louvain community clustering was similar to them (the index was 0.83). The NMI values confirmed that the greedy modularity and label propagation methods gave the same clusters. According to the NMI parameter, the similarity between these two clusterings and the one obtained by the Louvain community method was slightly lower than suggested by the Fowlkes–Mallows parameter, but still quite significant. For the greedy modularity and label propagation methods, we obtained a slightly higher value of the modularity parameter than for the Louvain method (around 0.3).
Subsequently, a test was conducted for a cut-off point of 0.25. With a cut-off parameter of 0.25, Louvain’s method divided the set of customers into two groups. There were 356 resellers in group 0 and 345 resellers in group 1. Two groups were likewise created using the greedy modularity method. There were 382 resellers in group 0 and 319 resellers in group number 1. The label propagation method yielded a single group bringing together all resellers. According to the Fowlkes–Mallows parameter, the groups obtained by the label propagation method were roughly similar to the groups from the other graph methods (the index value was around 0.71). The least similar groups were those obtained by the Louvain community and greedy modularity methods (value around 0.5). According to the NMI parameter, the clustering had no similarities at all. The full results of the NMI and Fowlkes–Mallows indices are presented in Table 7 and Table 8, respectively. For divisions using graph methods with a cut-off parameter of 0.25, we obtained poor values for the modularity parameter (below 0.1).
With the DBSCAN method, we obtained only one group, and most of the customers were not assigned to it.
The Calinski–Harabasz and Davies–Bouldin index values for the different methods and cut-off parameters are included in Table 9.
The modularity values for the different graph methods and cut-off parameters are presented in Table 10. The table does not include parameter values for divisions with one group. The highest value of the modularity parameter corresponds to the greedy modularity and label propagation methods for a cut-off parameter of 0.5.
The dataset tested was not a good candidate for clustering, as indicated by the results obtained. Despite the use of advanced data processing methods and various clustering algorithms, the data contained in this set were too homogeneous (or too diverse) for the clustering to be able to reasonably propose a division into groups. This was indicated by the modularity parameter remaining below the suggested value of 0.3.
The summary of all obtained groups and their sizes in each method is presented in Table 11. Please take note that group numbers in one division do not correspond to group numbers in other divisions.

4. Discussion

The presented methodology and research indicated how the quality of a dataset can be determined with respect to the segmentation of its observations. Because of the data’s homogeneity, or their excessive heterogeneity, not every set can be divided into groups in a meaningful way, even if advanced preprocessing (categorization, autoencoder, PCA) is used. The indices that determined the clustering quality were the modularity parameter, the Calinski–Harabasz index and the Davies–Bouldin index. In the future, selecting the optimal cut-off parameter could be automated by choosing the highest value (in the range (0, 1)) at which the graph remains connected.
Most clustering algorithms can work only for categorical data or only for numerical data. Before using such algorithms, it is necessary to preprocess the data, which includes the discretization of continuous variables or one-hot encoding of categorical variables. The discretization process always generates some loss of information. In that context, our methodology potentially could be improved by some other treatment of continuous variables.
Another potential problem is the presence of missing values. We did not encounter this problem in the two chosen datasets; however, it may be a common occurrence in real-life application of our framework. To overcome the problems of missing data, imputation can be used or the observations with missing data might be deleted. However, simply deleting data is not a good practice, as it can lead to the loss of important information. In addition, deleting missing observations can result in leaving too few objects in the dataset, especially when a large number of variables contain missing data. A novel approach to imputation of missing data in clustering tasks was shown in [39]. There, an innovative algorithm called k-CMM was presented. That algorithm combined imputing the missing data with the clustering algorithm. It was significantly better than many other algorithms especially in the case when the number of missing data increased in the dataset.
In our research, we encountered a problem with high-dimensional data, because of the embeddings produced by AutoEmbedder. It is known that a higher dimensionality in the context of clustering may have an effect on distance or similarity measures. In particular, most clustering techniques depend critically on the measure of distance or similarity, and require that the objects within clusters are, in general, closer to each other than to objects in other clusters [5]. To minimize the problems connected to high dimensionality, we used PCA to reduce the number of features.
However, it should be highlighted that despite the use of sophisticated methods, the set of resellers did not tend to cluster. It is important to note that not all markets are clearly segmented, and the tools presented here make it possible to verify this. The set of retail customers was homogeneous with respect to some characteristics. It was important to choose a cut-off parameter that ensured the connectivity of the graph and at the same time allowed us to extract the most heterogeneous groups of customers. The best clustering was provided by the Louvain method with a cut-off parameter of 0.75, which divided the set of customers into three heterogeneous groups.

5. Conclusions

A proper grouping of customers can be helpful in making appropriate marketing decisions that match the profile of a given target group. Extracting clusters and analysing their characteristics can allow a better matching of advertising campaigns, promotions, newsletters, loyalty programs and many other activities. In future work, it is planned to create an automated marketing decision support system which is similar to a hybrid recommendation system. In an ordinary recommendation system, the relationships between users (customers) and products are explored. Then, customers are recommended the most suitable products for them. The marketing recommendation system is planned as a functionality for the company’s management or for the marketing department. With such a system, the user would receive recommendations of marketing activities for a group of similar customers.
As was said in the discussion section, we detected some potential problems such as missing values and different ways to encode categorical variables. In future work, we would like to propose a methodology that solves those issues. In future work, it would also be worth checking if using different clustering methods would render significantly differing results. One of the potential methods to use is HDBSCAN, as an improvement of the DBSCAN method.

Author Contributions

Conceptualization, T.C., I.G., Ł.S. and T.R.; methodology, I.G., A.R. and E.P.; software, A.R. and E.P.; validation, T.C., I.G. and D.P.; formal analysis, A.R. and E.P.; investigation, A.R. and E.P.; resources, T.C.; data curation, A.R. and E.P.; writing—original draft preparation, E.P., A.R. and I.G.; writing—review and editing, I.G., T.R. and T.C.; visualization, E.P. and A.R.; supervision, T.C.; project administration, T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

AdventureWorksDW_2016EXT, available online at https://learn.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver15&tabs=ssms (accessed on 5 May 2022). Kaggle Customer Personality Analysis Set, available online at https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis (accessed on 10 May 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
PCA       Principal component analysis
DBSCAN    Density-based spatial clustering of applications with noise
HDBSCAN   Hierarchical density-based spatial clustering of applications with noise
NMI       Normalized mutual information index
CH index  Calinski–Harabasz index
DB index  Davies–Bouldin index
L.25      Louvain's method based on graph with cut-off parameter 0.25
L.5       Louvain's method based on graph with cut-off parameter 0.5
L.75      Louvain's method based on graph with cut-off parameter 0.75
GM.25     Greedy modularity method based on graph with cut-off parameter 0.25
GM.5      Greedy modularity method based on graph with cut-off parameter 0.5
GM.75     Greedy modularity method based on graph with cut-off parameter 0.75
LP.25     Label propagation method based on graph with cut-off parameter 0.25
LP.5      Label propagation method based on graph with cut-off parameter 0.5
LP.75     Label propagation method based on graph with cut-off parameter 0.75
DBS       Shorter form of the DBSCAN method, used in tables with NMI and Fowlkes–Mallows indices
KM        Shorter form of the k-means method, used in tables with NMI and Fowlkes–Mallows indices

References

  1. Weinstein, A. Market Segmentation Handbook: Strategic Targeting for Business and Technology Firms, 3rd ed.; Haworth Press: Binghamton, NY, USA, 2004. [Google Scholar]
  2. Huang, J.-J.; Tzeng, G.-H.; Ong, C.-S. Marketing segmentation using support vector clustering. Expert Syst. Appl. 2007, 32, 313–317. [Google Scholar] [CrossRef]
  3. Myers, J.H. Segmentation and Positioning for Strategic Marketing Decisions. J. Acad. Mark. Sci. 1996, 28, 438. [Google Scholar]
  4. Wedel, M.; Kamakura, W.A. Market Segmentation: Conceptual and Methodological Foundations, 2nd ed.; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2010; pp. 10–15. [Google Scholar]
  5. Steinbach, M.; Ertöz, L.; Kumar, V. The Challenges of Clustering High Dimensional Data. In New Directions in Statistical Physics: Econophysics, Bioinformatics, and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2004; pp. 273–309. [Google Scholar]
  6. Ohi, A.Q.; Mridha, M.F.; Safir, F.B.; Hamid, M.A.; Monowar, M.M. AutoEmbedder: A semi-supervised DNN embedding system for clustering. Knowl. Based Syst. 2020, 204, 106190. [Google Scholar] [CrossRef]
  7. Desarbo, S.H.; Ramaswamy, W.S.; Cohen, V. Market segmentation with choice-based conjoint analysis. Mark. Lett. 1995, 6, 137–147. [Google Scholar] [CrossRef] [Green Version]
  8. Perbert, F.; Stenger, B.; Maki, A. Random Forest Clustering and Application to Video Segmentation. BMVC 2009, 1–10. Available online: http://www.toshiba-europe.com/research/crl/cvg/ (accessed on 6 June 2022).
  9. Minhas, R.S.; Jacobs, E.M. Benefit Segmentation by Factor Analysis: An improved method of targeting customers for financial services. Int. J. Bank Mark. 1996, 14, 3–13. [Google Scholar] [CrossRef]
  10. Burinskiene, M.; Rudzkiene, V. Application of Logit Regression Models for the Identification of Market Segments. J. Bus. Econ. Manag. 2008, 8, 253–258. [Google Scholar] [CrossRef] [Green Version]
  11. Fish, K.E.; Barnes, J.H.; Aiken, M.W. Artificial neural networks: A new methodology for industrial market segmentation. Ind. Mark. Manag. 1995, 24, 431–438. [Google Scholar] [CrossRef]
  12. Garima, G.H.; Singh, P.K. Clustering techniques in data mining: A Comparison. In Proceedings of the 2nd International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 11–13 March 2015. [Google Scholar]
  13. Ramasubbareddy, S.; Srinivas, T.A.S.; Govinda, K.; Manivannan, S.S. Comparative Study of Clustering Techniques in Market Segmentation. Innov. Comput. Sci. Eng. Lect. 2020, 103, 117–125. [Google Scholar]
  14. Kaggle. Customer Personality Analysis Set. Available online: https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis (accessed on 28 July 2022).
  15. AdventureWorksDW_2016EXT. Available online: https://learn.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver15&tabs=ssms (accessed on 18 July 2022).
Figure 1. Architecture of AutoEmbedder.
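For readers who want to experiment with the idea behind Figure 1, the snippet below is a minimal, hypothetical sketch of an autoencoder that maps one-hot-encoded categorical attributes to a dense embedding. The layer sizes, activations and embedding dimension are illustrative assumptions and do not reproduce the exact AutoEmbedder configuration used in the paper.

```python
# Minimal autoencoder sketch for embedding one-hot-encoded categorical
# attributes (illustrative only; layer sizes are arbitrary assumptions,
# not the exact AutoEmbedder configuration from the paper).
import numpy as np
from tensorflow import keras

def build_autoencoder(input_dim: int, embedding_dim: int = 8):
    inputs = keras.Input(shape=(input_dim,))
    hidden = keras.layers.Dense(32, activation="relu")(inputs)
    embedding = keras.layers.Dense(embedding_dim, activation="relu", name="embedding")(hidden)
    hidden_dec = keras.layers.Dense(32, activation="relu")(embedding)
    outputs = keras.layers.Dense(input_dim, activation="sigmoid")(hidden_dec)
    autoencoder = keras.Model(inputs, outputs)
    encoder = keras.Model(inputs, embedding)
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    return autoencoder, encoder

# Toy one-hot data: 200 observations, 20 binary columns.
X = np.random.randint(0, 2, size=(200, 20)).astype("float32")
autoencoder, encoder = build_autoencoder(X.shape[1])
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)
embeddings = encoder.predict(X, verbose=0)  # dense representation of the categorical part
```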
Figure 2. Proposed methodology of clustering data containing categorical variables.
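A condensed, hypothetical sketch of the workflow in Figure 2 is given below: the scaled numerical attributes are concatenated with the categorical embeddings, a scaled cosine-similarity matrix is computed, edges whose similarity exceeds the cut-off parameter form a graph, and a community-detection algorithm assigns the groups. Function and variable names, and the min-max rescaling of the similarity matrix, are assumptions made for illustration rather than the authors' code.

```python
# Sketch of the clustering pipeline from Figure 2 (simplified; the exact
# preprocessing steps are assumptions, not the authors' implementation).
import numpy as np
import networkx as nx
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
from networkx.algorithms import community  # requires networkx >= 2.8 for louvain_communities

def cluster_mixed_data(numeric_part, embedded_categorical, cut_off=0.5):
    # 1. Combine scaled numeric features with the categorical embeddings.
    numeric_scaled = MinMaxScaler().fit_transform(numeric_part)
    features = np.hstack([numeric_scaled, embedded_categorical])

    # 2. Cosine-similarity matrix rescaled to [0, 1].
    sim = cosine_similarity(features)
    sim = (sim - sim.min()) / (sim.max() - sim.min())

    # 3. Keep only edges whose similarity exceeds the cut-off parameter.
    adjacency = np.where(sim >= cut_off, sim, 0.0)
    np.fill_diagonal(adjacency, 0.0)
    graph = nx.from_numpy_array(adjacency)

    # 4. Community detection (Louvain here; greedy modularity or
    #    label propagation can be substituted).
    communities = community.louvain_communities(graph, seed=0)
    labels = np.empty(len(features), dtype=int)
    for group_id, nodes in enumerate(communities):
        labels[list(nodes)] = group_id
    return labels, graph
```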
Figure 3. Histogram of scaled cosine similarity for the dataset of individual customers.
Figure 4. Histogram of scaled cosine similarity for the dataset of resellers.
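Histograms such as those in Figures 3 and 4 can be obtained directly from the scaled similarity matrix. The short sketch below uses random placeholder features and matplotlib, so the shape of the resulting plot is not meaningful; the embedded customer data would be substituted in practice.

```python
# Distribution of scaled pairwise cosine similarities (cf. Figures 3 and 4).
# `features` is a random placeholder; substitute the embedded customer data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

features = np.random.default_rng(0).normal(size=(200, 10))
sim = cosine_similarity(features)
sim = (sim - sim.min()) / (sim.max() - sim.min())   # scale to [0, 1]
pairs = sim[np.triu_indices_from(sim, k=1)]         # each pair counted once

plt.hist(pairs, bins=50)
plt.xlabel("Scaled cosine similarity")
plt.ylabel("Number of customer pairs")
plt.show()
```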
Table 1. Summary of main notations in the paper.

Notation | Explanation
x = (x_1, x_2, …, x_n) | A vector
y = (y_1, y_2, …, y_n) | A vector
G | A graph
V | The vertices of the graph G
E | The edges of the graph G
A = {A_1, …, A_R} | A division of V into communities
B = {B_1, …, B_S} | A division of V into communities
C | A contingency matrix of size R × S
C_ij | The number of common vertices of A_i and B_j
C = {c_1, c_2, …, c_k} | The cluster centroids
n | The number of observations
k | The number of clusters
C_{x_i}^1, …, C_{x_i}^k | The neighbours of x
d(c_i, c_j) | The distance between the ith and jth cluster centroids
σ_i | The mean distance of all points in the ith cluster from the cluster centre c_i
sim(x, y) | The value of cosine similarity
c_i | The ith cluster centre
b(i) | The smallest mean distance of observation i to all points in any other cluster
a(i) | The mean distance of observation i from all other points in the same group
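For completeness, the standard definitions of the cosine similarity and of the silhouette value of observation i, written in the notation of Table 1, read as follows:

```latex
\mathrm{sim}(x,y)=\frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}},
\qquad
s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}.
```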
Table 2. NMI index for retail customers clustering.

      | L.25 | L.5  | L.75 | GM.25 | GM.5 | GM.75 | LP.25 | LP.5 | LP.75 | DBS  | KM
L.25  | 1    | 0.91 | 0.6  | 0.84  | 0.83 | 0.79  | 0     | 0    | 0.81  | 0.62 | 0.67
L.5   | 0.91 | 1    | 0.61 | 0.79  | 0.84 | 0.8   | 0     | 0    | 0.82  | 0.63 | 0.64
L.75  | 0.6  | 0.61 | 1    | 0.55  | 0.6  | 0.68  | 0     | 0    | 0.67  | 0.55 | 0.48
GM.25 | 0.84 | 0.79 | 0.55 | 1     | 0.76 | 0.7   | 0     | 0    | 0.71  | 0.58 | 0.73
GM.5  | 0.83 | 0.84 | 0.6  | 0.76  | 1    | 0.8   | 0     | 0    | 0.83  | 0.62 | 0.63
GM.75 | 0.79 | 0.8  | 0.68 | 0.7   | 0.8  | 1     | 0     | 0    | 0.93  | 0.64 | 0.59
LP.25 | 0    | 0    | 0    | 0     | 0    | 0     | 1     | 1    | 0     | 0    | 0
LP.5  | 0    | 0    | 0    | 0     | 0    | 0     | 1     | 1    | 0     | 0    | 0
LP.75 | 0.81 | 0.82 | 0.67 | 0.71  | 0.83 | 0.93  | 0     | 0    | 1     | 0.63 | 0.6
DBS   | 0.62 | 0.63 | 0.55 | 0.58  | 0.62 | 0.64  | 0     | 0    | 0.63  | 1    | 0.52
KM    | 0.67 | 0.64 | 0.48 | 0.73  | 0.63 | 0.59  | 0     | 0    | 0.6   | 0.52 | 1
Table 3. FMI index for retail customers clustering.

      | L.25 | L.5  | L.75 | GM.25 | GM.5 | GM.75 | LP.25 | LP.5 | LP.75 | DBS  | KM
L.25  | 1    | 0.98 | 0.76 | 0.95  | 0.95 | 0.92  | 0.71  | 0.71 | 0.94  | 0.8  | 0.87
L.5   | 0.98 | 1    | 0.76 | 0.93  | 0.96 | 0.93  | 0.7   | 0.7  | 0.94  | 0.8  | 0.85
L.75  | 0.76 | 0.76 | 1    | 0.73  | 0.76 | 0.79  | 0.59  | 0.59 | 0.79  | 0.7  | 0.7
GM.25 | 0.95 | 0.93 | 0.73 | 1     | 0.92 | 0.87  | 0.71  | 0.71 | 0.88  | 0.78 | 0.9
GM.5  | 0.95 | 0.96 | 0.76 | 0.92  | 1    | 0.93  | 0.71  | 0.71 | 0.94  | 0.8  | 0.85
GM.75 | 0.92 | 0.93 | 0.79 | 0.87  | 0.93 | 1     | 0.71  | 0.71 | 0.98  | 0.8  | 0.8
LP.25 | 0.71 | 0.7  | 0.59 | 0.71  | 0.71 | 0.71  | 1     | 1    | 0.71  | 0.6  | 0.71
LP.5  | 0.71 | 0.7  | 0.59 | 0.71  | 0.71 | 0.71  | 1     | 1    | 0.71  | 0.6  | 0.71
LP.75 | 0.94 | 0.94 | 0.79 | 0.88  | 0.94 | 0.98  | 0.71  | 0.71 | 1     | 0.8  | 0.82
DBS   | 0.8  | 0.8  | 0.7  | 0.78  | 0.8  | 0.8   | 0.6   | 0.6  | 0.8   | 1    | 0.75
KM    | 0.86 | 0.85 | 0.7  | 0.9   | 0.85 | 0.8   | 0.71  | 0.71 | 0.82  | 0.75 | 1
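Agreement matrices such as Tables 2, 3, 7 and 8 can be reproduced by scoring every pair of labelings with NMI and FMI. The sketch below uses scikit-learn with toy labelings, so the method names and values are purely illustrative.

```python
# Pairwise NMI/FMI agreement matrices between clusterings (cf. Tables 2-3, 7-8).
import numpy as np
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score, fowlkes_mallows_score

def agreement_matrix(labelings: dict, score_fn):
    """labelings: mapping such as {'L.25': labels_array, 'KM': labels_array, ...}."""
    names = list(labelings)
    mat = pd.DataFrame(index=names, columns=names, dtype=float)
    for a in names:
        for b in names:
            mat.loc[a, b] = round(score_fn(labelings[a], labelings[b]), 2)
    return mat

# Toy example with three hypothetical labelings of the same six customers.
labelings = {
    "L.25": np.array([0, 0, 1, 1, 2, 2]),
    "DBS":  np.array([0, 0, 1, 1, 1, 2]),
    "KM":   np.array([1, 1, 0, 0, 0, 0]),
}
print(agreement_matrix(labelings, normalized_mutual_info_score))  # NMI table
print(agreement_matrix(labelings, fowlkes_mallows_score))         # FMI table
```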
Table 4. The Calinski–Harabasz (CH) and Davies–Bouldin (DB) index values for the different methods and different cut-off parameters (retail customers).

Cut-Off Parameter | Method  | CH Index | DB Index
0.25              | Louvain | 801.783  | 1.626
0.25              | Greedy  | 796.954  | 1.640
0.25              | Label   | -        | -
0.5               | Louvain | 415.123  | 2.358
0.5               | Greedy  | 782.355  | 1.644
0.5               | Label   | -        | -
-                 | DBSCAN  | 314.034  | 2.484
-                 | k-means | 249.258  | 2.873
0.75              | Louvain | 542.745  | 2.080
0.75              | Greedy  | 773.752  | 1.624
0.75              | Label   | 781.243  | 1.625
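The CH and DB indices reported in Tables 4 and 9 are internal validity measures computed from the feature matrix and a labeling; higher CH and lower DB values indicate better-separated clusters. A minimal sketch with placeholder data is shown below.

```python
# Internal validity indices for a clustering (cf. Tables 4 and 9).
# `features` is a random placeholder for the embedded customer data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 8))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

print("CH index:", calinski_harabasz_score(features, labels))  # higher is better
print("DB index:", davies_bouldin_score(features, labels))     # lower is better
```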
Table 5. Modularity values for different graph methods and different cut-off parameters (retail customers).

Cut-Off Parameter | Method  | Modularity
0.75              | Louvain | 0.566
0.75              | Greedy  | 0.463
0.75              | Label   | 0.409
0.5               | Louvain | 0.309
0.5               | Greedy  | 0.309
0.25              | Louvain | 0.205
0.25              | Greedy  | 0.202
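The modularity values in Tables 5 and 10 are computed on the thresholded similarity graph. The sketch below evaluates the three graph methods with NetworkX on a placeholder graph (the karate-club graph stands in for the customer similarity graph).

```python
# Modularity of the partitions found by the three graph methods (cf. Tables 5, 10).
import networkx as nx
from networkx.algorithms import community  # louvain_communities needs networkx >= 2.8

graph = nx.karate_club_graph()  # placeholder; replace with the similarity graph

partitions = {
    "Louvain": community.louvain_communities(graph, seed=0),
    "Greedy":  community.greedy_modularity_communities(graph),
    "Label":   list(community.label_propagation_communities(graph)),
}
for name, parts in partitions.items():
    print(name, "modularity:", round(community.modularity(graph, parts), 3))
```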
Table 6. Total number of groups and group sizes for each method (retail customers).

Method  | Cut-Off Parameter | Group 0 | Group 1 | Group 2 | Total Number of Groups
Louvain | 0.75              | 565     | 929     | 746     | 3
Greedy  | 0.75              | 1285    | 955     | -       | 2
Label   | 0.75              | 1285    | 955     | -       | 2
Louvain | 0.5               | 40      | 1185    | 1015    | 3
Greedy  | 0.5               | 1200    | 1040    | -       | 2
Label   | 0.5               | 2240    | -       | -       | 1
Louvain | 0.25              | 1048    | 1192    | -       | 2
Greedy  | 0.25              | 1131    | 1109    | -       | 2
Label   | 0.25              | 2240    | -       | -       | 1
k-means | -                 | 1039    | 1201    | -       | 2
DBSCAN  | -                 | 950     | 831     | 16      | 3
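Group counts and sizes such as those in Tables 6 and 11 follow directly from the label vector returned by each method, for example:

```python
# Tallying the number of groups and their sizes from a label vector.
from collections import Counter
import numpy as np

labels = np.array([0, 1, 1, 0, 2, 1, 0, 0])  # toy labels from any method
sizes = Counter(labels.tolist())
print("total number of groups:", len(sizes))
print("group sizes:", dict(sorted(sizes.items())))
```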
Table 7. NMI index for wholesale customers clustering.

      | L.25 | L.5  | GM.25 | GM.5 | LP.25 | LP.5 | DBS  | KM
L.25  | 1    | 0.72 | 0     | 0.95 | 0     | 0.95 | 0.21 | 0.05
L.5   | 0.72 | 1    | 0.29  | 0.7  | 0     | 0.7  | 0.16 | 0.12
GM.25 | 0    | 0.29 | 1     | 0    | 0     | 0    | 0    | 0.09
GM.5  | 0.95 | 0.7  | 0     | 1    | 0     | 1    | 0.21 | 0.05
LP.25 | 0    | 0    | 0     | 0    | 1     | 0    | 0    | 0
LP.5  | 0.95 | 0.7  | 0     | 1    | 0     | 1    | 0.21 | 0.05
DBS   | 0.21 | 0.16 | 0     | 0.21 | 0     | 0.21 | 1    | 0.05
KM    | 0.05 | 0.12 | 0.09  | 0.05 | 0     | 0.05 | 0.05 | 1
Table 8. FMI index for wholesale customers clustering.

      | L.25 | L.5  | GM.25 | GM.5 | LP.25 | LP.5 | DBS  | KM
L.25  | 1    | 0.84 | 0.5   | 0.99 | 0.71  | 0.99 | 0.64 | 0.56
L.5   | 0.84 | 1    | 0.55  | 0.84 | 0.61  | 0.84 | 0.59 | 0.47
GM.25 | 0.5  | 0.55 | 1     | 0.5  | 0.71  | 0.5  | 0.6  | 0.57
GM.5  | 0.99 | 0.84 | 0.5   | 1    | 0.71  | 1    | 0.64 | 0.56
LP.25 | 0.71 | 0.61 | 0.71  | 0.71 | 1     | 0.71 | 0.85 | 0.75
LP.5  | 0.99 | 0.84 | 0.5   | 1    | 0.71  | 1    | 0.64 | 0.56
DBS   | 0.64 | 0.59 | 0.6   | 0.64 | 0.85  | 0.64 | 1    | 0.62
KM    | 0.56 | 0.47 | 0.57  | 0.56 | 0.75  | 0.56 | 0.62 | 1
Table 9. Calinski–Harabasz (CH) and Davies–Bouldin (DB) index values for the different methods and different cut-off parameters (wholesale customers).

Cut-Off Parameter | Method  | CH Index | DB Index
0.5               | Louvain | 44.240   | 3.785
0.5               | Greedy  | 69.146   | 3.163
0.5               | Label   | 69.146   | 3.163
-                 | DBSCAN  | 59.238   | 2.869
-                 | k-means | 59.579   | 3.167
0.25              | Louvain | 68.947   | 3.167
0.25              | Greedy  | 22.983   | 5.445
0.25              | Label   | -        | -
Table 10. Modularity values for different graph methods and different cut-off parameters (wholesale customers).

Cut-Off Parameter | Method  | Modularity
0.5               | Louvain | 0.279
0.5               | Greedy  | 0.290
0.5               | Label   | 0.290
0.25              | Louvain | 0.086
0.25              | Greedy  | 0.053
Table 11. Total number of groups and group sizes for each method (wholesale customers).

Method  | Cut-Off Parameter | Group 0 | Group 1 | Group 2 | Total Number of Groups
Louvain | 0.5               | 151     | 215     | 335     | 3
Greedy  | 0.5               | 351     | 350     | -       | 2
Label   | 0.5               | 350     | 351     | -       | 2
Louvain | 0.25              | 1048    | 1192    | -       | 2
Greedy  | 0.25              | 1131    | 1109    | -       | 2
Label   | 0.25              | 356     | 345     | -       | 2
k-means | -                 | 382     | 319     | -       | 2
DBSCAN  | -                 | 119     | -       | -       | 1