Article

Unsupervised Decision Trees for Axis Unimodal Clustering

Department of Computer Science and Engineering, University of Ioannina, 45110 Ioannina, Greece
*
Author to whom correspondence should be addressed.
Information 2024, 15(11), 704; https://doi.org/10.3390/info15110704
Submission received: 5 September 2024 / Revised: 8 October 2024 / Accepted: 22 October 2024 / Published: 5 November 2024

Abstract

The use of decision trees for obtaining and representing clustering solutions is advantageous due to their interpretability. We propose a method called Decision Trees for Axis Unimodal Clustering (DTAUC), which constructs unsupervised binary decision trees for clustering by exploiting the concept of unimodality. Unimodality is a key property indicating the grouping behavior of data around a single density mode. Our approach is based on the notion of an axis unimodal cluster: a cluster where all features are unimodal, i.e., the set of values of each feature is unimodal as decided by a unimodality test. The proposed method follows the typical top-down splitting paradigm for building axis-aligned decision trees and aims to partition the initial dataset into axis unimodal clusters by applying thresholding on multimodal features. To determine the decision rule at each node, we propose a criterion that combines unimodality and separation. The method automatically terminates when all clusters are axis unimodal. Unlike typical decision tree methods, DTAUC does not require user-defined hyperparameters, such as maximum tree depth or the minimum number of points per leaf, except for the significance level of the unimodality test. Comparative experimental results on various synthetic and real datasets indicate the effectiveness of our method.

1. Introduction

Decision trees are popular machine learning models mainly due to their interpretability since their decision-making process can be directly represented as a set of rules. Each internal node in a decision tree contains a decision rule that determines the child node to be visited, while leaf nodes provide the final decision. In a typical decision tree, the decision rule of an internal node involves simple thresholding on a feature value, thus it is straightforward to interpret. From a geometrical point of view, typical decision trees (often called axis-aligned trees) partition the data space into hyperrectangular regions.
Decision trees have been widely employed for supervised learning tasks. The general approach for building decision trees follows a greedy top-down iterative strategy that splits the original dataset into subsets. At each step, a decision rule is determined that splits the dataset under consideration so that a certain criterion is improved. For example, in the case of classification problems the criterion is related to the homogeneity of the resulting partition with respect to class labels. In this way, partitions into subsets that contain data of the same class are preferred. Well-known supervised decision tree algorithms include Classification and Regression Trees (CART) [1], Iterative Dichotomiser 3 (ID3) [2], C4.5 [3] and the Chi-squared Automatic Interaction Detector (CHAID) [4]. In CART, the partition tree is built using a few binary conditions obtained from the original data features, with the Gini index serving as the splitting criterion. ID3 and C4.5 (an improvement upon ID3) employ the class entropy measure as the splitting criterion, while CHAID uses the statistical χ² test to determine the best split during the tree-growing process. It should be stressed that in the supervised case, the use of cross-validation techniques enables the determination of appropriate values for the various hyperparameters of the methods (e.g., maximum tree depth, minimum number of points per leaf, etc.).

1.1. Decision Trees for Interpretable Clustering

Clustering methods aim at partitioning a set of points into groups, the clusters, such that data within the same group share common characteristics and differ from data in other groups. Most of the popular clustering algorithms, such as k-means, do not directly provide any explanation of the clustering result. To overcome this limitation, some research works propose the use of decision tree models for clustering in order to achieve explainability. Tree-based clustering methods return unsupervised binary trees that provide an interpretation of the data partitioning.
While in the supervised case, the construction of decision trees is relatively straightforward due to the presence of target information, this task becomes more challenging in the unsupervised case (e.g., clustering) where only data points are available. The difficulty arises for two reasons:
  • Definition of splitting criterion. Metrics like information gain or Gini index, which are commonly used to guide the splitting process in supervised learning, cannot be applied in unsupervised learning, since no data labels are available.
  • Specification of hyperparameters (e.g., number of clusters), since cross-validation cannot be applied.
Despite the apparent difficulties, several methods have been proposed to build decision trees for clustering. The category of indirect methods typically follows a two-step procedure: first, they obtain cluster labels using a clustering algorithm, such as k-means, and then they apply a supervised decision tree algorithm to build a decision tree that interprets the resulting clusters. For example, in [5], labels obtained from k-means are used as a preliminary step in tree construction. Similarly, in [6], the centroids derived from k-means are also involved in splitting procedures. Indirect methods heavily rely on the clustering result of their first stage. Moreover, fitting the cluster labels with an axis-aligned decision tree may be problematic since the clusters are typically of spherical or ellipsoidal shape. It is also assumed that the number of clusters is given by the user.
Direct methods integrate decision tree construction and partitioning into clusters. Many of them follow the typical top-down splitting procedure used in the supervised case but exploit unsupervised splitting criteria, e.g., compactness of the resulting subsets. Some direct unsupervised methods are described below.
In [7] a top-down tree induction framework with applicability to clustering (Predictive Clustering Trees) as well as to supervised learning tasks is proposed. It works similarly to a standard decision tree with the main difference being that the variance function and the prototype function, used to compute a label for each leaf, are treated as parameters that must be instantiated according to the specific learning task. The splitting criterion is based on the maximum separation (inter-cluster distances) between two clusters, while after the construction of the tree, a pruning step is applied using a validation set.
In [8], four measures for selecting the most appropriate split feature and two algorithms for partitioning the data at each decision node are proposed. The split thresholds are computed either by detecting the top k − 1 valley points of the histogram along a specific feature or by considering the inhomogeneity (information content) of the data with respect to some feature. Distance-related measures and histogram-based measures are proposed for selecting an appropriate split feature. For example, the deviation of a feature histogram from the uniform distribution is considered (although it depends on the bin size).
In [9] an unsupervised method is proposed, called Clustering using Unsupervised Binary Trees (CUBT), which achieves clustering through binary trees. This method involves a three-stage procedure: maximal tree construction, pruning, and joining. First, a maximal tree is grown by applying recursive binary splits to reduce the heterogeneity of the data (based on the input’s covariance matrices) within the new subsamples. Next, tree pruning is applied using a criterion of minimal dissimilarity. Finally, similar clusters (leaves of the tree) are joined, even if they do not necessarily share the same direct ascendant. Although CUBT constructs clusters directly using trees, it relies on several parameters throughout the three-stage process, while post hoc methods are required to combine leaves into unified clusters, which adds to the complexity and parameter dependency of the approach.
An alternative method for constructing decision trees for clustering is proposed in [10]. At first, noisy data points (uniformly distributed) are added to the original data space. Then, a standard (supervised) decision tree is constructed by classifying both the original data points and the noisy data points under the assumption that the original data points and the noisy data points belong to two different classes. A modified purity criterion is used to evaluate each split, in a way that dense regions (original data) as well as sparse regions (noisy data) are identified. However, this method requires additional preprocessing through the introduction of synthetic data in order to create the binary classification setting.
In contrast to axis-aligned trees, oblique trees allow test conditions that involve multiple features simultaneously, enabling oblique splits across the feature space. In [11] oblique trees for clustering are proposed, where each split is a hyperplane defined by a small number of features. Although oblique trees can produce more compact trees, finding the optimal test condition for a given node can be computationally expensive, while they may not always be interpretable [12].
An interesting direct approach [13] exploits the method of Optimal Classification Trees (OCT) [14], which are built in a single step by solving a mixed-integer optimization problem. Specifically, in [13] the Interpretable Clustering via Optimal Trees (ICOT) algorithm is presented, where two cluster validation criteria, the Silhouette Metric [15] and the Dunn Index [16] are chosen as objective functions. The ICOT algorithm begins with the initialization of a tree, which serves as the starting point. Two options are provided for a tree initialization: either a greedy tree is constructed or the k-means is used as a warm-start algorithm to partition the data into clusters and then OCT is used to generate a tree that separates these clusters. Next, ICOT runs a local search procedure until the objective value (Silhouette Metric or Dunn Index) reaches an optimum value. This process is repeated from many different starting trees, generating many candidate clustering trees. The final tree is chosen as the one with the highest cluster quality score across all candidate trees and is returned as the output of the algorithm. ICOT is able to handle both numerical and categorical features as well as mixed-type features efficiently, by introducing an appropriate distance metric. Although it performs well on very small datasets and trees, it is slower compared to other methods. In addition, there exist hyperparameters that have to be tuned by the user, such as the maximum depth of the tree and the minimum number of observations in each cluster.

1.2. Unimodality-Based Clustering

Data unimodality is closely related to clustering, since unimodality indicates grouping behavior around a single mode (peak). A distribution that is not unimodal is called multimodal, with two or more modes. Assessing data unimodality means estimating whether the data have been generated by an arbitrary unimodal distribution. Recognizing unimodality is fundamental for understanding data structure; for instance, clustering methods are irrelevant for unimodal data, since such data form a single coherent cluster [17].
A few methods exist for assessing the unimodality of univariate datasets, such as the well-known dip-test for unimodality [18] and the more recent UU-test [19]. It should be noted that these tests are applied to univariate data only. To decide on the unimodality of a multidimensional dataset, some techniques have been proposed that apply the dip-test to univariate sets containing, for example, 1-d projections of the data or distances between data points. Such techniques have also been exploited for clustering multidimensional datasets. Unimodality assessment has been used either in a top-down fashion, by splitting clusters that are decided as multimodal [20,21], or in a bottom-up fashion, by merging clusters whose union is decided as unimodal [22]. In addition to its intuitive justification, the use of unimodality provides a natural way to terminate the splitting or merging procedure, thus allowing for the automated estimation of the number of clusters. Since those methods provide ellipsoidal or arbitrarily shaped clusters, the results are not interpretable. We aim to tackle this issue by proposing a unimodality-based method for the construction of decision trees for clustering.

1.3. Contribution

Our approach is based on the notion of an axis unimodal cluster: a cluster where all features are unimodal, i.e., the set of values of each feature is unimodal as decided by a unimodality test. The proposed method, called Decision Trees for Axis Unimodal Clustering (DTAUC), follows the typical top-down splitting paradigm for building axis-aligned decision trees and aims to partition the initial dataset into axis unimodal clusters. The decision rule at each node involves an appropriately selected feature and the corresponding threshold value. More specifically, given the dataset at each node, the multimodal features are first detected. For each multimodal feature, we follow a greedy strategy to detect the best threshold value (denoted as the split threshold) that splits the set of feature values into subsets so that the unimodality of the partition is increased. We propose a criterion to assess the unimodality of the partition based on the p-values provided by the unimodality test. To improve performance, we combine this criterion with another one that measures the separation of data points before and after the split point, thus obtaining the final criterion used to assess the quality of a split. Based on this criterion, the best split threshold for a multimodal feature is determined. The procedure is repeated for every multimodal feature and the feature-threshold pair of highest quality is used to define the decision rule of the node. When the data subset in a node does not contain any multimodal features, i.e., it is axis unimodal, no further splitting occurs, the node is characterized as a leaf and a cluster label is assigned to it. In this way, at the end of the method, a partitioning of the original dataset into axis unimodal clusters has been achieved, which is interpretable since it is represented by an axis-aligned decision tree.
The proposed DTAUC algorithm is direct (e.g., does not employ k-means as a preprocessing step), end-to-end and relies on the intuitively justified notion of unimodality. It is simple to implement and does not employ computationally expensive optimization methods. It contains no hyperparameters except for the statistical significance level of the unimodality test. The latter remark is important since most unsupervised decision tree methods include hyperparameters such as number of clusters, maximum tree depth, etc., which are difficult to tune in an unsupervised setting.
The outline of the paper is as follows: in Section 2 we provide the necessary definitions and notations. The notion of unimodality is explained, the dip-test for unimodality is briefly presented and the definition of axis unimodal cluster is provided. In Section 3 we describe the proposed method (DTAUC) for constructing unsupervised decision trees for clustering that mainly relies on the computation of appropriate split thresholds for partitioning multimodal features. Comparative experimental results are provided in Section 4, while in Section 5 we provide conclusions and directions for future work.

2. Notations—Definitions

In this section, we provide some definitions needed to present and clarify our method. At first, we define the unimodality for univariate datasets and refer to a widely used statistical test, namely the dip-test [18], for deciding on the unimodality of a univariate dataset. Next, we present the definition for an axis unimodal dataset and provide notations related to the binary decision tree which is built by our method.

2.1. Unimodality of Univariate Data

Let X = {x_1, …, x_n}, x_i ∈ ℝ with x_i < x_{i+1}, be an ordered univariate dataset containing distinct real numbers. For an interval [a, b], we define X(a, b) = {x_i ∈ X : a ≤ x_i ≤ b} as the subset of X whose elements belong to that interval. Moreover, we denote as F_X(x) the empirical cumulative distribution function (ecdf) of X, defined as follows:

F_X(x) = (number of elements in the sample ≤ x) / n = (1/n) Σ_{i=1}^{n} I_{(−∞, x]}(x_i),

where I_{(−∞, x]}(x_i) is the indicator function: I_{(−∞, x]}(x_i) = 1 if x_i ≤ x and 0 otherwise. It also holds that F_X(x) = 0 if x < x_1 and F_X(x) = 1 if x ≥ x_n.
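As a small illustration of this definition, the sketch below computes F_X via a sorted array and a binary search; the function name and the sample values are our own, not taken from the paper.

```python
import numpy as np

def ecdf(sample):
    """Return a function F(x) giving the fraction of sample values <= x."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = xs.size
    def F(x):
        # number of sorted elements <= x, divided by n
        return np.searchsorted(xs, x, side="right") / n
    return F

# Example: the ecdf of five points
F = ecdf([1.2, 0.7, 3.4, 2.1, 2.8])
print(F(0.5), F(2.1), F(10.0))   # 0.0, 0.6, 1.0
```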
Regarding the unimodality of a distribution, there are two alternative definitions. The first relies on the probability density function (pdf): a pdf is unimodal if it has a single mode, i.e., a region where the density attains its maximum, with non-increasing density as one moves away from the mode. In other words, a pdf f(x) is a unimodal function if, for some value m, it is monotonically increasing for x ≤ m and monotonically decreasing for x ≥ m. In that case, the maximum value of f(x) is f(m) and there are no other local maxima. The second definition relies on the cumulative distribution function (cdf): a cdf F(x) is unimodal if there exist two points x_l and x_u such that F(x) can be divided into three parts: (a) a convex part (−∞, x_l), (b) a constant part [x_l, x_u] and (c) a concave part (x_u, ∞). It is possible for either the first two parts or the last two parts to be missing. It should be stressed that the uniform distribution is unimodal and its cdf is linear. A distribution that is not unimodal is called multimodal, with two or more modes. Those modes typically appear as distinct peaks (local maxima) in the pdf plot. A distribution with exactly two modes is called bimodal.

2.2. The Dip-Test for Unimodality

In order to decide on the unimodality of a univariate dataset, we use Hartigans' dip-test [18], which constitutes the most popular unimodality test. Given a 1-d dataset X = {x_1, …, x_n}, x_i ∈ ℝ, it computes the dip statistic as the maximum difference between the ecdf of the data and the unimodal distribution function that minimizes that maximum difference. In other words, the dip value dip(X) is the minimum among the maximum deviations observed between the ecdf F and the cdfs of the family of unimodal distributions, i.e., it measures the departure of the ecdf from unimodality. The uniform distribution is the asymptotically least favorable unimodal distribution, thus the distribution of the test statistic is determined asymptotically and empirically through sampling from the uniform distribution. The dip-test returns not only the dip value but also the statistical significance of the computed dip value, i.e., a p-value. To compute the p-value, the class of uniform distributions U is used as the null hypothesis, since its dip values are stochastically larger than those of other unimodal distributions, such as those with exponentially decreasing tails. The computation of the p-value uses b bootstrap sets U_n^r (r = 1, …, b) of n observations each, sampled from the U[0, 1] uniform distribution. The p-value is computed as the probability of dip(X) being less than dip(U_n^r):

P = #[dip(X) ≤ dip(U_n^r)] / b.

The null hypothesis H_0 that F is unimodal is accepted at significance level α if p-value > α; otherwise, H_0 is rejected in favor of the alternative hypothesis H_1, which suggests multimodality. The dip-test has a runtime of O(n) [18], but since its input must be sorted to compute the ecdf, the effective runtime is O(n log n).
Figure 1 illustrates examples of unimodal and multimodal datasets in terms of pdf plots (histograms) and ecdf plots. The p-values provided by the dip-test are also presented above each subfigure. In Figure 1a a unimodal dataset generated by a Gaussian distribution is illustrated along with the corresponding p-value = 0.98 . p-values close to 1 indicate unimodality and this is also evident in that case. In contrast, Figure 1b presents a dataset generated by two close Gaussian distributions which does not clearly constitute a unimodal or multimodal dataset. This uncertainty is indicated by the computed p-value of 0.08, which is closer to 0. In this case, the significance level defined by the user plays a crucial role in determining unimodality. For example, if the significance level is set to α = 0.1 , then a p-value less than α leads to the dataset being determined multimodal. However, with significance levels of α = 0.01 or α = 0.05 , the dataset would be considered unimodal. In Figure 1c,d two strongly multimodal datasets are shown, with two and three peaks, respectively. The p-value is 0 in both datasets, and the test decides multimodality regardless of the significance level chosen by the user.
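To reproduce this kind of p-value in practice, one option is the third-party Python diptest package (our assumption; the paper does not prescribe an implementation), whose diptest.diptest call returns the dip statistic together with a p-value:

```python
import numpy as np
import diptest  # third-party package assumed: pip install diptest

rng = np.random.default_rng(0)

# A unimodal sample (one Gaussian) and a clearly bimodal one (two well-separated Gaussians)
unimodal = rng.normal(0.0, 1.0, size=1000)
bimodal = np.concatenate([rng.normal(-3.0, 1.0, size=500),
                          rng.normal(3.0, 1.0, size=500)])

for name, sample in [("unimodal", unimodal), ("bimodal", bimodal)]:
    dip, pval = diptest.diptest(sample)   # dip statistic and its p-value
    verdict = "unimodal" if pval > 0.05 else "multimodal"   # significance level alpha = 0.05
    print(f"{name}: dip = {dip:.4f}, p-value = {pval:.3f} -> {verdict}")
```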

2.3. Axis Unimodal Dataset

Let X ⊂ ℝ^d be a dataset consisting of data vectors in a d-dimensional space. Each point x ∈ X can be represented as a vector x = (x_1, x_2, …, x_d), where x_j ∈ ℝ denotes the j-th feature of x for j = 1, 2, …, d. We also denote as X_j the j-th feature vector of the dataset X, which consists of the j-th feature values of all points in X. Given a dataset X, a feature j is characterized as unimodal or multimodal based on the unimodality or multimodality of X_j.
Definition 1.
A d-dimensional dataset X is axis unimodal if every feature j is unimodal, i.e., each univariate set X_j (j = 1, …, d), consisting of the j-th feature values, is unimodal.
Obviously, in order for a dataset X to be decided as axis unimodal, a unimodality test (e.g., the dip-test) should decide unimodality for each set X_j. A dataset that is not axis unimodal will be called axis multimodal.
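Definition 1 translates directly into a per-feature dip-test. A minimal sketch is given below, again assuming the diptest package; the helper name is ours.

```python
import numpy as np
import diptest  # assumed third-party dip-test implementation

def is_axis_unimodal(X, alpha=0.05):
    """Return True if every feature (column) of X passes the dip-test at significance level alpha."""
    X = np.asarray(X, dtype=float)
    for j in range(X.shape[1]):
        _, pval = diptest.diptest(X[:, j])
        if pval <= alpha:      # feature j is decided multimodal
            return False
    return True
```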

2.4. Node Splitting

Let u be a node during decision tree construction and X the corresponding set of data vectors to be split by applying a thresholding rule on a feature value. A split rule for u is defined as the pair (j, sp) ∈ {1, 2, …, d} × ℝ, where j is the feature on which the rule is applied and sp is the corresponding threshold. A splitting rule of the form {x ∈ X : x_j ≤ sp} is then applied to the node. We denote the subset of X that satisfies this condition as X_L, while the set of points that do not satisfy it is denoted as X_R. Two child nodes u_L and u_R are then created, corresponding to the subsets X_L and X_R, respectively. In our method, we aim to determine the feature-threshold pair that results in a decrease in the multimodality of the partition, as computed by an appropriately defined criterion. We denote the best-split pair as (j*, sp*), where j* and sp* denote the best feature and the best split threshold, respectively. If for the dataset X no features are detected for splitting, then node u is considered a leaf. This occurs when the dataset X is axis unimodal.

3. Axis Unimodal Clustering with a Decision Tree Model

The proposed method can be considered as a divisive (i.e., incremental) clustering approach that is based on binary cluster splitting and produces rectangular axis unimodal clusters. It starts with the whole dataset as a single cluster and, at each iteration, it selects an axis multimodal cluster and splits this cluster into two subclusters. The method terminates when all produced clusters are axis unimodal. Binary cluster splitting is implemented by applying a decision threshold on the values of a multimodal feature. In this way the cluster assignment procedure can be represented with a typical (axis-aligned) decision tree, ensuring the interpretability of the clustering decision. Obviously, the leaves of the decision tree correspond to axis unimodal clusters.
Since our objective is to produce axis unimodal clusters, we consider multimodal features for cluster splitting. Let X be the multimodal cluster to be split and X j be the set of values of a multimodal feature j. Since X j is multimodal, our objective is to determine a splitting threshold such that the splitting of X j will result in two subsets X j L and X j R that are less multimodal than X j (ideally they should be both unimodal). We define a criterion to evaluate the partition ( X j L , X j R ) in terms of unimodality and separation. As in a typical case for decision tree construction, we evaluate several partitions obtained by considering all multimodal features and several candidate threshold values for each feature. The best partition is determined according to the criterion and the corresponding feature-threshold pair, which defines the decision rule for splitting cluster X. The criterion used to evaluate a partition is presented next.

3.1. Splitting Multimodal Features

Let S = {s_1, s_2, …, s_N} denote the set of values corresponding to a multimodal feature. Initially, we sort the values s_i, i = 1, 2, …, N in ascending order. For each i = 1, 2, …, N − 1 we consider the average of s_i and its successor s_{i+1} as a candidate split threshold sp. Given a threshold value sp, S is partitioned into two subsets: a left subset S_L (values on the left of sp) and a right subset S_R (values on the right of sp). Let also N_L and N_R be the sizes of S_L and S_R, respectively. The dip-test is then applied to both subsets to assess their unimodality, yielding two p-values: p_L for S_L and p_R for S_R. To evaluate the effectiveness of threshold sp, we define a weighted p-value of the partition:

p_split = (N_L / N) p_L + (N_R / N) p_R.
An intuitive justification of the p_split formula is the following:
  • If both subsets S_L and S_R are multimodal, then both p_L and p_R are low, resulting in a low p_split value.
  • If both subsets S_L and S_R are unimodal, then both p_L and p_R are high, resulting in a high p_split value.
  • In case one subset (say S_L) is unimodal and the other (say S_R) is multimodal, we need to consider the size of each subset: if N_L > N_R, and since p_L > p_R (unimodal S_L, multimodal S_R), the resulting p_split value is high. In the opposite case, the set S_R becomes the dominant set and thus p_split takes a lower value.
Case 2 describes a scenario where the data split into two unimodal sets, resulting in high p_split values. In case 3, a dominant unimodal subset is compared against a relatively small multimodal subset, also yielding high p_split values. These observations indicate that high p_split values occur when the split highlights one or two unimodal subsets. By selecting the candidate threshold value providing the highest p_split value, we obtain a partition of S into subsets of increased average unimodality.
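As a purely illustrative numerical example (the numbers are ours, not from the experiments): suppose a candidate threshold leaves N_L = 900 points on the left with p_L = 0.85 and N_R = 100 points on the right with p_R = 0.02. Then p_split = (900/1000) × 0.85 + (100/1000) × 0.02 = 0.767, so the large unimodal subset dominates (case 3 above). If the subset sizes were reversed, p_split = (100/1000) × 0.85 + (900/1000) × 0.02 = 0.103, reflecting that the dominant subset is now multimodal.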
Following the above procedure, we experimentally noticed that, although sensible splittings were generally obtained, the selected threshold sp was not always very accurate. For example, Figure 2a illustrates the histogram of a bimodal dataset along with the computed sp marked with a star. The splitting can be considered successful since the dataset is split into two unimodal subsets; however, the sp presented in Figure 2b is a more accurate split point than the one in Figure 2a. Thus, there is room for improvement in threshold determination. To tackle this issue we consider not only the unimodality of the partition, but also the separation of the points before and after the threshold sp. More specifically, we consider a subset of w successive points right before sp and a second subset of w successive points right after sp. We define the separation (sep) of a threshold value sp as the average distance among all pairs of points belonging to different subsets. A large distance value indicates a high separation between the points before and after sp, i.e., sp lies in a density valley. Therefore, the split corresponding to sp is effective if the separation is high. We denote as sep(sp) the separation of the points defined by a split point sp. We choose a small value of w (e.g., w = 0.01 × N), with the choice of w not affecting the final result. Since the sep computation requires at least w points before and after sp, we do not consider candidate threshold values defined by the first w and last w points of S.
Therefore, since our objective is to determine a threshold with both a high p_split value and a high sep value, we define a new criterion q = p_split × sep to measure the quality of a split. A high q value indicates a split into two highly separated (high sep value) and unimodal (high p_split value) subsets. Thus, we choose the candidate threshold sp resulting in the maximum q value as the best split threshold and denote it as sp*. Algorithm 1 presents the steps for computing the best split point sp* of a univariate dataset S. It takes as input the univariate dataset S and the significance level α and returns the best split point sp* of S along with the corresponding q value. In case S is unimodal, as decided by the dip-test, Algorithm 1 returns the empty set. Figure 3 illustrates the histogram plots and the best split points (marked as stars) of several datasets generated by sampling from mixtures of Gaussian, uniform and triangular distributions.
Algorithm 1  (sp*, q*) = best_split_point(S, α)
1. p-value ← dip-test(S, α)
2. if p-value > α return ∅    // S: unimodal
3. S ← sort(S)
4. for each i with w < i ≤ N − w do
5.     sp_i ← (s_i + s_{i+1}) / 2
6.     S_L^i ← S(s_1, sp_i)
7.     S_R^i ← S(sp_i, s_N)
8.     p_L^i ← dip-test(S_L^i, α)
9.     p_R^i ← dip-test(S_R^i, α)
10.    p_split^i ← (N_L^i / N) p_L^i + (N_R^i / N) p_R^i
11.    sep_i ← sep(sp_i)
12.    q_i ← p_split^i × sep_i
13. end for
14. i* ← argmax_i (q_i)
15. sp* ← sp_{i*}
16. q* ← q_{i*}
17. return (sp*, q*)
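A compact Python rendering of Algorithm 1 is sketched below. It follows the description above, but the concrete details (the diptest package for the unimodality test, the window w = 0.01 N, the small-sample guard and the helper name) are our own illustration rather than the authors' implementation.

```python
import numpy as np
import diptest  # assumed third-party dip-test implementation

def best_split_point(S, alpha=0.05):
    """Return (sp_star, q_star) for a 1-d multimodal sample S, or None if S is unimodal."""
    S = np.sort(np.asarray(S, dtype=float))
    N = S.size
    _, pval = diptest.diptest(S)
    if pval > alpha:                        # S is unimodal: nothing to split
        return None
    w = max(1, int(0.01 * N))               # window size used by the separation term
    best = None
    for i in range(w, N - w):               # skip thresholds defined by the first/last w points
        sp = 0.5 * (S[i] + S[i + 1])        # candidate threshold between successive values
        S_L, S_R = S[:i + 1], S[i + 1:]
        if S_L.size < 4 or S_R.size < 4:    # practical guard: the dip-test needs a few points per side
            continue
        _, p_L = diptest.diptest(S_L)
        _, p_R = diptest.diptest(S_R)
        p_split = (S_L.size / N) * p_L + (S_R.size / N) * p_R
        # separation: mean pairwise distance between the w points before and the w points after sp
        left, right = S[i + 1 - w:i + 1], S[i + 1:i + 1 + w]
        sep = np.abs(left[:, None] - right[None, :]).mean()
        q = p_split * sep
        if best is None or q > best[1]:
            best = (sp, q)
    return best
```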

3.2. Decision Tree Construction

Next, we describe our method, called Decision Trees for Axis Unimodal Clustering (DTAUC), for obtaining interpretable axis unimodal partitions of a multidimensional dataset. Our method employs a divisive (top-down) procedure, thus we first assign the whole initial dataset to the root node. Assuming that at some iteration a node u contains a dataset X, our goal is to determine the splitting rule for node u. This involves determining the best pair consisting of a multimodal feature and the corresponding split threshold.
To identify the best split for u we work as follows: first, we apply the dip-test to detect the multimodal features of X. If all features are unimodal, node u is considered a leaf and no split occurs. If multimodal features exist, then for each multimodal feature j, Algorithm 1 is used to compute its best split threshold sp_j and the corresponding evaluation q_j of the resulting partition. Among the multimodal features, we select as best the one with the maximum q_j value. Algorithm 2 describes the steps for determining the splitting rule of a dataset X. It takes the set X and a significance level α as input and returns the best pair (j*, sp*), where j* is the selected multimodal feature and sp* the corresponding threshold.
Algorithm 2  (j*, sp*) = best_split(X, α)
1. for each feature X_j do
2.     p_j-value ← dip-test(X_j, α)
3.     if p_j-value ≤ α then    // X_j: multimodal
4.         (sp_j, q_j) ← best_split_point(X_j, α)
5.     end if
6. end for
7. if p_j-value > α for all j, return ∅    // X: axis unimodal
8. j* ← argmax_j (q_j)
9. sp* ← sp_{j*}
10. return (j*, sp*)
In case the best split for u exists (i.e., u is not considered a leaf), the data vectors of X are partitioned into two subsets, X_L and X_R, based on the values of feature j*: X_L = {x ∈ X : x_{j*} ≤ sp*} and X_R = {x ∈ X : x_{j*} > sp*}. Therefore, two child nodes of u, denoted as u_L and u_R, are added to the tree, corresponding to the sets X_L and X_R, respectively. Finally, the method is applied recursively on each resulting node, until all nodes are identified as leaves, i.e., the subsets in all nodes are axis unimodal. We assign each leaf a cluster label, meaning that each leaf represents a single cluster. Therefore, an axis unimodal partition of the initial dataset X into hyperrectangles is obtained. Algorithm 3 describes the proposed DTAUC method. It takes a multidimensional dataset X and a significance level α as input and returns the constructed tree. It should be emphasized that the algorithm does not require as input the number of clusters, which is automatically determined by the method.
Algorithm 3  DTAUC(X, α)
1. Create a root node u corresponding to X
2. (j*, sp*) ← best_split(X, α)
3. if (j*, sp*) = ∅ then    // X: axis unimodal
4.     return the leaf u
5. else
6.     X_L = {x ∈ X : x_{j*} ≤ sp*}
7.     X_R = {x ∈ X : x_{j*} > sp*}
8.     u_L ← DTAUC(X_L, α)
9.     u_R ← DTAUC(X_R, α)
10.    return the decision tree rooted at u
11. end if
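The sketch below wires Algorithms 2 and 3 together, reusing the best_split_point function from the earlier sketch (and again the diptest package). The nested-dictionary tree representation is an illustrative choice of ours, not the authors' data structure.

```python
import numpy as np
import diptest  # assumed third-party dip-test implementation
# assumes best_split_point() from the previous sketch is in scope

def best_split(X, alpha=0.05):
    """Algorithm 2: best (feature, threshold) over all multimodal features, or None if X is axis unimodal."""
    X = np.asarray(X, dtype=float)
    best = None                                  # (j, sp, q)
    for j in range(X.shape[1]):
        _, pval = diptest.diptest(X[:, j])
        if pval <= alpha:                        # feature j is multimodal
            result = best_split_point(X[:, j], alpha)
            if result is not None and (best is None or result[1] > best[2]):
                best = (j, result[0], result[1])
    return None if best is None else (best[0], best[1])

def dtauc(X, alpha=0.05):
    """Algorithm 3 (DTAUC): recursively split X until every leaf is axis unimodal."""
    X = np.asarray(X, dtype=float)
    rule = best_split(X, alpha)
    if rule is None:                             # X is axis unimodal -> leaf (one cluster)
        return {"leaf": True, "size": len(X)}
    j, sp = rule
    mask = X[:, j] <= sp
    return {"leaf": False, "feature": j, "threshold": sp,
            "left": dtauc(X[mask], alpha),       # points with x_j <= sp
            "right": dtauc(X[~mask], alpha)}     # points with x_j > sp
```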

3.3. An Illustrative Example

Table 1 presents the intermediate steps of the application of DTAUC to the two-dimensional dataset (called X) illustrated in the first plot of Figure 4a. For each subset of X (listed in the first column), we provide each feature along with its unimodal (U)/multimodal (M) property (second column), as determined by the dip-test. The best split thresholds sp* and the corresponding q values for each feature are given in the third and fourth columns, respectively. The fifth column indicates whether the set mentioned in the first column is split or saved. If a split decision is made, the best split feature is mentioned in parentheses. Either two subsets are created (in case of a split decision), or the set specified in the first column is axis unimodal and is therefore saved in the set C, which contains the axis unimodal subsets.
In Figure 4 we provide illustrative plots corresponding to the step-by-step partition of the 2-D set X. Figure 4a displays the 2-D plot of the initial dataset X, along with the histogram plots of the feature vectors X_1 and X_2. A higher q value is computed for feature X_2 (q_2 = 1.96 > q_1 = 0.47), as shown in Table 1, thus we apply the split on feature X_2 with the threshold value sp_2 = 4.14. The partitioning of X into two subsets X_L and X_R is given in the right plot of Figure 4a. The dotted line illustrates the split threshold sp_2. The plot of X_L is presented in Figure 4b. The first feature is bimodal, while the second is unimodal, as indicated by the histogram plots and the q values for X_1 and X_2 in Table 1. Therefore, the split is applied on the first feature using the threshold value sp_3 = 6.17 (dotted line in the right plot of Figure 4b). This split results in two subsets, denoted as X_LL and X_LR. Figure 4c illustrates the 2-D plots of X_LL and X_LR along with the corresponding histograms for each feature. It is clear that X_LL and X_LR are axis unimodal, thus we save them in C. The 2-D plot of X_R is presented in Figure 4d, where it is clear that each feature is unimodal (the histogram plots and q values for X_R in Table 1 indicate unimodality), thus we save it in C. The final 2-D plot of X is given in Figure 4e, where the two resulting split thresholds (horizontal split sp_2 and vertical split sp_3) and the final partition {X_LL, X_LR, X_R} of X are illustrated. The corresponding binary decision tree for dataset X is presented in Figure 5.

4. Experimental Results

In this section we assess the performance of DTAUC on clustering synthetic and real data, focusing on the accurate estimation of the number of clusters and the quality of data partitioning. We compare the DTAUC method with the ICOT method [13] and the ExShallow method [5]. To the best of our knowledge, ICOT is the only method that provides a partition of the data into axis-aligned regions without using the ground-truth number of clusters during training. We also include an indirect method (ExShallow) in our experimental evaluation, in order to compare DTAUC and ICOT with a method that uses the ground-truth number of clusters.
The three methods were applied to both synthetic (from the Fundamental Clustering Problems Suite (FCPS) [23]) and real datasets (from UCI [24]). Since ground-truth clustering information is available for each dataset, we evaluated all three methods in terms of clustering performance using the widely used Normalized Mutual Information (NMI) score, defined as follows:

NMI(Y, C) = 2 I(Y; C) / (H(Y) + H(C)),

where Y denotes the ground-truth labels, C denotes the cluster labels, I(·) is the mutual information measure and H(·) the entropy. This score ranges between 0 and 1, with a value close to 1 indicating that the ground-truth partition has been recovered. All three methods build binary decision trees; therefore, we selected datasets suitable for partitioning into axis-aligned clusters for our experimental evaluation. Table 2 presents the parameters of each dataset (n: number of samples, d: number of features, k*: ground-truth number of clusters). We used min-max scaling for all datasets to ensure comparability with the ICOT method, which assumes features in the [0, 1] range.
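In practice, this score can be computed with scikit-learn; with arithmetic-mean averaging, normalized_mutual_info_score matches the definition above (a usage sketch with made-up labels, not the authors' evaluation code):

```python
from sklearn.metrics import normalized_mutual_info_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2]   # ground-truth labels Y
y_pred = [1, 1, 1, 0, 0, 2, 2, 2]   # cluster labels C (the label permutation does not matter)

# 'arithmetic' averaging corresponds to 2 * I(Y; C) / (H(Y) + H(C))
nmi = normalized_mutual_info_score(y_true, y_pred, average_method="arithmetic")
print(f"NMI = {nmi:.3f}")
```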
The DTAUC method uses a single parameter, the significance level α, which is required by the dip-test to decide data unimodality during the splitting procedure. To determine an appropriate α for each dataset, we used the silhouette score [15], which is commonly used to assess the quality of a clustering solution. Specifically, for each dataset, we run the method for each value of α ∈ {0.01, 0.05, 0.1}, compute the silhouette score of each obtained partition and keep the partition with the maximum score as the final partition. In the ICOT method, we utilized a k-means warm start and retained the remaining parameters as specified in [13]. We encountered challenges running ICOT on datasets with a large number of features or clusters. This aligns with observations made by the authors in [13], who reported excessive runtimes for some datasets. For the ExShallow method, we provided the ground-truth number of clusters to run k-means and obtain the cluster labels. Then, a supervised binary decision tree is built by minimizing appropriate metrics, as proposed in [5].
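The model-selection loop over α described above might look as follows. This sketch assumes the dtauc tree builder from the earlier sketch, adds a small helper of our own (assign_labels) that turns the tree's leaves into cluster labels, and uses the Iris data only as a stand-in example; MinMaxScaler handles the [0, 1] scaling mentioned above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_score

def assign_labels(tree, X, labels=None, idx=None, counter=None):
    """Walk a dtauc tree (earlier sketch) and give the points of each leaf a distinct cluster label."""
    if labels is None:
        labels, idx, counter = np.empty(len(X), dtype=int), np.arange(len(X)), [0]
    if tree["leaf"]:
        labels[idx] = counter[0]
        counter[0] += 1
        return labels
    mask = X[idx, tree["feature"]] <= tree["threshold"]
    assign_labels(tree["left"], X, labels, idx[mask], counter)
    assign_labels(tree["right"], X, labels, idx[~mask], counter)
    return labels

X = MinMaxScaler().fit_transform(load_iris().data)   # min-max scaling to [0, 1], as in the experiments

best_alpha, best_score, best_labels = None, -np.inf, None
for alpha in (0.01, 0.05, 0.1):
    labels = assign_labels(dtauc(X, alpha), X)
    if len(np.unique(labels)) > 1:                    # the silhouette score needs at least two clusters
        score = silhouette_score(X, labels)
        if score > best_score:
            best_alpha, best_score, best_labels = alpha, score, labels
print(f"selected alpha = {best_alpha}, silhouette = {best_score:.3f}")
```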
Table 3 and Table 4 present, for each dataset, the NMI values and the number of clusters (k) provided by the methods (it should be noted that in ExShallow the number of clusters is given). The ground-truth number of clusters (k*) is also provided in the second column of each table. The performance of DTAUC is superior to that of ICOT in most cases, achieving higher NMI values and closer estimates (k) of the ground-truth number of clusters (k*). However, DTAUC encounters challenges with some datasets, such as the Tetra dataset, where there is significant overlap among clusters. Another dataset where DTAUC demonstrates inferior performance is the Ruspini dataset. This dataset is relatively small (n = 75) and one of the four clusters is not compact. Consequently, DTAUC splits the non-compact cluster into two subclusters, detecting five clusters instead of four.
Another dataset to be discussed is Synthetic I, a three-dimensional dataset where the feature vectors X_1 and X_2 were generated using two Gaussian distributions and two uniform rectangles, and X_3 was generated using a uniform distribution. A 2-D plot of Synthetic I, with axes representing features X_1 and X_2, is provided in the left plot of Figure 6a. It should be noted that, since feature X_3 is uniformly distributed, it does not contribute to the splitting process. This dataset is separable by axis-aligned splits; however, the ICOT and ExShallow methods fail in this task. As shown in Figure 6a, ICOT fails to estimate the correct number of clusters (k = 3 instead of the actual k* = 4) (right plot), while the partition obtained by DTAUC (middle plot) is successful. The 2-D plot of the ExShallow solution is almost identical to the ICOT plot (right plot in Figure 6a).
DTAUC provides successful data partitions and accurately (or very closely) estimates the number of clusters for most synthetic and real datasets. In datasets where sparse clusters exist (e.g., Ruspini), DTAUC demonstrates inferior performance compared to ICOT, since, based on the criterion of unimodality, it decides to split those clusters. However, ICOT is inferior on simple datasets, such as Synthetic I and Lsun, particularly when the clusters are close to each other and have a rectangular shape. Regarding the indirect method (ExShallow), the information provided about the ground-truth number of clusters seems to be helpful in general. However, there are simple datasets where it provides inferior results, such as Synthetic I, Lsun and WingNut. For example, in the case of the WingNut dataset (illustrated in Figure 6b), a single vertical line is required to split the data into two clusters (left plot); however, ExShallow fails to correctly determine this split (right plot). This mainly occurs due to an incorrect initial partition provided by the k-means algorithm, which is employed in its initial processing step. On this dataset, both DTAUC and ICOT provide a successful solution (middle plot).

5. Conclusions and Future Work

We have proposed a method (DTAUC) for constructing binary trees for clustering based on axis unimodal partitions. This method follows the typical top-down paradigm for decision tree construction. It implements dataset splitting at each node by applying thresholding on the values of an appropriately selected multimodal feature. In order to select features and thresholds, a criterion has been proposed for the quality of the resulting partition that takes into account unimodality and separation. The method automatically terminates when the subsets in all nodes are axis unimodal.
The DTAUC method relies on the idea of unimodality, which is closely related to clustering. It is simple to implement and provides axis-aligned partitions of the data, thus offering interpretable clustering solutions. In addition, it does not involve any computationally expensive optimization technique, and it has the significant advantage that, apart from the typical statistical significance level, it does not require user-specified hyperparameters (for example, the number of clusters or the maximum depth of the tree) or post-processing techniques, such as a pruning step.
Future work could focus on using a set of features/splitting rules (instead of a single feature/splitting rule) at each node, as oblique trees do. While this would make the resulting trees less interpretable, it would offer more accurate clustering solutions. It is also interesting to implement post-processing steps to improve the performance of DTAUC. In DTAUC each tree leaf represents a single cluster. Several methods merge adjacent leaves into larger clusters, thereby capturing more complex structures in the data. After obtaining the final tree, we could consider the possibility of merging leaves if the unimodality assumption is retained.

Author Contributions

Conceptualization, P.C. and A.L.; methodology, P.C. and A.L.; software, P.C.; validation, P.C.; supervision, A.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CART: Classification and regression trees
CDF: Cumulative distribution function
CHAID: Chi-squared automatic interaction detector
CUBT: Clustering using unsupervised binary trees
DTAUC: Decision trees for axis unimodal clustering
ECDF: Empirical cumulative distribution function
ICOT: Interpretable clustering via optimal trees
ID3: Iterative dichotomiser 3
NMI: Normalized mutual information
OCT: Optimal classification trees
PDF: Probability density function
SEP: Separation
SP: Split threshold

References

  1. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984. [Google Scholar]
  2. Quinlan, J.R. Discovering rules by induction from large collections of examples. In Expert Systems in the Micro Electronics Age; Edinburgh University Press: Edinburgh, UK, 1979. [Google Scholar]
  3. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Pub: Cambridge, MA, USA, 1993. [Google Scholar]
  4. Kass, G.V. An exploratory technique for investigating large quantities of categorical data. J. R. Stat. Soc. Ser. C Appl. Stat. 1980, 29, 119–127. [Google Scholar] [CrossRef]
  5. Laber, E.; Murtinho, L.; Oliveira, F. Shallow decision trees for explainable k-means clustering. Pattern Recognit. 2023, 137, 109239. [Google Scholar] [CrossRef]
  6. Tavallali, P.; Tavallali, P.; Singhal, M. K-means tree: An optimal clustering tree for unsupervised learning. J. Supercomput. 2021, 77, 5239–5266. [Google Scholar] [CrossRef]
  7. Blockeel, H.; De Raedt, L.; Ramon, J. Top-down induction of clustering trees. arXiv 2000, arXiv:cs/0011032. [Google Scholar]
  8. Basak, J.; Krishnapuram, R. Interpretable hierarchical clustering by constructing an unsupervised decision tree. IEEE Trans. Knowl. Data Eng. 2005, 17, 121–132. [Google Scholar] [CrossRef]
  9. Fraiman, R.; Ghattas, B.; Svarc, M. Interpretable clustering using unsupervised binary trees. Adv. Data Anal. Classif. 2013, 7, 125–145. [Google Scholar] [CrossRef]
  10. Liu, B.; Xia, Y.; Yu, P.S. Clustering through decision tree construction. In Proceedings of the Ninth International Conference on Information and Knowledge Management, McLean, VA, USA, 6–11 November 2000; pp. 20–29. [Google Scholar]
  11. Gabidolla, M.; Carreira-Perpiñán, M.Á. Optimal interpretable clustering using oblique decision trees. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; pp. 400–410. [Google Scholar]
  12. Heath, D.; Kasif, S.; Salzberg, S. Induction of oblique decision trees. IJCAI 1993, 1993, 1002–1007. [Google Scholar]
  13. Bertsimas, D.; Orfanoudaki, A.; Wiberg, H. Interpretable clustering: An optimization approach. Mach. Learn. 2021, 110, 89–138. [Google Scholar] [CrossRef]
  14. Bertsimas, D.; Dunn, J. Optimal classification trees. Mach. Learn. 2017, 106, 1039–1082. [Google Scholar] [CrossRef]
  15. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  16. Dunn, J.C. Well-separated clusters and optimal fuzzy partitions. J. Cybern. 1974, 4, 95–104. [Google Scholar] [CrossRef]
  17. Adolfsson, A.; Ackerman, M.; Brownstein, N.C. To cluster, or not to cluster: An analysis of clusterability methods. Pattern Recognit. 2019, 88, 13–26. [Google Scholar] [CrossRef]
  18. Hartigan, J.A.; Hartigan, P.M. The dip test of unimodality. Ann. Stat. 1985, 13, 70–84. [Google Scholar] [CrossRef]
  19. Chasani, P.; Likas, A. The UU-test for statistical modeling of unimodal data. Pattern Recognit. 2022, 122, 108272. [Google Scholar] [CrossRef]
  20. Kalogeratos, A.; Likas, A. Dip-means: An incremental clustering method for estimating the number of clusters. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 2393–2401. [Google Scholar]
  21. Maurus, S.; Plant, C. Skinny-dip: Clustering in a sea of noise. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1055–1064. [Google Scholar]
  22. Vardakas, G.; Kalogeratos, A.; Likas, A. UniForCE: The Unimodality Forest Method for Clustering and Estimation of the Number of Clusters. arXiv 2023, arXiv:2312.11323. [Google Scholar]
  23. Ultsch, A. Fundamental Clustering Problems Suite (Fcps); Technical Report; University of Marburg: Marburg, Germany, 2005. [Google Scholar]
  24. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu (accessed on 20 May 2024).
Figure 1. Histogram and ecdf plots of unimodal and multimodal univariate datasets. The p-values provided by the dip-test are also presented. (a) Unimodal dataset. (b) Borderline case of unimodal dataset (with two close peaks). (c) Multimodal dataset (with two peaks). (d) Multimodal dataset (with three peaks).
Figure 2. Histogram of a bimodal dataset along with its split threshold (star). (a) The split threshold was computed without utilizing the separation criterion. (b) The split threshold was computed taking into account the separation criterion.
Figure 3. Histogram plots of synthetic datasets along with the best split thresholds found (stars).
Figure 4. Stepwise partitioning of a 2-D dataset (X) into axis unimodal rectangular regions. (a) 2-D plot of the original dataset X, with histogram plots of each feature, the obtained split points, and the resulting 2-D plot illustrating X split (by sp_2) into two clusters (X_L, X_R). (b) 2-D plot of X_L, with histogram plots of each feature, the obtained split point, and the resulting 2-D plot illustrating X_L split (by sp_3) into two clusters (X_LL, X_LR). (c) 2-D plots of X_LL and X_LR, along with the unimodal histogram plots of each feature. (d) 2-D plot of X_R, along with the unimodal histogram plots of each feature. (e) Final 2-D plot of X, illustrating the final split points (sp_2, sp_3) that partition X into three axis unimodal clusters (X_LL, X_LR, X_R).
Figure 5. Binary decision tree constructed for the two-dimensional dataset of Figure 4a.
Figure 6. 2-D plots of (a) Synthetic I and (b) WingNut. The ground truth partition and the partitions obtained by DTAUC, ICOT and ExShallow are provided.
Table 1. Stepwise partitioning of the two-dimensional dataset of Figure 4a.
Set     Feature   sp*           q             Split (j*) or Save    Result
X       1 (M)     sp_1 = 9.29   q_1 = 0.47    Split X (j* = 2)      Sets X_L, X_R
        2 (M)     sp_2 = 4.14   q_2 = 1.96
X_L     1 (M)     sp_3 = 6.17   q_3 = 6.72    Split X_L (j* = 1)    Sets X_LL, X_LR
        2 (U)
X_LL    1 (U)                                 Save X_LL             C = {X_LL}
        2 (U)
X_LR    1 (U)                                 Save X_LR             C = {X_LL, X_LR}
        2 (U)
X_R     1 (U)                                 Save X_R              C = {X_LL, X_LR, X_R}
        2 (U)
Table 2. Parameters of synthetic and real datasets used in the experiments.
Dataset         n      d     k*
Synthetic
Synthetic I     750    3     4
Hepta           212    3     7
Lsun            400    2     3
Tetra           400    3     4
TwoDiamonds     800    2     2
WingNut         1016   2     2
Real
Boot Motor      94     3     3
Dermatology     366    33    6
Ecoli           327    7     5
Hist OldMaps    429    3     10
Image Seg.      210    19    7
Iris            150    4     3
Ruspini         75     2     4
Seeds           210    7     3
Table 3. Partition results on synthetic data: (i) the estimated number of clusters (k) and (ii) NMI values with respect to the ground-truth labels. The ground-truth number of clusters (k*) is also reported.

Dataset         k*   k (DTAUC)   k (ICOT)   NMI (DTAUC)   NMI (ICOT)   NMI (ExShallow)
Synthetic I     4    4           3          0.99          0.77         0.60
Hepta           7    7           4          0.95          0.74         1.00
Lsun            3    3           4          0.97          0.73         0.53
Tetra           4    4           4          0.94          1.00         1.00
Two Diamonds    2    2           2          1.00          1.00         1.00
WingNut         2    2           2          1.00          1.00         0.17
Table 4. Partition results on real data: (i) the estimated number of clusters (k) and (ii) NMI values with respect to the ground-truth labels. The ground-truth number of clusters (k*) is also reported.

Dataset         k*   k (DTAUC)   k (ICOT)   NMI (DTAUC)   NMI (ICOT)   NMI (ExShallow)
Boot Motor      3    3           3          1.00          1.00         0.99
Dermatology     6    28          2          0.55          0.44         0.83
Ecoli           5    3           2          0.61          0.01         0.54
Hist OldMaps    10   10          2          0.74          0.03         0.75
Image Seg.      7    7           2          0.69          0.01         0.60
Iris            3    2           2          0.73          0.73         0.81
Ruspini         4    5           4          0.89          1.00         1.00
Seeds           3    2           2          0.63          0.53         0.66

