Article

Optimal Clustering in Stable Instances Using Combinations of Exact and Noisy Ordinal Queries

Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
* Author to whom correspondence should be addressed.
Algorithms 2021, 14(2), 55; https://doi.org/10.3390/a14020055
Submission received: 11 January 2021 / Revised: 2 February 2021 / Accepted: 4 February 2021 / Published: 8 February 2021
(This article belongs to the Special Issue Graph Algorithms and Network Dynamics)

Abstract
This work studies clustering algorithms that operate with ordinal or comparison-based queries (operations), a situation that arises in many active-learning applications where “dissimilarities” between data points are evaluated by humans. Typically, exact answers are costly (or difficult to obtain in large amounts), while possibly erroneous answers have low cost. Motivated by these considerations, we study algorithms with non-trivial trade-offs between the number of exact (high-cost) operations and noisy (low-cost) operations, with provable performance guarantees. Specifically, we study a class of polynomial-time graph-based clustering algorithms (termed Single-Linkage) which are widely used in practice and which guarantee exact solutions for stable instances of several clustering problems (these problems are NP-hard in the worst case). We provide several variants of these algorithms using ordinal operations and, in particular, non-trivial trade-offs between the number of high-cost and low-cost operations that are used. Our algorithms still guarantee exact solutions for stable instances of k-medoids clustering, and they use a rather small number of high-cost operations, without increasing the low-cost operations too much.

1. Introduction

Clustering is a fundamental and widely studied problem in machine learning and in computational complexity as well. Intuitively, the goal is to partition a given set of points X (data set) into k clusters, each cluster corresponding to a group of “similar” data according to some underlying distance function d ( · ) . Though most of the clustering problems are NP-hard in the worst case, for real data, certain (relatively simple) heuristics seem to perform quite well. One theoretical justification to support this empirical observation is that “real” instances have some “nice” distribution, and some polynomial-time algorithms come with the guarantee that, on these inputs, they return optimal solutions. Along these lines, one successful concept is that of stability of the input, meaning that a “small” perturbation of the input does not change the optimal solution (though it may change its value).
Quite surprisingly, for a rather general class of clustering problems, the following very simple algorithm guarantees exact solutions in stable instances in polynomial time [1,2,3] (see also [4] for a nice introduction). Roughly speaking, this algorithm first computes a (minimum) spanning tree over the data pairwise distances or dissimilarities, and then removes a suitable subset of edges to obtain the optimal k-clustering. The latter step amounts to identifying the k − 1 edges whose removal partitions the spanning tree into k clusters (the nodes of the resulting forest) in order to minimize the underlying cost function of the clustering problem (the exact cost function depends on how we define the “center” of a cluster and on the distances of the points in the cluster to its center).
The implicit assumption is that the algorithm is provided with the correct pairwise dissimilarity/distance metric between all pairs of data points. In this work, we take a step further and consider the natural setting where we can only compare distances, and these comparisons are also noisy. Furthermore, by accessing some type of expensive (expert) oracle, we can make sure that certain comparisons are correctly reported, though these operations are inherently more expensive, and thus we would like to use as few of them as possible. In a nutshell, our work considers and combines the following three aspects that are often present in practical machine learning approaches:
  • Using only ordinal information;
  • Dealing with noisy data;
  • Allowing expensive operations to remove errors.
This situation arises, for example, in semi-active learning approaches where the pairwise dissimilarity between objects (data points) is evaluated by humans via simple comparison queries (see, e.g., [5,6,7] and references therein). These evaluations are inherently subject to errors. Moreover, very often, the task is to compare objects and their pairwise dissimilarities. Such queries can also be of varying difficulty and thus more or less informative/costly. It is therefore natural to ask the following questions:
Which guarantees are still possible under such a model?
What trade-offs between expensive and non-expensive (noisy) operations still allow for finding optimal solutions?

1.1. Our Contribution

In this work, we address these questions by (i) introducing a formal model and (ii) considering a class of clustering problems/algorithms in this model. Specifically, we consider k-medoids clustering, which applies to general dissimilarities of data (unlike, e.g., k-means), and for which the above “minimum-spanning-tree-based” optimal algorithm for stable instances can be implemented using (certain) ordinal queries. We detail our contributions and their relation to prior work in the following sections.

1.1.1. Our Model

This paper introduces a natural setting where we can only compare distances, and these comparisons are generally noisy; though, by using some sort of expensive (expert) oracle, we can make sure that certain comparisons are correctly reported. Our model (see Section 2 for formal definitions) captures the following aspects:
  • Using only ordinal information. Distances or dissimilarities can only be compared and not directly evaluated. One example of such queries is the following [5,6,7]:
      Which one between y and y′ is more dissimilar to x?
    Concretely, this means that we can compare d(x, y) with d(x, y′) for some dissimilarity function d(·). In our model, we allow for comparing arbitrary groups (sums) of pairwise distances.
  • Dealing with noisy results. Comparisons are in general noisy, and the resulting error is persistent. Due to measurement errors, a comparison may report the wrong answer with probability bounded by some error probability p < 1/2, and repeating the same measurement would lead to the same answer [7,8,9]. (In order to import some results from the literature, we shall typically assume p < 1/16. All our results automatically extend to larger p if these prior results in the literature also do.)
  • Allowing expensive operations to remove errors. Errors can be fixed by performing some kind of expensive comparison. In this case, we know the answer is correct, but we would like to limit the total number of such expensive operations (see e.g., [10] for a survey on various practical methods).
This setting falls into the dual-mode framework in [11], where several optimization problems can be optimally solved using either only low-cost operations or a suitable combination with few high-cost ones. This framework suggests evaluating the complexity of an algorithm by explicitly counting the low-cost and high-cost operations separately. Without errors, or ignoring that exact operations have a high cost, the problem falls into the class of problems that can be solved with ordinal operations only. Without the aid of high-cost operations, that is, using only noisy comparisons, the problem has recently been studied in the context of active learning under various query and error models (see, e.g., [12,13] and below for further discussion).

1.1.2. Algorithms and Bounds for k-Medoids

We provide new polynomial-time algorithms for k-medoids clustering. These algorithms achieve the following trade-offs between the number of high-cost and low-cost operations to compute an optimal k-clustering in stable instances (with stability parameter γ ≥ 2—see Definition 1):
  • In Section 3, we investigate variants of the popular Single-Linkage algorithm and its enhanced version Single-Linkage++ analyzed in [1,2,3]. (Following [4], we call Single-Linkage the simpler algorithm, which is often used in practice, and Single-Linkage++ the enhanced version with provable guarantees [1,2,3].) This algorithm consists of two distinct phases (computing a minimum spanning tree and removing a suitable set of edges to obtain a clustering). A naive implementation, using only high-cost comparisons, would require O(n²) such operations for the first phase and O(n·log n) for the second one. The trade-offs are summarized in Table 1, where we also consider a naive (simpler) version of the algorithm with no approximation guarantee (this serves to illustrate some key ideas and develop some necessary tools). All other variants are either exact or guarantee a 2-approximation in the worst case. At this point, some comments are in order:
    The overall algorithm consists of a combination of Phase 1 and Phase 2, and thus the overall complexity is given by the sum of these two. The total number of operations (also accounting for internal computations) is comparable with the sum of the high-cost and low-cost operations.
    Phase 1 improves over the naive implementation (Remark 3) under some additional assumptions on the data (in particular, a larger stability parameter γ > 2 helps, as well as a small ratio d_max(X)/d_min(X) between the largest and the smallest distance between two points).
    The naive algorithm (Single-Linkage) assumes that Phase 1 has been already solved and implements a simpler Phase 2 (the complexity bounds refer only to the latter). This algorithm is in fact a widely used heuristic with additional theoretical guarantees for hierarchical clustering (see, e.g., [14,15]).
    Phase 2 can be implemented using very few high-cost operations if we content ourselves with a 2-approximate solution. Exact solutions use a larger number of high-cost operations. Though the dynamic programming (DP) approach has a better theoretical performance for large k, the other algorithm is much simpler to implement (for example, it does not require the O(k²·n⁴·log n) memory used to store the DP table).
    We remark that the best combination between Phase 1 and Phase 2 depends on k and, in some cases, on additional properties of the input data X.
  • In Section 4, we show that, under additional assumptions on the input data, and a slightly more powerful comparison model, it is possible to implement exact or approximate same-cluster queries (see Section 4.1 and Lemma 10 therein).
  • Since same-cluster queries may require some additional power, in Section 4.2, we provide algorithms which use few same-cluster queries in combination with our original (low-cost and high-cost) comparison operations. The obtained bounds are summarized in Theorem 5 and Theorem 6. Intuitively speaking, the ability to perform “few” exact same-cluster queries allows us to reduce the number of high-cost operations significantly, at least for certain instances:
    When the optimal solution has approximately balanced clusters, O(k·log k) same-cluster queries are enough, and the additional high-cost comparisons are O(log² n). Both bounds scale with the cluster “unbalance” n/n_1, where n_1 is the smallest cluster size (slightly better bounds for the number of same-cluster queries hold).
    The additional assumption on the input data is only required in order to implement these few same-cluster queries directly in our model. The result still applies in general if these same-cluster queries can be implemented in a different—perhaps quite expensive—fashion.
    The aforementioned condition used to simulate same-cluster queries essentially requires that, in the optimal solution, points in the same cluster have a distance of at most γ − 1 times the minimum distance. For larger γ, this condition becomes less stringent, though of course we then require a larger stability coefficient.
All the above-mentioned algorithms (and results) hold with high probability, that is, with probability at least 1 − n^{−c} for some constant c > 0 (we actually prove a success probability of at least 1 − 3/n or larger in all cases), where the probability is over the outcomes of the distance comparisons (errors).

1.1.3. Techniques and Relation to Prior Work

This work focuses on the so-called k-medoids clustering problem, where the center of each cluster (its medoid) must be a point of the cluster (and thus of the data set) [16,17,18,19]. This is slightly different from k-means, where the centroid of a cluster may not be an element of the cluster/data set. It is well known that k-medoids clustering has several advantages. First, it can be applied to any dissimilarity/distance function between data points, while k-means requires the points to be embedded in some Euclidean space. Moreover, it is well known that k-medoids is more robust to outliers (see, e.g., [20]). Despite k-medoids being NP-hard, it has been recently shown that, for stable instances, a relatively simple algorithm (Single-Linkage++) solves the problem exactly in polynomial time [1,2,3] (see also [4] for a nice introduction).
The algorithms implementing Single-Linkage++ (Table 1) are based on two algorithmic results regarding, respectively, approximate sorting with noisy comparisons [21] and approximate matroid maximization with high-cost and low-cost operations [11]. The dynamic programming implementation of Phase 2 is instead an adaptation of the algorithm in [3] to our comparison-based model. Their algorithmic result is in fact more general, and it applies to the class of center-based clustering problems (intuitively, those where the solution and its cost are uniquely determined by a set of centers—in our case, the medoids).
Without allowing high-cost operations, that is, using only noisy comparisons, the problem has been studied in the context of (semi-supervised) active learning, which involves (noisy) comparison queries of varying “complexity”. Specifically, Refs. [12,13] consider comparisons between pairs of classifiers in some fixed hypothesis class (a noisy comparison between two candidate classifiers h and h* consists of comparing the losses of these classifiers on a small labelled data set). In [22], the authors consider queries of the form “f(x) ≤ f(y)” for specific functions f(·). Several works consider ordinal queries involving distances between (few) points: “triplet” queries are considered in [5,7], with queries of the form “d(x, y) ≤ d(x, z)” (is x more similar to y or to z?), and in [23], with queries returning the outlier among three points, “d(x, y) ≤ min{d(x, z), d(y, z)}”. In [14], “quadruple” queries of the form “d(x, y) ≤ d(z, w)” are used to simulate more complex queries (e.g., implementing the naive Single-Linkage algorithm in a noise-free setting). In [8,9], queries involving some “scalar/multiplicative” factor α ≥ 1—similar to what we use to simulate same-cluster queries—are used; their queries are of the form “α·d(x, z) ≤ d(y, z)”, and the answer is correct if the inequality holds, but the oracle may not answer whenever these distances are “close”, i.e., d(x, z) ≤ d(y, z) < α·d(x, z); [9] considers the variant in which the answer is adversarially wrong in this case.
It is widely believed that same-cluster queries may be difficult to implement in practice, though they are very powerful. Algorithms based on same-cluster queries, both exact and erroneous, have been largely studied. The error-free/exact case is closely related to our “all at high cost” implementations (assuming one same-cluster query costs as much as one exact call in our model). Specifically, Ref. [24] considers k-means and provides an algorithm using O(k²·log k + k·log n) same-cluster queries, while Ref. [25] uses O(k^14·log k·log n/ϵ^6) same-cluster queries for computing a (1 + ϵ)-approximate correlation clustering; Ref. [26] provides an exact algorithm using 2·C_OPT same-cluster queries, where C_OPT denotes the number of “disagreements” in the optimal solution. Same-cluster queries can also solve non-center-based clustering problems [27], where the corresponding algorithm uses O(k³·log k·log n) queries, with the hidden constant depending exponentially on the dimensionality of the clusters. Regarding noisy same-cluster queries, Ref. [28] uses O(n·k·log n/(1 − 2p)²) same-cluster queries to reconstruct the “latent” clusters. The closest result to ours is probably [29], proving that γ-perturbation-stable instances with γ ≥ 3 can be solved using O(n·log² n) noisy same-cluster queries, and with O(n) queries in the noise-free case. Their result applies to a rather general class of center-based clustering problems (including ours). On the one hand, our algorithms use fewer low-cost noisy comparisons, namely O(n·log n), though for a restricted class of inputs; for balanced clusters, the same-cluster queries are O(k·log k), though we use exact queries. On the other hand, in some cases, same-cluster queries may be harder or more costly than comparisons of distances, and thus the costs may be incomparable in general. Finally, our algorithm using few same-cluster queries uses ideas similar to those (coupon collector and its double Dixie cup extension) in [30] for k-means instances that satisfy a γ-margin property (similar to γ-perturbation-stability, though not equivalent).

2. Model and Preliminary Definitions

An instance is a pair (X, d), where X is a dataset of n = |X| points whose pairwise distances are specified by a non-negative function d : X × X → R⁺ satisfying symmetry and the triangle inequality: d(x, x) = 0, d(x, y) = d(y, x), and d(x, y) ≤ d(x, z) + d(z, y) for any three points x, y, z ∈ X. The distance function extends naturally to sets of pairs e = (x, y), i.e., to subsets of edges. Specifically, for E = X × X being the set of all edges, and for any subset S ⊆ E of pairwise distances, we let
d(S) := ∑_{e ∈ S} d(e).   (1)

2.1. Stable Instances

For a given metric space (X, d) as above and a positive integer k, a clustering is a partition C = {C_1, …, C_k} of X. The cost of a cluster C_i with respect to a point x ∈ X is defined as
Cost(C_i, x) := ∑_{y ∈ C_i} d(y, x).
The medoid (or centroid) of each cluster C_i is the point in that cluster minimizing this cost, i.e., c_i := argmin_{x ∈ C_i} Cost(C_i, x), and the cost of a cluster is simply Cost(C_i) := Cost(C_i, c_i). Then, the cost of the clustering is the sum of the costs of the individual clusters,
Cost(C) := ∑_{i=1}^{k} Cost(C_i).
A clustering that minimizes this cost is called optimal k-clustering. (In the literature, this is sometimes called k-medoid. Since in this work we use centroid and medoid interchangeably, we simply use the term k-centroid.)
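As a concrete illustration of these definitions, the following minimal Python sketch computes medoids and the k-medoids cost. The function names and the dissimilarity callable d are our own conventions and are not part of the formal model.

```python
# Minimal sketch of the k-medoids cost definitions above.
# `d` is assumed to be a symmetric, non-negative dissimilarity function d(x, y).

def cluster_cost(cluster, x, d):
    """Cost(C_i, x): sum of dissimilarities from every point of the cluster to x."""
    return sum(d(y, x) for y in cluster)

def medoid(cluster, d):
    """The medoid c_i: the point of the cluster minimizing Cost(C_i, x)."""
    return min(cluster, key=lambda x: cluster_cost(cluster, x, d))

def clustering_cost(clusters, d):
    """Cost(C): sum over all clusters of the cost w.r.t. their own medoid."""
    return sum(cluster_cost(C, medoid(C, d), d) for C in clusters)
```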
Definition 1
(γ-perturbation stability). A γ-perturbation of a metric space (X, d), for γ ≥ 1, is another metric space (X, d′) such that, for all x, y ∈ X,
(1/γ)·d(x, y) ≤ d′(x, y) ≤ d(x, y).
For a given positive integer k, a metric space (X, d) is γ-perturbation-stable if there is a k-clustering C_1*, …, C_k* which is the unique optimal k-clustering in every γ-perturbation of (X, d).
Remark 1.
Observe first that the above definition requires the perturbations to be metric spaces as well. Moreover, if a metric space is γ-perturbation-stable, then it has a unique optimal solution (regardless of the value of γ ≥ 1).
For γ ≥ 2, there exists an exact polynomial-time algorithm for the k-clustering problem [3] (see also [2] for algorithms for γ ≥ 3). The algorithm exploits the following key property of such stable instances. Intuitively, in γ-perturbation-stable instances, in the optimal clustering, every point is “much closer” to the centroid of its own cluster than to any other centroid:
Lemma 1
(γ-center proximity [2,3]). Let (X, d) be γ-perturbation-stable and let C_1*, …, C_k* be its (unique) optimal solution. Then, for every x ∈ X, with x ∈ C_i*, and every j ≠ i, we have that
d(x, c_j*) > γ·d(x, c_i*),
where c_i*, c_j* are the respective centroids of C_i*, C_j*.

2.2. Comparisons and Errors

We consider the scenario in which the distances d ( x , y ) between points, as well as those relative to subsets (1), are not directly measurable. Distances can only be compared either using a cheap but erroneous operation or an expensive but always correct operation. Specifically, we let O H ( · ) denote the expensive and reliable operation (oracle) defined as
O_H(E_1, E_2) = +1 if d(E_1) > d(E_2), and −1 otherwise,
for any two subsets E_1, E_2 ⊆ E. The cheap but erroneous operation (oracle) is denoted by O_L(·), and its answers are wrong with probability at most p, independently across all pairs E_1 and E_2. That is, for any two subsets E_1, E_2 ⊆ E as above,
Pr[O_L(E_1, E_2) ≠ O_H(E_1, E_2)] ≤ p,
and these errors are persistent, that is, repeating the same comparison operation O L ( E 1 , E 2 ) multiple times always gives the same result (either always wrong or always correct). We shall sometimes assume that p < 1 / 16 as in prior works [21] in order to apply these results (though our approach/results are generic, in the sense that they can be parameterized by the dislocation of sorting with a generic p or even a generic error model).
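For intuition only, the toy Python sketch below simulates the two oracles under the stated assumptions: O_H compares exact sums of distances, while O_L flips the answer with probability at most p and caches it, so that repeating the same comparison returns the same (persistent) answer. The class and method names are ours, not part of the model.

```python
import random

class Oracles:
    """Toy simulation of the exact oracle O_H and the persistent noisy oracle O_L."""

    def __init__(self, d, p=0.05, seed=0):
        self.d = d                    # exact dissimilarity function (available to the simulator only)
        self.p = p                    # error probability of the low-cost oracle
        self.rng = random.Random(seed)
        self.cache = {}               # persistent answers of O_L

    def _weight(self, edges):
        return sum(self.d(x, y) for (x, y) in edges)

    def O_H(self, E1, E2):
        """Expensive, always correct: +1 if d(E1) > d(E2), -1 otherwise."""
        return +1 if self._weight(E1) > self._weight(E2) else -1

    def O_L(self, E1, E2):
        """Cheap, noisy, persistent: wrong with probability at most p, same answer on repetition."""
        key = (frozenset(E1), frozenset(E2))
        if key not in self.cache:
            ans = self.O_H(E1, E2)
            if self.rng.random() < self.p:
                ans = -ans            # persistent error on this particular pair of edge sets
            self.cache[key] = ans
        return self.cache[key]
```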
Remark 2.
A weaker model would be to allow only comparisons between the weights/distances of two single edges, analogously to pairwise comparisons in noisy sorting [11]. Unfortunately, this model seems too weak for the k-medoids clustering problem, since it is not possible to directly compute the optimal centroid of a cluster of more than four nodes. For this reason, we adopt the model that allows for comparing two sets of edges.
Observation 1.
One might be tempted to simulate repeated comparisons between two distances, d(x, y) and d(x′, y′), via various comparisons of subsets, e.g., d({(x, y), (a, b)}) and d({(x′, y′), (a, b)}). Though this is in principle possible, since comparisons are persistent, whenever the underlying algorithm queries the same subset, the answer is the same as that in our “simulated repeated comparison”. This happens, for instance, if the algorithm during its execution needs to compare the following two candidate clustering solutions:
C_1 = {a, b}, C_2 = {x, y}, C_3 = {x′} and C_4 = {y′};   C_1′ = {a, b}, C_2′ = {x′, y′}, C_3′ = {x} and C_4′ = {y}.
In this work, we deliberately choose to not attempt any simulation of “repeated” comparison because of this issue. This has the additional advantage that some of our results might be in principle applicable to different error models where costs might be dependent on the sets, or the error probabilities might depend on the distance values involved.

2.3. Performance Evaluation

We evaluate the performance of our algorithms by distinguishing between the two types of queries we make: if h(n) is the total number of queries to O_H and l(n) the total number of queries to O_L, where n = |X|, we say that the algorithm has ⟨h(n), l(n)⟩ high-low cost complexity [11]. Furthermore, we use the standard notation O(·) and write O⟨h(n), l(n)⟩ to denote ⟨O(h(n)), O(l(n))⟩, while with O(t(n)) we denote only the total number of high-cost operations (this corresponds to the usual complexity notation where all operations are error free).

2.4. Two Algorithmic Tools

We will use two key results related to our error model, namely, sorting with only erroneous comparisons, and approximate solutions for matroids, respectively.
Lemma 2
(approximate sorting [21]). Given a sequence S of n elements, and a comparison query that flips the answer with a small probability p < 1 / 16 , there exists an algorithm with the following performance guarantee. For any confidence parameter Δ > 0 ,
  • The algorithm uses O ( n log n ) low-cost queries only (and no high-cost query). Each query compares a pair of elements, and these low-cost queries have an error probability p < 1 / 16 . These comparison errors are persistent.
  • The algorithm returns an almost sorted sequence Ŝ, where the dislocation of each element is at most O(Δ·log n) with probability at least 1 − 1/n^Δ (the probability depends only on the outcomes of the comparisons).
The dislocation of an element x in a sequence Ŝ is the absolute difference between the position of x in Ŝ and the position of x in the correctly sorted sequence S.
Lemma 3
(approximate matroid [11]). Given a matroid (M, F) and two high-low cost oracles to compare the elements of M, it is possible to find a (1 + ϵ)-approximation of the optimal base using O⟨(1/ϵ)·(log n)², n·log n⟩ high-low queries.

3. Clustering in Stable Instances

In order to describe the algorithms, it is convenient to think of the input (metric space) as the following (complete) induced weighted graph:
Definition 2
(induced graph, spanning forests, clustering). Given a metric space (X, d), the induced graph is the complete weighted undirected graph G_X = (X, E_X, w), where E_X is the set of all pairs of points and, for every edge e = (x, y), we have w(e) = d(x, y). For any tree T spanning all nodes X, and for any subset of edges F = T \ K obtained by removing a set K of edges from T, let us denote by C(F) the connected components (node sets) of the forest F. These connected components form a k-clustering whenever we remove k − 1 edges from the spanning tree T.
The known optimal polynomial time algorithms for stable clustering are based on the above tree/forest construction. The simplest of such algorithms is the following one.
[Pseudocode figure: the Single-Linkage algorithm.]
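The pseudocode figure is not reproduced here. The following sketch is our reconstruction of the naive Single-Linkage from the description in the text: Phase 1 builds a minimum spanning tree of the induced graph and Phase 2 deletes the k − 1 heaviest tree edges. For readability it uses numeric weights d(x, y) directly; in the paper's model every comparison would instead go through the oracles O_H/O_L. The helper names are ours.

```python
import itertools

def _mst_edges(points, d):
    """Kruskal's algorithm on the complete induced graph (union-find based)."""
    parent = {x: x for x in points}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for x, y in sorted(itertools.combinations(points, 2), key=lambda e: d(*e)):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[rx] = ry
            mst.append((x, y))
    return mst

def _components(points, edges):
    """Connected components (as sets of points) of the forest given by `edges`."""
    parent = {x: x for x in points}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in edges:
        parent[find(x)] = find(y)
    comps = {}
    for x in points:
        comps.setdefault(find(x), set()).add(x)
    return list(comps.values())

def single_linkage(points, d, k):
    """Phase 1: compute an MST; Phase 2: delete the k-1 heaviest MST edges."""
    mst = _mst_edges(points, d)
    kept = sorted(mst, key=lambda e: d(*e))[: len(mst) - (k - 1)]
    return _components(points, kept)
```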
It is well-known that this naive algorithm does not achieve any (bounded) approximation guarantee even in our γ-perturbation-stable metric instances [2] (see also [15]). The reason is that the criterion for choosing the removed edges K does not directly take into account the cost of the resulting clustering. The following algorithm indeed finds the optimum in several stable clustering problems [1,2,3].
[Pseudocode figure: the Single-Linkage++ algorithm.]
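Again as a hedged reconstruction rather than the authors' exact pseudocode: Single-Linkage++ keeps Phase 1 unchanged and, in Phase 2, tries every subset of k − 1 MST edges, keeping the one whose induced clustering has minimum k-medoids cost. The sketch reuses the helpers _mst_edges, _components and clustering_cost from the sketches above.

```python
import itertools

def single_linkage_pp(points, d, k):
    """Phase 1: MST; Phase 2: exhaustively pick the k-1 removed edges minimizing the cost."""
    mst = _mst_edges(points, d)
    best_cost, best_clusters = float("inf"), None
    for removed in itertools.combinations(mst, k - 1):
        kept = [e for e in mst if e not in removed]
        clusters = _components(points, kept)        # clusters induced by this removal
        cost = clustering_cost(clusters, d)         # k-medoids cost (earlier sketch)
        if cost < best_cost:
            best_cost, best_clusters = cost, clusters
    return best_clusters
```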
Theorem 1
([3]). The Single-Linkage++ algorithm finds the unique optimal k-clustering in every γ-perturbation-stable instance, with γ ≥ 2.
Both algorithms above consist of two phases, each of them corresponding to solving a matroid problem exactly. Unfortunately, in order to use a small number of high-cost operations (Lemma 3), we have to content ourselves with approximate solutions.

3.1. Warm-Up: Phase 2 of Single-Linkage Algorithm

In order to describe some difficulties and to convey some basic ideas, we shall first consider implementing Phase 2 of the Single-Linkage algorithm only. We thus assume that the (exact) Minimum Spanning Tree (MST) has already been computed (or is given). Observe that Phase 2 of the Single-Linkage algorithm boils down to the problem of selecting the top k − 1 elements from a set of n − 1 edges of the MST, thus a simple matroid. We first observe the following three facts on this task (details below):
  • There is a naive approach using O(k + (n − k)·log k) = O(n·log k) high-cost operations in total (and no low-cost operation).
  • Approximate sorting (Lemma 2) can reduce the total number of high-cost operations to O(k + d·log k), where d = O(log n).
  • Approximating the matroid (Lemma 3) would directly improve the above bound further for some values of k. Unfortunately, this leads to a solution which is far more costly than the one returned by the algorithm with exact comparisons (or with the previous methods).
As we discuss below, though Single-Linkage has no approximation guarantee, we show that, in some stable instances where it would find an optimal solution, the matroid approximation makes it compute a solution of much larger cost. This suggests that Phase 2 of this algorithm (removing the k − 1 edges) is a “fragile” part of the algorithm (also for the more complex Single-Linkage++). Instead, we show in the next section that the first step (computing the MST) can be done approximately without “destroying” the clustering optimality.
In the remainder of this section, we discuss briefly some details on the three items above.

3.1.1. Naive Approach (all at High Cost)

We create a Min Heap with the first k − 1 elements of E_T, and then iterate over the remaining ones: every time an element is greater than the root, we replace the root with this element and re-balance the heap. Finally, we return all the elements in the Min Heap. This strategy takes O(k + (n − k)·log k) high-cost operations.
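A minimal sketch of this selection strategy is given below; oracle_less(a, b) stands for one high-cost comparison O_H deciding whether element a is lighter than b, and the wrapper class is our own device to route heapq comparisons through it.

```python
import heapq

def top_heaviest(elements, oracle_less, k_minus_1):
    """Return the k-1 heaviest elements, comparing only through `oracle_less`."""

    class Keyed:
        __slots__ = ("item",)
        def __init__(self, item):
            self.item = item
        def __lt__(self, other):            # heapq only needs "<"; each call is one exact query
            return oracle_less(self.item, other.item)

    heap = [Keyed(x) for x in elements[:k_minus_1]]
    heapq.heapify(heap)                     # min-heap of the current k-1 candidates
    for x in elements[k_minus_1:]:
        cand = Keyed(x)
        if heap and heap[0] < cand:         # new element heavier than the lightest kept one
            heapq.heapreplace(heap, cand)   # replace the root and re-balance
    return [w.item for w in heap]
```

Each call to oracle_less corresponds to one O_H query, so the counter of such calls matches the O(k + (n − k)·log k) bound stated above.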

3.1.2. First Improvement (Combining Low and High Cost Operations)

Using Lemma 2, we can sort the elements in an array using only O(n·log n) low-cost operations, and obtain a sequence with maximum dislocation d = O(log n). Having sorted the array in this way (say, in non-increasing order), we can restrict ourselves to the first m = k − 1 + d elements (since we know that the k − 1 heaviest elements lie in this prefix). By applying the previous Min Heap strategy to these m elements, we can find the true k − 1 heaviest edges. This strategy leads to the following slightly more general result, which we shall use below as a subroutine:
Lemma 4
(find top-t elements). Given any set of n elements, and any confidence parameter Δ, with probability at least 1 − 1/n^Δ, we can extract the top-t (largest or smallest) elements using O⟨t + Δ·log t·log n, n·log n⟩ high-low cost operations.
Proof. 
According to Lemma 2, with probability at least 1 − 1/n^Δ, we can obtain a sequence where every element has dislocation at most d = O(Δ·log n) using O(n·log n) low-cost operations only. To find the t largest elements, we consider the first m = t + d elements and apply the Min Heap strategy to these only. This takes O(t + (m − t)·log t) = O(t + d·log t) high-cost operations, as observed above. The overall cost is thus
O⟨0, n·log n⟩ + O⟨t + d·log t, 0⟩ = O⟨t + Δ·log t·log n, n·log n⟩. □
The previous lemma yields the following result.
Theorem 2.
Phase 2 of the Single-Linkage algorithm can be implemented using
O⟨k + log k·log n, n·log n⟩
high-low cost operations, and its success probability is at least 1 − 1/n.
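The following sketch summarizes the strategy behind Lemma 4 and Theorem 2 under stated assumptions: noisy_sort is a black box standing in for the approximate sorting of Lemma 2 (low-cost comparisons only, heaviest elements first up to the given dislocation), and top_heaviest is the exact heap selection sketched earlier. The function name and parameters are ours.

```python
def top_t_with_few_exact_queries(elements, noisy_sort, oracle_less, t, dislocation):
    # Phase A: approximately sort with O(n log n) low-cost (noisy) comparisons.
    almost_sorted = noisy_sort(elements)          # heaviest first, up to dislocation d
    # Phase B: the true top-t elements lie in the first t + d positions, so only this
    # short prefix is examined with exact (high-cost) comparisons.
    prefix = almost_sorted[: t + dislocation]
    return top_heaviest(prefix, oracle_less, t)
```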

3.1.3. Matroid Approximations Fail

We can use matroids in order to find the best forest F from which we can generate the k-clustering. In particular, we can consider the matroid M = (E_T, I), where E_T is the set of all the edges in the MST T and I = {S ∈ 2^{E_T} : |S| ≤ n − k} is the family of forests that can be obtained by removing (at least) k − 1 edges from T. Then, we have that the best forest F is simply the minimum-weight base B of the matroid M.
We could then apply Lemma 3 and obtain a (1 + ϵ)-approximation of the base using O⟨(log n)²/ϵ, n·log n⟩ high-low cost complexity. Note that this would improve significantly upon the previous bound for k ≥ (log n)². Unfortunately, the next example shows that this approach leads to a solution whose cost is unbounded compared to the solution returned by the algorithm.
Example 1
(unbounded error for the approximate matroid strategy). Let us consider an instance X consisting of k groups, G_1, …, G_k, of m points each, and one additional point v. The distance function d has the following values, for an arbitrarily small ϵ > 0 and some very large L:
  • Distances inside the same group are 0, i.e., for any G_i and any two x, y ∈ G_i, we have d(x, y) = 0.
  • Distances between groups are 1 + ϵ, i.e., for any two different groups G_i and G_j, and for x ∈ G_i and y ∈ G_j, we have d(x, y) = 1 + ϵ.
  • Point v is at distance 1 from points in G 1 and distance L from all other points, i.e.,
    d(x, v) = 1 if x ∈ G_1, and L otherwise.
The corresponding minimum spanning tree is composed of k spanning trees inside the groups (each with cost 0), a spanning tree connecting the groups composed of exactly k − 1 edges, each with cost 1 + ϵ, and one single edge from G_1 to v. Now, if we use a matroid to find a minimum spanning forest, we will find a correct minimum base B* in which we remove all the k − 1 edges with cost 1 + ϵ. This base has weight 1, and the cost of the corresponding clustering is 1. Now, suppose that we run the (1 + ϵ)-approximation algorithm in order to find a minimum base; this gives us a base B̂ where we remove from the minimum spanning tree k − 2 edges of cost 1 + ϵ and the edge with cost 1. This is possible since
d(B̂) = 1 + ϵ ≤ (1 + ϵ)·d(B*).
However, the cost of the corresponding cluster is at least m, since we put two groups in the same cluster.
The above example shows that, even with an arbitrarily small approximation factor, there are cases where the error in the final solution (clustering) is unbounded.

3.2. Phase 1: Compute a “Good” Spanning Tree

In the previous section, we assumed that we are given an MST T. This task can be solved with a standard MST algorithm using O(n²) high-cost operations. A natural question is thus whether there is a strategy that uses o(n²) high-cost operations. This seems difficult if we insist on computing an MST, since any edge in the MST can potentially appear in every position of the sorted sequence of edges.
A key observation is that we do not have to necessarily compute a MST, but we only need a spanning tree from which we can recover the optimal solution removing k 1 edges. This is captured by the following definition.
Definition 3
(k-spanning tree). Let (X, d) be a metric space and let C_1*, …, C_k* be an optimal k-clustering. A tree T of this metric space is k-spanning if every subtree T[C_i*] induced by C_i* is connected.
The definition essentially says that, by removing k − 1 edges from a k-spanning tree, we can obtain a forest whose connected components correspond to the optimal solution. As is implicit in the proof of [3] (see also [4]), in order to successfully compute the optimum, it suffices to have a k-spanning tree (and not necessarily an MST).
Theorem 3
(due to [3]). The modification of the Single-Linkage++ algorithm where, in Phase 1, we compute a k-spanning tree finds the unique optimal k-clustering in every γ-perturbation-stable instance, with γ ≥ 2.
We thus consider how to compute a k-spanning tree, instead of an exact MST. We first show the following key corollary of Lemma 1.
Corollary 1.
For any γ-perturbation-stable metric space (X, d), with γ ≥ 2, and with optimal solution C_1*, …, C_k*, the following holds. If x ∈ C_i* and y ∉ C_i*, for some cluster C_i*, then it must hold that
d(x, y) > (γ − 1)·d(x, c_i*)
where c i * is the centroid of C i * .
Proof. 
Since y ∈ C_j* for some j different from i, we have
d(x, y) ≥ d(x, c_j*) − d(y, c_j*) > γ·d(x, c_i*) − d(y, c_i*)/γ ≥ γ·d(x, c_i*) − (d(x, y) + d(x, c_i*))/γ,
where the first and last inequalities are due to the triangle inequality, and the second inequality follows from Lemma 1 (applied to both x and y). Then, by rearranging the terms, we obtain
(1 + 1/γ)·d(x, y) > (γ − 1/γ)·d(x, c_i*),
which implies the corollary after some simplification. □
Now, we are ready to prove a sufficient condition for a tree to be k-spanning, which we shall use below to derive an algorithm with a good performance guarantee.
Lemma 5
(sufficient condition for k-spanning tree). Consider a γ-perturbation-stable metric space (X, d), with γ > 2. Any (1 + ϵ)-approximation T̂ of the MST T, with ϵ ≤ (γ − 2)·w/(n − 1) and w := d_min(X)/d_max(X), is a k-spanning tree.
Proof. 
By contradiction, suppose T̂ is not a k-spanning tree. We show that it cannot be a (1 + ϵ)-approximation of the MST T. Since T̂ is not k-spanning, there is an optimal cluster C such that T̂[C] is not connected. Let C′ be a connected component of T̂[C] that does not contain the centroid c of C. Let x ∈ C′ be a node that is connected in T̂ with some node y ∉ C. Consider the tree
T̄ := (T̂ \ {(x, y)}) ∪ {(x, c)}
and observe that, for d := d(x, c), we have
d(T̄) = d(T̂) − d(x, y) + d(x, c) < d(T̂) − (γ − 1)·d(x, c) + d(x, c) = d(T̂) − (γ − 2)·d,
where the inequality is due to Corollary 1. Since d(T) ≤ d(T̄), we have that
d(T̂)/d(T) ≥ d(T̂)/d(T̄) > d(T̂)/(d(T̂) − (γ − 2)·d) = 1 + (γ − 2)·d/(d(T̂) − (γ − 2)·d) ≥ 1 + (γ − 2)·d/d(T̂) ≥ 1 + (γ − 2)·d_min(X)/((n − 1)·d_max(X)),
where the last inequality follows from d ≥ d_min(X) and d(T̂) ≤ (n − 1)·d_max(X). This means that, for ϵ := (γ − 2)·d_min(X)/((n − 1)·d_max(X)), our tree T̂ is not a (1 + ϵ)-approximation of the MST T, which contradicts the hypothesis of this lemma. □
By combining Lemma 5 with Lemma 3 (recall that the MST problem is a matroid), we obtain the following result.
Corollary 2.
For any γ-perturbation-stable metric space (X, d), with γ > 2, a k-spanning tree can be computed using O⟨n·log² n/((γ − 2)·w), n·log n⟩ high-low cost operations, with w = d_min(X)/d_max(X).
Proof. 
By Lemma 5, it is enough to compute a (1 + ϵ)-approximate MST with ϵ = (γ − 2)·w/(n − 1). Since the MST problem is a matroid problem, Lemma 3 implies the result. □
Remark 3.
Note that the result above depends on the values of w and γ. This result provides an alternative method for computing a k-spanning tree, which, in some cases, can be more efficient than using O(n²) high-cost operations for computing an MST. In particular, the bound in Corollary 2 is better for n·log² n/((γ − 2)·w) ≤ n², that is, for d_max(X)/d_min(X) ≤ (γ − 2)·n/log² n.
We conclude this section by observing that, in a sense, the above approach cannot be easily improved. In particular, it would be desirable to extend Corollary 2 to some small but constant ϵ > 0, which would improve the bound and would not require any knowledge about w and γ. Unfortunately, the following lemma provides a negative answer. Intuitively, this is because the cost of the MST and the cost of the optimal clustering may be very different.
Lemma 6
(limitations of approximate MST). For every ϵ > 0, there exists a metric space (X, d) and a spanning tree T̂ which is a (1 + ϵ)-approximation of the MST T and yet is not k-spanning. In particular, for every subset K of k − 1 edges of T̂, the corresponding clustering has an unbounded error.
Proof. 
Letting ϵ > 0 be arbitrary, we consider the following set of points located on the 1-dimensional Euclidean line:
X = {0, 1, L_2 + 1, L_2 + 2, −L_1, −L_1 − 1},
where L_1 is an arbitrary positive number and L_2 = L_1/ϵ. We take d as the Euclidean distance (i.e., the absolute difference between these numbers) and consider k = 3 clusters.
Note that the cost of the MST T is d(T) = 3 + L_1 + L_2 and the cost of the optimal clustering C is d(C) = 3. Now, consider the tree T̂ obtained by replacing in T the edge (−L_1, −L_1 − 1) with the edge (0, −L_1 − 1). The cost of this new tree T̂ is
d(T̂) = 3 + 2·L_1 + L_2 = 3 + L_1 + L_2 + ϵ·L_2 ≤ (1 + ϵ)·d(T),
and we can see that T̂ is not k-spanning since T̂[{−L_1, −L_1 − 1}] is not connected. Since T̂ contains three edges of cost L_1, L_1 + 1 and L_2 ≥ L_1, removing any two edges from T̂ yields a clustering of cost d(Ĉ) ≥ L_1. Therefore, the approximation guarantee is at least d(Ĉ)/d(C) ≥ L_1/3, which can be made arbitrarily large by increasing L_1 in this construction. □

3.3. Phase 2 of Single-Linkage++ (Removing the Edges)

In this section, we focus on Phase 2 of the algorithm Single-Linkage++, namely, computing the subset K* of k − 1 edges whose removal (from the tree T computed in Phase 1) yields the (unique) optimal clustering. This phase can also be seen as a matroid problem, though it is more complex than that of the (naive) Single-Linkage algorithm because of these two issues:
  • In order to evaluate the cost of a candidate solution (set K of edges to remove), we need to compute the centroids of each cluster;
  • The number of elements in the associated matroid is N = (n−1 choose k−1), though we are interested in extracting only one of them (the optimal K*).
Another caveat is that, since we shall try all N = (n−1 choose k−1) subsets, in order to ensure a correct answer with high probability, we need the subroutine determining the “correct” centroids to succeed with sufficiently high probability.
Definition 4.
An (α, q)-centroids procedure is an algorithm which, on input any optimal k-clustering C_1, …, C_k, with probability at least 1 − q returns a tuple
(c̃_1, …, c̃_k) = centroids(C_1, …, C_k)
of α-approximate centroids,
Cost(C_i, c̃_i) ≤ α·Cost(C_i, c_i),   where c_i = argmin_{x ∈ C_i} Cost(C_i, x).
For α = 1 , we have an exact centroid procedure that returns the optimal centroids c i of each cluster C i .
Definition 5.
Given a k-spanning tree T, an α-approximate removal is a subset of k − 1 edges such that the k-clustering obtained by removing these edges from T is α-approximate.
Theorem 4.
Suppose there exists an (α, q)-centroids procedure centroids using O⟨h_CEN, l_CEN⟩ high-low operations. Then, there exists an algorithm which, with probability at least 1 − q − 1/N, finds an α-approximate removal using O⟨N·h_CEN + log N, N·l_CEN + N·log N⟩ high-low operations in total.
Proof. 
Given any k-spanning tree T, we simply try all possible candidate subsets K of k − 1 edges of T, and find the one that corresponds to the optimum as follows. For each candidate subset K^(a), with a = 1, …, N, we proceed as follows:
  • Consider the corresponding clustering C_1^(a), …, C_k^(a). This can be done by simply inspecting the nodes of each connected component when removing the edges K^(a) from T.
  • Run the procedure centroids to compute (some) centroids for this candidate clustering, (c_1^(a), …, c_k^(a)) = centroids(C_1^(a), …, C_k^(a)). This gives a set of edges from each centroid c_i^(a) to the other elements of the cluster C_i^(a), for all clusters:
    E^(a) = {(x, y) ∈ E : x ∈ C_i^(a) and y = c_i^(a) for some i}.
We now observe two key properties of these (candidate) centroids. For each cluster C_i^(a), consider the sum of the distances to its candidate centroid c_i^(a), and observe that
Cost(C^(a)) ≤ d(E^(a)) = ∑_{i=1}^{k} ∑_{x ∈ C_i^(a)} d(x, c_i^(a)),
since c_i^(a) may not be the optimal centroid for C_i^(a). Moreover, for the optimal K^(opt) and the corresponding optimal k-clustering C_1^(opt), …, C_k^(opt), with probability at least 1 − q, centroids returns an α-approximate centroid c_i^(opt) for each cluster C_i^(opt), for all k clusters (Definition 4). That is, the truly optimal centroid c_i* = argmin_{x ∈ C_i^(opt)} Cost(C_i^(opt), x) satisfies
Cost(C_i^(opt), c_i^(opt)) ≤ α·Cost(C_i^(opt))
and therefore
d(E^(opt)) ≤ α·Cost(C^(opt)).
This implies that the candidate K^(min) = argmin_{K^(a)} d(E^(a)), which minimizes d(E^(a)) among the N candidates, is indeed an α-approximate removal:
Cost(C^(min)) ≤ d(E^(min)) ≤ d(E^(opt)) ≤ α·Cost(C^(opt)).
The total high-low cost to create the list of all N elements is O⟨N·h_CEN, N·l_CEN⟩. As observed above, with probability at least 1 − q, the minimum in this list yields an α-approximate removal. Each pairwise comparison between two elements, say d(E^(a)) and d(E^(b)), can be done with a single call to our oracles, either O_L(E^(a), E^(b)) or O_H(E^(a), E^(b)). We can thus apply Lemma 4 and, with probability at least 1 − 1/N, find the minimum element using O⟨log N, N·log N⟩ high-low cost operations. The overall high-low cost to find an α-approximate removal is thus
O⟨N·h_CEN + log N, N·l_CEN + N·log N⟩.
Finally, by the union bound, the probability that we actually find an α-approximate removal is at least 1 − q − 1/N. □
From the previous result, we can focus on constructing a suitable procedure centroids .

3.3.1. Exact Centroids and Exact Solutions

We begin by considering procedures to compute exact centroids ( α = 1 ), which automatically lead to optimal k-clustering. The first (naive) approach is simply to use only high-cost (exact) operations.

All at High-Cost

We want to find an implementation of centroids using only high-cost operations. For each cluster C_i, and two candidate points x, y ∈ C_i, we can compute
E_x = {(x, z) ∈ E | z ∈ C_i} and E_y = {(y, z) ∈ E | z ∈ C_i}   (3)
and, by calling O H ( E x , E y ) , we can decide which of these two points is a better centroid. By repeating this for all the points in one cluster, and for all the clusters, we can find the optimal k centroids using only O ( n ) high-cost operations in total. This shows the following.
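A minimal sketch of this linear scan follows; O_H is the exact oracle of Section 2.2, and the function name is ours.

```python
def exact_medoid(cluster, O_H):
    """Find the medoid of `cluster` using |cluster| - 1 high-cost comparisons."""
    cluster = list(cluster)
    best = cluster[0]
    for y in cluster[1:]:
        E_best = [(best, z) for z in cluster if z != best]   # star edge set of the current candidate
        E_y = [(y, z) for z in cluster if z != y]            # star edge set of the challenger
        if O_H(E_best, E_y) == +1:                           # d(E_best) > d(E_y): y is a better centroid
            best = y
    return best
```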
Observation 2.
The best subset K* can be found using O⟨N·n, 0⟩ high-low operations in total, where N = (n−1 choose k−1).
We next investigate how to reduce the number of high-cost operations by introducing low-cost ones (still aiming for exact centroids).

Using Low-Cost Operations

Similarly to what we discussed above, our low- and high-cost operations are calls to our oracles O_L(E_x, E_y) and O_H(E_x, E_y), where x, y ∈ C_i and E_x, E_y are as in (3). According to Lemma 4, for each cluster C_i consisting of n_i = |C_i| elements, we can find its optimal centroid (the minimum) using O⟨Δ_i·log n_i, n_i·log n_i⟩ high-low cost operations with probability at least 1 − 1/n_i^{Δ_i}. In order to optimize the overall success probability, we set
Δ_i = Δ·(log n)/(log n_i)   (4)
for a suitable Δ independent of i. The resulting procedure centroids has the following performance guarantees:
  • Success probability at least 1 − k/n^Δ. For each cluster C_i, the probability of finding the optimal centroid is at least 1 − (1/n)^Δ, since n_i^{Δ_i} = 2^{Δ_i·log n_i} = 2^{Δ·log n} = n^Δ. By the union bound over all k clusters, the success probability is at least 1 − k/n^Δ.
  • High-cost operations O(k·Δ·log n). The total high-cost of centroids for computing all centroids of any given partition C_1, …, C_k of X is
    O(∑_{i=1}^{k} Δ_i·log n_i) = O(k·Δ·log n).
  • Low-cost operations. We observe that this is
    O(∑_{i=1}^{k} n_i·log n_i) = O(n·log(n/k)),
    where these bounds can be easily obtained using Lagrange multipliers.
We have thus obtained a (1, q)-centroids procedure with q ≤ k/n^Δ ≤ 1/n^{Δ−1}, since k ≤ n. Taking Δ = 3·log N = O(k·log n), we get the following.
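The sketch below illustrates the resulting procedure under the assumption that noisy_sort_by_cost is a black box which approximately sorts the points of a cluster by Cost(C_i, ·) using only low-cost comparisons of star edge sets (maximum dislocation d); the exact medoid is then recovered from the short prefix with high-cost comparisons. All names are ours.

```python
def centroids_low_high(clusters, noisy_sort_by_cost, O_H, dislocation):
    """Exact centroids using O(d) high-cost queries per cluster plus noisy sorting."""
    result = []
    for C in clusters:
        order = noisy_sort_by_cost(C)        # cheapest candidates first, up to dislocation d
        prefix = order[: 1 + dislocation]    # the true medoid must lie in this prefix
        best = prefix[0]
        for y in prefix[1:]:                 # exact scan over the short prefix only
            E_best = [(best, z) for z in C if z != best]
            E_y = [(y, z) for z in C if z != y]
            if O_H(E_best, E_y) == +1:
                best = y
        result.append(best)
    return result
```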
Observation 3.
The best subset K* can be found using O⟨N·(k·log n)², N·n·log(n/k)⟩ high-low operations in total, where N = (n−1 choose k−1).
Remark 4.
Note that the bottleneck in the number of high-cost operations is the repeated application of the procedure centroids for all N possible candidate subsets of edges. Even an improved implementation of centroids using O(1) high-cost operations would still result in an overall O(N) high-cost. On the other hand, once all possible N = (n−1 choose k−1) candidate edge sets E^(a) have been created, one could find the best one using only O⟨log N, N·log N⟩ high-low cost via Lemma 4. Thus, a much smaller O(log N) high-cost is sufficient for this second step, as compared to the O(N) high-cost.
In the next section, we shall reduce the high-cost significantly by introducing an approximate computation of the centroids which uses zero high-cost operations. We shall see that this results in a (small) factor approximation in the solution.

3.3.2. Approximate Centroids and Approximate Solutions

As we have seen in the previous section (Remark 4), it is quite easy to reduce the number of high-cost queries used to find the best subset once we have computed the function centroids for every partition. The problem is that, even if we improve the high-cost of the function centroids to a constant, since we apply it to O(N) different partitions, we still have an expensive total high-cost. For this reason, we now consider a method to compute approximate centroids without using any high-cost operation, that is, an approximate implementation of centroids with only low-cost operations.
Though this may not give us an exact solution, we shall prove that it leads to a 2-approximate clustering. The next result is central to this section.
Lemma 7
(centroid approximation). Given a metric space (X, d), for any cluster C_i ⊆ X and for any x ∈ C_i, it holds that
Cost(C_i, x) ≤ (1 + (n_i − 2)/(n_i − D_i(x)))·Cost(C_i)
where n_i = |C_i| and D_i(x) is the number of points y in C_i which are a better centroid than x,
D_i(x) := |{y ∈ C_i : Cost(C_i, y) < Cost(C_i, x)}|.
Proof. 
Let c_i* = argmin_{z ∈ C_i} Cost(C_i, z) be the optimal centroid of C_i, and let O = Cost(C_i) = Cost(C_i, c_i*) be the corresponding optimal cost. Observe that, for every c_i ∈ C_i, we have
Cost(C_i, c_i) = ∑_{y ∈ C_i} d(y, c_i) = ∑_{y ∈ C_i, y ≠ c_i} d(y, c_i) ≤ ∑_{y ∈ C_i, y ≠ c_i} (d(y, c_i*) + d(c_i*, c_i)) = (∑_{y ∈ C_i, y ≠ c_i} d(y, c_i*)) + (n_i − 1)·d(c_i*, c_i) = ∑_{y ∈ C_i} d(y, c_i*) + (n_i − 2)·d(c_i, c_i*) = O + (n_i − 2)·d(c_i, c_i*).
Now, let Ô = Cost(C_i, x), and let us partition C_i into the following two sets:
H⁻ := {y ∈ C_i : Cost(C_i, y) < Cost(C_i, x)},   H⁺ := {y ∈ C_i : Cost(C_i, y) ≥ Cost(C_i, x)}.
By definition, |H⁻| = D_i(x) and |H⁺| = n_i − D_i(x). Moreover, for any y ∈ H⁺, we have that
Ô = Cost(C_i, x) ≤ Cost(C_i, y) ≤ O + (n_i − 2)·d(y, c_i*),
that is,
d(y, c_i*) ≥ (Ô − O)/(n_i − 2).
Using the fact that the distances between two points are nonnegative, we get the following lower bound on the optimal cost of the cluster,
O = ∑_{y ∈ C_i} d(y, c_i*) = ∑_{y ∈ H⁻} d(y, c_i*) + ∑_{y ∈ H⁺} d(y, c_i*) ≥ ∑_{y ∈ H⁺} d(y, c_i*) ≥ ∑_{y ∈ H⁺} (Ô − O)/(n_i − 2) = |H⁺|·(Ô − O)/(n_i − 2) = ((n_i − D_i(x))/(n_i − 2))·(Ô − O)
and, by rearranging the terms, we obtain
Ô ≤ (1 + (n_i − 2)/(n_i − D_i(x)))·O,
which completes the proof. □
Intuitively, by combining the lemma above with Lemma 2, our approximation depends on the dislocation D i ( x ) that the sorting procedure (using only low-cost operations) guarantees when applied to the n i = | C i | elements of cluster C i . This leads to the following result.
Lemma 8.
For any parameter Δ ≥ 0, there exists an (α, q)-centroids procedure using O⟨0, n·log n⟩ high-low cost operations, with
q ≤ k·n^{−Δ} and α ≤ 1 + (n_min − 2)/(n_min − O(Δ·log n)),
where n_min is the size of the smallest cluster in the optimal k-clustering C_1, …, C_k, that is, n_min = min_i{n_i} for n_i = |C_i|.
Proof. 
We choose Δ_i as in the previous subsection, Equation (4), so that Δ_i·log n_i = Δ·log n for a fixed Δ ≥ 0 independent of i. By Lemma 2, for each cluster C_i, with probability at least 1 − n_i^{−Δ_i} = 1 − n^{−Δ}, we can find an element x ∈ C_i such that D_i(x) ≤ O(Δ_i·log n_i) = O(Δ·log n). By the union bound, the overall probability of computing such points for all k clusters is at least 1 − k·n^{−Δ}, hence q = k·n^{−Δ}. Note also that we use only low-cost operations; in particular, the total number of low-cost operations is ∑_{i=1}^{k} O(n_i·log n_i) = O(n·log n).
We next argue about the approximation guarantee. By Lemma 7, for each cluster C_i, the corresponding centroid x returned by our procedure satisfies
Cost(C_i, x)/Cost(C_i) ≤ 1 + (n_i − 2)/(n_i − D_i(x)) = 1 + (n_i − 2)/(n_i − O(Δ·log n)).
The latter is maximized when n_i is smallest, that is, for n_i = n_min = min_i{n_i}. □
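The zero-high-cost procedure of Lemma 8 is then just the first step of the previous sketch, with the exact scan dropped (assuming the same hypothetical noisy_sort_by_cost black box):

```python
def approximate_centroids(clusters, noisy_sort_by_cost):
    # The first element of each approximately sorted cluster has dislocation
    # D_i(x) = O(Delta * log n); by Lemma 7 it is therefore an approximate medoid
    # with factor 1 + (n_i - 2) / (n_i - D_i(x)), using low-cost queries only.
    return [noisy_sort_by_cost(C)[0] for C in clusters]
```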

3.4. Dynamic Programming

A dynamic programming algorithm to find the best clustering given a k-spanning tree can be found in [3]. In this section, we shall simplify their dynamic program (which works for a more general class of problems) in order to adapt it to our oracle model.
In general, it is rather difficult to work with “approximate” solutions in dynamic programming, since the errors accumulate during the execution of the algorithm, and we do not have a clear way to control this (for a rare example of a dynamic programming algorithm working with errors, see [31]). Therefore, our approach will be to always produce correct intermediate results, while still trying to limit the number of high-cost operations.

3.4.1. Basic Notation and Adaptations

First of all, observe that the algorithm in [3] applies to a more general problem in which we have a generic function g(u, d) applied to the distance function d, and a function f(c) that indicates the cost of having c as a centroid. Therefore, we instantiate these two functions according to our problem, that is, we consider g(u, d) ≡ d and f(c) ≡ 0.
Before starting with the actual algorithm, we have to transform our tree T_X = (X, E_X) into a binary tree B_X = (V_B, E_B) using the procedure described in [3] (select an arbitrary node as the root and then add nodes in a top-down fashion to obtain a binary tree); a sketch is given below. In the following, we use X as the set of the original points and U as the set of the points that we add, so that V_B = X ∪ U. One can easily check that, for a node x that has c(x) children, we have to add c(x) − 2 nodes if x has more than two children, and one node if c(x) equals one. Note that the number of nodes is |V_B| = |X ∪ U| = O(n), and thus we can still denote by n the input size and count the number of queries as a function of n.
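The sketch below shows one way to perform this binarization (our reconstruction, not the exact procedure of [3]): children of a high-degree node are chained through dummy nodes so that every node has at most two children. The dummy-node labels are our own convention, and, for simplicity, single-child nodes are left untouched here, whereas the construction described above also pads them.

```python
def binarize(children, root):
    """`children` maps each node to its list of children; returns the binarized child map."""
    bin_children, dummies, counter = {}, set(), [0]

    def fresh():
        counter[0] += 1
        label = ("dummy", counter[0])    # hypothetical label scheme for the added nodes U
        dummies.add(label)
        return label

    def visit(u):
        kids = children.get(u, [])
        if len(kids) <= 2:
            bin_children[u] = list(kids)
        else:
            # keep the first child, delegate the rest to a chain of c(u)-2 dummy nodes
            d = fresh()
            bin_children[u] = [kids[0], d]
            rest = kids[1:]
            while len(rest) > 2:
                nd = fresh()
                bin_children[d] = [rest[0], nd]
                d, rest = nd, rest[1:]
            bin_children[d] = list(rest)
        for v in kids:
            visit(v)

    visit(root)
    return bin_children, dummies
```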
We denote by L the set that contains all the leaves of B_X, and we create two sequences S and S′ of the nodes in reverse breadth-first order from the root, where S′ contains only the non-leaf nodes.
Then, we build a data structure Cost where, at each key (u, j, c), we encode information to evaluate the optimal solution for partitioning the subtree B_u rooted at u into j ≤ k clusters with c being the centroid of (the cluster containing) node u (here B_u is the set of nodes in the subtree rooted at u). In particular, for every key (u, j, c), the data structure Cost contains a set E(u, j, c) of edges of the tree that represents the connections of the nodes in the subtree B_u to their clusters in that particular solution, and an integer variable s(u, j, c) that represents how many of these centers are from U, and thus how “illegal” the solution is (we have to add this variable since we set f to be always equal to zero). This is in contrast with the original dynamic programming in [3], where only a single integer is associated with every key.
The last thing we have to do is to modify the output of the algorithm. In [3], only the computation of the total cost of the optimal clustering is described. This of course cannot be done in our model, since we have no information about the absolute distances between the points. Thus, we have to further modify the algorithm in order to obtain the cluster centroids as output, so that we can easily compute the clusters by associating each point with the closest centroid. In order to achieve that, we create a second data structure Centroid indexed by keys (u, j, c), where each entry contains a set of the best centroids for partitioning the subtree rooted at u into j clusters.

3.4.2. The Actual Algorithm

We are now in a position to describe the dynamic programming algorithm based on the binary tree B X and the two data structures described above.

Initialization

Similarly to [3], we have to initialize the leaves. In particular, in the data structure Cost we associate with each key (u, 1, c), where u ∈ L and c ∈ X, a list that contains only the edge (u, c), and the variable s is set to 1 if c ∈ U, and to zero otherwise. With all the keys (u, j, c), with u ∈ L, c ∈ X and j > 1, we associate an empty set, and 1 is assigned to the variable s. Furthermore, in the second data structure Centroid, we associate an empty set with all the previous keys.

The Algorithm

We traverse the keys in order: for j from 1 to k, for each node u ∈ S′ and each c ∈ S, we compute the optimum for the key (u, j, c). In the original dynamic programming [3], this is done by searching for the couple (l, j_1, c_1) + (r, j_2, c_2) that corresponds to the best value, where l and r are the two children of u in the binary tree. Since we do not have access to the individual value of a couple of keys, we create a list containing all possible couples (l, j_1, c_1) + (r, j_2, c_2), and then we extract the best solution. In order to perform the latter task, we have to be able to compare two candidate couples for the same left and right children l and r, say
(l, j_1, c_1) + (r, j_2, c_2) and (l, j_1′, c_1′) + (r, j_2′, c_2′).   (5)
We show that this can be done by using a single call of our oracle in the following way. For a generic couple, ( l , j 1 , c 1 ) + ( r , j 2 , c 2 ) , the data structure associated with the two elements of this couple consists of
$(E(l, j_1, c_1),\ s(l, j_1, c_1)) = Cost(l, j_1, c_1), \qquad (E(r, j_2, c_2),\ s(r, j_2, c_2)) = Cost(r, j_2, c_2).$
To express the dynamic programming, it is useful to introduce the following auxiliary functions depending on ( u , c ) :
$E((l, j_1, c_1) + (r, j_2, c_2)) := E(l, j_1, c_1) \cup E(r, j_2, c_2),$
$s((l, j_1, c_1) + (r, j_2, c_2)) := s(l, j_1, c_1) + s(r, j_2, c_2),$
$f_{c,\hat c} := \begin{cases} 1 & \text{if } c = \hat c \text{ and } c \in U,\\ 0 & \text{otherwise.}\end{cases}$
We are now in a position to express how to choose between the two couples in (5). Consider
$E = E((l, j_1, c_1) + (r, j_2, c_2)), \qquad s = s((l, j_1, c_1) + (r, j_2, c_2)) - f_{c,c_1} - f_{c,c_2},$
$E' = E((l, j_1', c_1') + (r, j_2', c_2')), \qquad s' = s((l, j_1', c_1') + (r, j_2', c_2')) - f_{c,c_1'} - f_{c,c_2'}.$
If $s \ne s'$, then we choose the couple corresponding to the minimum of these two values. If instead $s = s'$, then we evaluate the corresponding costs by performing an oracle query on E and E′ after removing all edges incident to some point in U. In this case, we return the couple with the minimum cost.
The above criterion extends naturally to subsets of couples, and it will replace the basic ‘min’ operation in the original dynamic programming formulation [3].
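Continuing the sketch above, the following hypothetical Python helpers show how the criterion for the couples in (5) and the tie-breaking oracle query could be implemented; the linear tournament in best_couple merely stands in for the high-low cost extraction of Lemma 4, and the oracle interface oracle_cost_cmp is an assumption of ours.

```python
def combine(Cost, left_key, right_key):
    """Auxiliary quantities E(.) and s(.) for a couple of child keys."""
    E_l, s_l = Cost[left_key]
    E_r, s_r = Cost[right_key]
    return E_l | E_r, s_l + s_r

def f(c, c_hat, U):
    """Indicator f_{c, c_hat}: 1 iff c = c_hat and c is an added node of U."""
    return 1 if (c == c_hat and c in U) else 0

def better_of(Cost, U, c, couple_a, couple_b, oracle_cost_cmp):
    """Choose between two candidate couples for the same key (u, j, c).

    couple_a, couple_b     -- pairs (left_key, right_key), e.g.
                              ((l, j1, c1), (r, j2, c2))
    oracle_cost_cmp(E, E') -- returns -1 if the solution encoded by E is
                              cheaper than the one encoded by E', +1 otherwise
                              (edges touching U are removed before querying).
    """
    E_a, s_raw_a = combine(Cost, *couple_a)
    E_b, s_raw_b = combine(Cost, *couple_b)
    s_a = s_raw_a - f(c, couple_a[0][2], U) - f(c, couple_a[1][2], U)
    s_b = s_raw_b - f(c, couple_b[0][2], U) - f(c, couple_b[1][2], U)
    if s_a != s_b:                       # fewer "illegal" centers wins
        return couple_a if s_a < s_b else couple_b
    strip = lambda E: {e for e in E if e[0] not in U and e[1] not in U}
    return couple_a if oracle_cost_cmp(strip(E_a), strip(E_b)) <= 0 else couple_b

def best_couple(Cost, U, c, couples, oracle_cost_cmp):
    """Tournament over the candidate couples (replaces the basic 'min')."""
    best = couples[0]
    for cand in couples[1:]:
        best = better_of(Cost, U, c, best, cand, oracle_cost_cmp)
    return best
```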
Having determined the optimal couple $(l, j_1^*, c_1^*) + (r, j_2^*, c_2^*)$ for the key (u, j, c), we set the corresponding entries of the data structures Cost and Centroid as follows:
$E(u, j, c) = E((l, j_1^*, c_1^*) + (r, j_2^*, c_2^*)) \cup \{(u, c)\},$
$s(u, j, c) = s((l, j_1^*, c_1^*) + (r, j_2^*, c_2^*)) + f_{c,c},$
$Centroid(u, j, c) = Centroid(l, j_1^*, c_1^*) \cup Centroid(r, j_2^*, c_2^*) \cup \{c_1^*, c_2^*\}.$
After we have filled in all possible keys (u, j, c), we simply have to find the best value $c^*$ among all keys (root, k, c) using the same method, and return the set associated with $Centroid(root, k, c^*)$.

Analysis of High-Low Cost

The total running time is determined by the central part of the algorithm. Let $\ell_{u,j,c}$ be the length of the list containing the candidate couples associated with the key (u, j, c). We have to extract the minimum every time, which can be done using $O\langle \log\ell,\ \ell\log\ell\rangle$ high-low cost, for $\ell = \ell_{u,j,c}$ (Lemma 4). Since the length $\ell_{u,j,c}$ is $O(j\,|B_{u_l}|\,|B_{u_r}|) = O(kn^2)$, the overall high-low cost is
$\sum_{j=1}^{k}\sum_{u\in S'}\sum_{c\in S} O\langle \log\ell_{u,j,c},\ \ell_{u,j,c}\log\ell_{u,j,c}\rangle = \sum_{j=1}^{k}\sum_{u\in S'}\sum_{c\in S} O\langle \log(kn^2),\ kn^2\log(kn^2)\rangle = O\langle kn^2\log n,\ k^2 n^4\log n\rangle.$
Though the high-cost is worse than in the previous result, for large k this bound is exponentially better in terms of low-cost. Finally, observe that we can invoke Lemma 4 with parameter $\Delta = \Theta(\log n)$ so that each of these operations fails with probability at most $n^{-a}$ for a sufficiently large constant a; the overall procedure then succeeds with high probability (union bound), and the overall complexity is as above (again using $\ell_{u,j,c} = O(kn^2)$).

4. Same Cluster Queries

In this section, we investigate so-called same-cluster queries: given two points of our metric space (X, d), the corresponding same-cluster query returns "yes" if and only if the two points belong to the same cluster in the optimal solution. Of course, this type of query is extremely powerful, though rather difficult to implement in general. In Section 4.1, we first show that, under certain conditions, we can simulate same-cluster queries in our model. In Section 4.2, we present algorithms which find the optimal k-clustering using "few" same-cluster queries; these algorithms use same-cluster queries as a "black box" and can thus be applied to any setting where such queries are available, possibly at an even higher cost than the comparisons of our model.

4.1. Small-Radius and Same-Cluster Queries

Observe that Corollary 1 says that, if two points are not in the same cluster, then their distance must be larger than (γ − 1) times the distance of one of the two points to its optimal centroid. This condition does not automatically give a same-cluster query, because it requires knowledge of the optimal centroids and it does not give an "if and only if" characterization. As we show below, under some additional assumptions, we can obtain a characterization of two points being in the same cluster which does not involve the optimal centroids (see the next definition and lemma). This in turn implies that we can build a same-cluster query.
Definition 6
(γ-radius optimal clusters). Let $d_{\max}(\mathrm{centroid})$ be the maximum radius of the clusters in the optimal solution, that is, $d_{\max}(\mathrm{centroid}) = \max_{x\in X} d(x, c^*_{i(x)})$, where $c^*_{i(x)}$ is the centroid of the cluster containing x in the optimal solution. We say that a γ-perturbation-stable metric space (X, d) has γ-radius optimal clusters if
$2\, d_{\max}(\mathrm{centroid}) \le (\gamma - 1)\, d_{\min}(X),$
where $d_{\min}(X) = \min_{x\ne y} d(x,y)$ is the minimum distance between two points.
Lemma 9
(same-cluster condition). For any γ-perturbation-stable metric space (X, d) which has γ-radius optimal clusters, with γ ≥ 2, the following holds. For every x, y ∈ X,
$d(x,y) \le (\gamma - 1)\, d_{\min}(X) \iff x, y \in C_i^* \text{ for some } i,$
where $C_1^*, \ldots, C_k^*$ is the optimal solution.
Proof. 
We first prove the (⟹) direction:
$d(x,y) \le (\gamma-1)\, d_{\min}(X) \;\Rightarrow\; d(x,y) \le (\gamma-1)\, d(x, c^*_{i(x)}) \;\Rightarrow\; y \in C^*_{i(x)},$
where $C^*_{i(x)}$ and $c^*_{i(x)}$ are the cluster and the centroid of x in the optimal solution, respectively, and the second implication is given by the contrapositive of Corollary 1.
In order to prove the other direction (⟸), consider arbitrary x and y in the same cluster with centroid $c^*$ and observe that
$d(x,y) \le d(x,c^*) + d(y,c^*) \le \frac{d_{\min}(X)(\gamma-1)}{2} + \frac{d_{\min}(X)(\gamma-1)}{2} = d_{\min}(X)(\gamma-1),$
where the first inequality is by triangle inequality (metric space) and the second by the γ -radius optimal clusters assumption (Definition 6). □

Implementing a Same-Cluster Query

According to Lemma 9, our same-cluster query, on input two points x and y, simply checks whether $d(x,y) \le (\gamma-1)\, d_{\min}(X)$ holds. We are now in a position to construct our same-cluster query. In order to do so, we shall allow more general queries that compare pairwise distances multiplied by scalars, that is,
$O_H^{(\alpha)}(e_1, e_2) = \begin{cases} +1 & \text{if } d(e_1) > \alpha\, d(e_2),\\ -1 & \text{otherwise,}\end{cases} \qquad (6)$
for any two pairs of points $e_1 = (x_1, y_1)$ and $e_2 = (x_2, y_2)$. The cheap but erroneous counterpart $O_L^{(\alpha)}(\cdot)$ is defined similarly, and its answers are wrong with the same probability $p \le 1/16$ as in our oracle $O_L(\cdot)$. Observe that we can check whether $d(x,y) \le (\gamma-1)\, d_{\min}(X)$ with one call to the above oracle, taking $e_1 = (x,y)$ and $e_2 = e_{\min} = \arg\min_{x\ne y} d(x,y)$, the pair of points at minimum distance, so that $d_{\min}(X) = d(e_{\min})$. Lemma 9 thus implies the following:
Lemma 10.
For every γ-perturbation-stable metric space (X, d) which has γ-radius optimal clusters, with γ ≥ 2, the following query
$SCQ_H^{(\gamma)}(x,y) := O_H^{(\gamma-1)}((x,y),\ e_{\min})$
is an exact same-cluster query, where $e_{\min} = \arg\min_{x\ne y} d(x,y)$. The analogous cheap but erroneous query $SCQ_L^{(\gamma)}(x,y) := O_L^{(\gamma-1)}((x,y),\ e_{\min})$ is a same-cluster query that is correct with probability at least $1-p$.
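The construction of Lemma 10 can be summarized in a few lines of Python. This is a sketch under our own interface assumptions for the oracles; in the paper, $e_{\min}$ is found with the high-low cost routine mentioned in Observation 4 below, while here a naive scan with exact comparisons is used purely for illustration.

```python
from itertools import combinations

def scq_factory(points, gamma, oracle_H, oracle_L):
    """Build exact and noisy same-cluster queries from scaled comparison oracles.

    oracle_H(alpha, e1, e2) -- exact: +1 if d(e1) > alpha * d(e2), else -1
    oracle_L(alpha, e1, e2) -- noisy counterpart (wrong with probability <= p)
    """
    # Find the pair e_min at minimum distance once and for all
    # (naive scan with exact comparisons, for illustration only).
    pairs = list(combinations(points, 2))
    e_min = pairs[0]
    for e in pairs[1:]:
        if oracle_H(1, e_min, e) == +1:   # d(e_min) > d(e): e is smaller
            e_min = e

    def scq_H(x, y):
        # "same cluster" iff d(x, y) <= (gamma - 1) * d_min(X)
        return oracle_H(gamma - 1, (x, y), e_min) == -1

    def scq_L(x, y):
        return oracle_L(gamma - 1, (x, y), e_min) == -1

    return scq_H, scq_L
```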
Note that these queries are similar to those used in [8,9], though those works consider "triples" of points, $e_1 = (x,y)$ and $e_2 = (x,z)$, and the error model is different. We next discuss how expensive it is to build such queries and whether they can be built from our initial query model.
Observation 4.
Observe that the pair of points $e_{\min} = \arg\min_{x\ne y} d(x,y)$ can be determined once and for all using $O\langle \log n,\ n^2\log n\rangle$ high-low cost (Lemma 4). Therefore, any algorithm using SCQs requires only this additional high-low cost to be implemented.
Observation 5.
For γ = 2, the above query is a simple comparison between two distances. Moreover, for any integer γ ≥ 2, the query can be simulated by considering a multiset of pairs:
$SCQ_{type}(x,y) := O_{type}(\{(x,y)\},\ E_\gamma), \qquad type \in \{L, H\},$
where $E_\gamma$ consists of γ − 1 copies of $e_{\min}$.
If an oracle as in (6) is available, then each exact/erroneous SCQ corresponds to one high-/low-cost operation. Alternatively, if we allow multiset comparisons in our original oracle model, the same also holds (without appealing directly to the oracle in (6)). Finally, for γ = 2, we can simply use the original oracle.
Remark 5.
Though every γ-perturbation-stable instance with γ ≥ 2 is also a 2-perturbation-stable instance, since we require the additional 'γ-radius optimal clusters' condition, there might be instances that satisfy the hypothesis needed to obtain the SCQs above only for some γ > 2.
In the sequel, we shall devise algorithms that use as few SCQs as possible. In their analysis, we shall then account explicitly for the overall number of such queries and for the remaining high-low cost operations derived from queries to our original oracle model. Whenever the original oracle model suffices to simulate $SCQ_H$ and/or $SCQ_L$, each SCQ corresponds to a single operation (high or low cost).

4.2. An Algorithm Using Few SCQs

We next describe a simple algorithm which uses far fewer SCQs, though a comparable number of high-low cost operations. The algorithm works for certain instances, and its performance is summarized in the following theorem.
Theorem 5.
For any γ ≥ 2 and for any positive integer $n_1$, there exists an algorithm with the following performance guarantees. The algorithm computes the optimal k-clustering, with probability at least $1 - \frac{3}{n}$, on input any γ-perturbation-stable metric space (X, d) which has γ-radius optimal clusters and in which every cluster in the optimal solution has size at least $n_1$. Moreover, the algorithm uses only the following number of exact same-cluster queries and high-low cost operations, parameterized in $p_1 = \frac{n_1}{n}$:
  • The algorithm uses $O\!\left(\frac{k \log k}{p_1}\right)$ exact same-cluster queries, which is independent of n for approximately balanced clusters, that is, $n_1 = \Theta(n/k)$.
  • The additional high-low cost operations used by the algorithm are
    $O\left\langle \frac{(\log n)^2}{p_1},\ \left(\frac{\log k}{p_1} + n\right) k \log n \right\rangle. \qquad (7)$

4.2.1. The Algorithm

We first observe that Corollary 1, in addition to the condition used to simulate same-cluster queries in Lemma 9, implies another useful property that we can use to reduce the number of high-cost queries. Intuitively, under the same conditions of Lemma 9, we have that every point is closer to all points in its own cluster than to any of the points outside.
Corollary 3.
For any γ-perturbation-stable metric space (X, d) that has γ-radius optimal clusters, with γ ≥ 2, the following holds. For every optimal cluster $C^*$, every $x, y \in C^*$ and every $z \notin C^*$, it holds that $d(x,y) < d(x,z)$.
Proof. 
Let x, y, z be arbitrary points of X, with x, y in the same cluster $C^*$ with centroid $c^*$ and $z \notin C^*$. Then, we have that
$d(x,y) \le (\gamma-1)\, d_{\min}(X) \le (\gamma-1)\, d(x, c^*) < d(x,z),$
where the last inequality is given by Corollary 1. □
Remark 6
(main intuition). Corollary 3 means that, in order to identify a cluster $C_i^*$, instead of its optimal centroid we can simply determine an arbitrary point $x_i \in C_i^*$ and use this point as a (representative) centroid of the cluster $C_i^*$, for all k clusters. Then, we can map every other point to the closest (representative) centroid.
The above result suggests the following natural algorithm (the implementation of the single steps is described below together with the analysis):
[Coupon–Collector Clustering algorithm: pseudocode figure. Step 1 collects one representative per cluster via exact same-cluster queries; Step 2 samples further points to populate every cluster; Step 3 assigns the remaining points by low-cost majority voting.]
The idea is to pick random points from X until we have a different centroid for each cluster, and then simply map all the remaining points to the closest centroid. In the first step, for every new point we extract, we use the (exact) same-cluster query $SCQ_H$ to determine whether this point belongs to one of the clusters for which we have already found a point.
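The following Python sketch summarizes the three steps. The helper routines scq_H, assign_whp, and majority_vote are assumed interfaces for the exact same-cluster query, the Step-2 assignment based on Lemma 4, and the Step-3 low-cost tournament analyzed below; the constants follow Corollary 4.

```python
import math
import random

def coupon_collector_clustering(points, k, n1, scq_H, assign_whp, majority_vote):
    """Outline of the Coupon-Collector Clustering algorithm (sketch).

    n1                  -- assumed lower bound on the smallest cluster size
    scq_H(x, y)         -- exact same-cluster query
    assign_whp(x, reps) -- index of the representative closest to x (Step 2)
    majority_vote(x, clusters) -- index of the winning cluster (Step 3)
    """
    n = len(points)

    # Step 1: sample with replacement until we hold one representative per
    # cluster; exact SCQs detect whether a sample hits a new cluster.
    reps = []
    while len(reps) < k:
        x = random.choice(points)
        if all(not scq_H(x, r) for r in reps):
            reps.append(x)
    clusters = [[r] for r in reps]

    # Step 2: sample m further points without replacement (Corollary 4:
    # C >= 8 log n and m = 2C/p_1 give >= C points per cluster w.h.p.)
    # and assign each of them to the closest representative.
    C = math.ceil(8 * math.log(n))
    rep_set = set(reps)
    remaining = [p for p in points if p not in rep_set]
    m = min(len(remaining), math.ceil(2 * C * n / n1))
    sampled = random.sample(remaining, m)
    for x in sampled:
        clusters[assign_whp(x, reps)].append(x)

    # Step 3: map every remaining point by low-cost majority voting
    # against the points collected in each cluster.
    seen = rep_set | set(sampled)
    for x in (p for p in points if p not in seen):
        clusters[majority_vote(x, clusters)].append(x)
    return clusters
```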

Analysis of the First Step

The analysis is based on the well-known Coupon Collector's Problem (CCP) [32,33]. In this step, we can safely extract points with replacement as in the CCP (see Lemma 11 below), since extracting the same point more than once has no effect on the stopping condition (all clusters must have at least one point, and picking the same point, or several points from the same cluster, does not affect this condition).
Lemma 11.
Consider the process in which we repeatedly draw a coupon from a finite collection of k coupons according to a probability distribution $p = (p_1, p_2, \ldots, p_k)$, meaning that, at every step, $p_h \in (0,1)$ is the probability of picking the h-th coupon. Then, the following bounds on the expected number of steps $C(p)$ required to collect at least one copy of each of the k coupons are known (for $p_1 \le p_2 \le \cdots \le p_k$). For the (approximately) uniform case, that is, $p_i = \Theta(\frac{1}{k})$,
$C(p) = \Theta(k \log k)$
and, more generally, the following upper bounds hold (see [32]):
$C(p) \le \frac{H_k}{p_1} = O\!\left(\frac{\log k}{p_1}\right); \qquad C(p) \le \sum_{j=1}^{k} \frac{1}{j \cdot p_j}; \qquad C(p) \le \sum_{j=1}^{k} \frac{1}{\sum_{\ell=1}^{j} p_\ell},$
where $H_k$ is the k-th harmonic number.
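For intuition, the first two upper bounds of Lemma 11 are easy to evaluate numerically; the following small Python sketch does just that (we omit the third bound and make no claim of tightness).

```python
import math

def ccp_upper_bounds(p):
    """Evaluate the first two upper bounds of Lemma 11 on C(p);
    the probabilities in p must be sorted in non-decreasing order."""
    k = len(p)
    H_k = sum(1.0 / j for j in range(1, k + 1))
    bound_harmonic = H_k / p[0]                                   # H_k / p_1
    bound_weighted = sum(1.0 / (j * pj) for j, pj in enumerate(p, start=1))
    return min(bound_harmonic, bound_weighted)

# Uniform example: p_i = 1/k gives ~ k * H_k = Theta(k log k) draws.
print(round(ccp_upper_bounds([0.1] * 10), 2))   # ~29.29 for k = 10
```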
Lemma 12.
The expected number of exact SCQs in Step 1 of the Coupon–Collector Clustering algorithm is $O(k\,C(p))$, where $p = (p_1, p_2, \ldots, p_k)$ is given by $p_i = n_i/n$ with $n_i = |C_i^*|$, and $C_1^*, \ldots, C_k^*$ is the optimal k-clustering.
Proof. 
We simply observe that, for each point we extract from X, the probability that it belongs to cluster $C_i^*$ is $n_i/n = p_i$. Moreover, for every newly picked point, we have to perform a same-cluster query $SCQ_H$ with the centroid of each cluster that we have already found, in order to be sure that the new point really belongs to a new cluster. This costs at most the number of clusters, k, per extracted point. □
Observation 6.
The expected number of exact SCQs in Step 1 is
$O(k\,C(p)) = O\!\left(k \cdot \min\left\{\frac{\log k}{p_1},\ \sum_{j=1}^{k}\frac{1}{j\cdot p_j},\ \sum_{j=1}^{k}\frac{1}{\sum_{\ell=1}^{j} p_\ell}\right\}\right) \le O\!\left(\frac{k\log k}{p_1}\right).$

Analysis of the Second Step

In this step, we extract points without replacement. The analysis is essentially based on the generalization of the CCP in which we want to obtain C ≥ 1 copies of each of the k coupons. This problem is known as the double Dixie cup problem [33,34]. The variant in which points are chosen without replacement has recently been used for a similar clustering problem in [30]. We borrow their concentration result:
Lemma 13
(Theorem 3.3 in [30]). For any subset of k clusters of sizes $n_1 \le n_2 \le \cdots \le n_k$, the following holds. Suppose m points are sampled uniformly at random without replacement from X, and denote by $S_i$ the number of samples that fall in cluster i after this process. Then, the probability that each cluster is filled with at least $C = \frac{m\, p_1}{2}$ points is bounded as
$\Pr\left\{\min_i S_i \ge C\right\} \ge 1 - k \exp\!\left(-\frac{C}{4}\right),$
where $p_1 = \frac{n_1}{n}$ and $n_1$ is the size of the smallest cluster among the clusters under consideration.
Corollary 4.
For any $C \ge 8 \log n$, by sampling at least $m = \frac{2C}{p_1} = \frac{2Cn}{n_1}$ points, the probability that every cluster contains at least C of the selected points is at least $1 - \frac{k}{n^2} \ge 1 - \frac{1}{n}$.
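As a quick numerical illustration of Corollary 4 (a sketch; the constant 8 and the formula $m = 2C/p_1$ are taken from the statement above, while the example values of n and $n_1$ are arbitrary):

```python
import math

def step2_sample_size(n, n1):
    """Sample size m from Corollary 4: with C >= 8 log n and m = 2C/p_1,
    every cluster receives at least C samples with probability >= 1 - 1/n."""
    C = math.ceil(8 * math.log(n))
    p1 = n1 / n
    m = math.ceil(2 * C / p1)
    return C, m

C, m = step2_sample_size(10_000, 500)   # n = 10,000 points, n_1 = 500
print(C, m)                             # with natural log: C = 74, m = 2960
```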
Given that each cluster contains at least C selected points, the only thing left to do is to map all selected points to their correct clusters. For an arbitrary point x, we have to find the closest centroid. By Lemma 4, this requires $O\langle \Delta\log k,\ k\log k\rangle$ high-low cost for each point x, and the probability of success is at least $1 - \frac{1}{k^\Delta}$. By the union bound, all $m = \Theta(\frac{n\log n}{n_1})$ points are then mapped to the correct cluster with probability at least $1 - \frac{m}{k^\Delta}$. Therefore, by Corollary 4, the union bound implies that Phase 2 terminates with at least C points in each cluster of the optimal solution with probability at least $1 - \frac{1}{n} - \frac{m}{k^\Delta}$. For $\Delta = \frac{\log(nm)}{\log k}$, we have $\frac{m}{k^\Delta} = \frac{1}{n}$, and the previous probability is at least $1 - \frac{2}{n}$. Moreover, the total high-low cost of this step is
$m \cdot O\langle \Delta\log k,\ k\log k\rangle = O\langle m\log(mn),\ m k\log k\rangle$
and, since $m = \Theta(\frac{n\log n}{n_1}) = \Theta(\frac{\log n}{p_1})$, we have $\log(mn) = O(\log n)$, that is,
$= O\left\langle \frac{(\log n)^2}{p_1},\ \frac{\log n}{p_1}\, k\log k\right\rangle.$
Remark 7.
Similarly to what was done in the previous sections, one might think of using no high-cost operations at all by introducing some approximation. In this case, however, this would mean assigning a point to an incorrect cluster, which might be a problem: assigning a point even to the second-closest centroid may result in an unbounded error.

Analysis of the Third Step

Recall that every point is closer to all the points in its optimal cluster than to any point outside that cluster (Corollary 3). Since we have found C centroids for each cluster, for every remaining point x we can assign it to a cluster by a simple "majority voting" scheme using only low-cost operations. Specifically, for any two clusters and two corresponding centroids, say $y \in C_i^*$ and $z \in C_j^*$, we use a low-cost comparison to check whether $d(x,y) < d(x,z)$. Fix a subset of $C' \le C$ centroids in each of the k clusters. Let $W_{ij}$ denote the number of "wins" of $C_i^*$ vs. $C_j^*$, that is, the number of pairs of fixed centroids $(y,z)$ with $y \in C_i^*$ and $z \in C_j^*$ such that the low-cost comparison '$d(x,z) > d(x,y)$' is correct (i.e., the strict inequality holds according to Corollary 3). Since $W_{ij}$ is the sum of $N = (C')^2$ Bernoulli random variables $X_e$ with $\Pr[X_e = 1] \ge 1-p$, standard Chernoff bounds imply the following:
Lemma 14.
For $C_i^*$ being the correct cluster of x and any other cluster $C_j^*$,
$\Pr\left[W_{ij} \le N/2\right] \le e^{-N f_p},$
where $N = (C')^2$, and $f_p$ is a constant depending only on $p \in (0, 1/2)$.
Proof. 
For $\mu = \mathbb{E}[W_{ij}] \ge (1-p)N$ and for $\delta = 1 - \frac{1}{2(1-p)}$, we have (note that $\delta > 0$ since $p < 1/2$)
$\Pr\left[W_{ij} \le N/2\right] = \Pr\left[W_{ij} \le (1-\delta)(1-p)N\right] \le \Pr\left[W_{ij} \le (1-\delta)\mu\right] \le e^{-\frac{\mu\delta^2}{2}} \le e^{-\frac{(1-p)N\delta^2}{2}},$
where the first and third inequalities follow from $\mu \ge (1-p)N$, and the second inequality is the Chernoff bound. Finally, note that
$(1-p)\,\delta^2 = (1-p)\left(1 - \frac{1}{2(1-p)}\right)^2 = \frac{(1-2p)^2}{4(1-p)} = \frac{(2\epsilon)^2}{4(1/2+\epsilon)} = \frac{\epsilon^2}{1/2+\epsilon}$
for $p = 1/2 - \epsilon$. Thus, $f_p = \frac{\epsilon^2}{1+2\epsilon}$. □
By the union bound, the probability that the correct cluster $C_i^*$ of x wins against all other clusters is at least $1 - k\, e^{-N f_p}$. In particular, since in the previous step we have chosen $C = \Omega(\log n)$, we can choose a suitable $C' = \Theta(\sqrt{C}) = \Theta(\sqrt{\log n})$ such that $N = (C')^2 \ge \frac{3\log n}{f_p} = \Omega(\log n)$, and the failure probability is bounded as $k\, e^{-N f_p} \le 1/n^2$. In order to find the optimal cluster for x, we use a simple "tournament" in which we compare the current best cluster with the next one that we have not yet considered. Comparing two clusters costs $O(N) = O(C)$ comparisons, and thus overall we use only $O(k\,(C')^2)$ low-cost operations for every x. Putting things together, we have the following:
Observation 7.
In Step 3, with probability at least $1 - \frac{1}{n}$, all points are assigned to their optimal clusters using only $O(k n C) = O(k n\log n)$ low-cost operations.
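A Python sketch of the Step-3 tournament for a single point x. The noisy comparator comp_L is an assumed interface for the low-cost oracle restricted to queries of the form d(x, y) < d(x, z), and centroids_per_cluster holds the C′ centroids fixed in each cluster; partially applied over comp_L, this routine can play the role of majority_vote in the outline sketched earlier.

```python
def majority_vote_assign(x, centroids_per_cluster, comp_L):
    """Assign x to a cluster via a tournament of pairwise majority votes.

    centroids_per_cluster -- k lists, each with the C' centroids fixed in
                             the corresponding cluster
    comp_L(x, y, z)       -- noisy low-cost comparison; True iff it reports
                             d(x, y) < d(x, z)
    """
    def beats(i, j):
        # W_ij: number of centroid pairs (y, z), y from cluster i and z from
        # cluster j, for which the noisy comparison favours cluster i.
        wins = sum(comp_L(x, y, z)
                   for y in centroids_per_cluster[i]
                   for z in centroids_per_cluster[j])
        N = len(centroids_per_cluster[i]) * len(centroids_per_cluster[j])
        return wins > N / 2

    best = 0
    for challenger in range(1, len(centroids_per_cluster)):
        if not beats(best, challenger):
            best = challenger
    return best
```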

Putting Things Together (Proof of Theorem 5)

The number of SCQs and the high-low cost complexity of the three steps are as follows:
Step 1: $O\!\left(\frac{k\log k}{p_1}\right)$ exact SCQs only
Step 2: $O\left\langle \frac{(\log n)^2}{p_1},\ \frac{\log n}{p_1}\, k\log k\right\rangle$
Step 3: $O\left\langle 0,\ k n \log n\right\rangle$
The overall high-low cost is thus due to the last two steps only, and it is as in (7). Finally, each of the three steps succeeds with probability at least $1 - \frac{1}{n}$, conditioned on the previous step being successful. By the union bound, the overall success probability is at least $1 - \frac{3}{n}$.

4.2.2. Extensions of Theorem 5

In the algorithm presented above, we need to set a parameter $n_1$, and the algorithm then works for all inputs where the smallest cluster has size at least $n_1$. We next observe that this prior knowledge is not needed in Step 1 of the algorithm, and that Step 2 can be modified to work without this assumption on the input. Specifically:
  • Lemma 12 still holds since in Step 1 we simply continue extracting random points until every cluster has at least one point. As already observed, we can safely extract points with replacement as in the CCP problem since extracting the same point more than once has no effect on the stopping condition. Therefore, all upper bounds on the CCP problem also apply (see Lemma 11).
  • In Step 2, we can use high-cost comparisons to assign every point to its correct cluster. Since these operations are always correct, we no longer need $C = \Omega(\log n)$ points per cluster, and can in fact reduce them to $C = \Theta(\sqrt{\log n})$, which is the number we need in Step 3 to have a high probability of success. Selecting $m \ge 2C/p_1$ points guarantees that the probability that not all clusters have at least C points is at most $q(m) = k\exp(-\frac{C}{4}) = k\exp(-\frac{p_1 m}{8})$. For a suitable $m = \Theta(\frac{\log k}{p_1})$, this probability is constant, say $q(m) < 1/3$. Therefore, in expectation, we have to wait only a constant number of "rounds" of m extractions each, meaning that the number of extractions needed before reaching C points in each cluster is $O(m) = O\!\left(\frac{C + \log k}{p_1}\right)$. Since for every point we have to check all clusters, the expected high-low cost of Phase 2 is $O\left\langle \frac{k(\sqrt{\log n} + \log k)}{p_1},\ 0\right\rangle$.
  • Step 3 is as before, and it takes $O\langle 0,\ k n\log n\rangle$ high-low cost operations. Its success probability is at least $1 - \frac{1}{n}$, while the two previous steps always succeed since they use high-cost operations.
This leads to the following extension of Theorem 5:
Theorem 6.
For any γ-perturbation-stable metric space (X, d) which has γ-radius optimal clusters, with γ ≥ 2, the optimal k-clustering can be computed using only the following number of exact same-cluster queries and high-low cost operations, parameterized in $p_1 = \frac{n_1}{n}$, where $n_1$ is the size of the smallest optimal cluster:
  • The algorithm uses $O(k\,C(p)) \le O\!\left(\frac{k\log k}{p_1}\right)$ exact same-cluster queries in expectation, where p depends on the sizes of the optimal clusters (Lemma 11), and C(p) is the corresponding coupon-collector bound (Lemma 12 and Observation 6).
  • The additional high-low cost operations used by the algorithm are
    $O\left\langle \frac{k(\sqrt{\log n} + \log k)}{p_1},\ k n\log n\right\rangle.$
    If the size of the smallest cluster is known, then the high-low cost upper bound in (7) also applies.

5. Conclusions, Extensions, and Open Questions

In this work, we have shown algorithms for clustering that use a suitable combination of high-cost and low-cost comparison (ordinal) operations on the underlying distance (dissimilarity) function of the data points. Some of the results are based on the popular graph-based Single-Linkage algorithm and its enhanced version Single-Linkage++, which has provably optimal performance on γ-perturbation-stable instances [1,2,3]. Our results apply to k-medoids clustering and use comparisons between groups (sums) of distances according to (1). Though this is a rather rich model, and some of these queries may be difficult to realize in practice, some of the results can be extended to simpler queries, as we explain below (in fact, the simpler Single-Linkage algorithm uses very simple queries, and the two algorithms differ only in the edge-removal step, Phase 2). The second set of results we provide are algorithms that make use of a rather limited number of exact same-cluster queries, in addition to the comparison high-low cost operations of our model. For example, for instances where the smallest optimal cluster has size $n_1 = \Omega(\frac{n}{k})$, there is an algorithm using $O(k\log k)$ same-cluster queries, $O(k\log^2 n)$ high-cost operations, and $O(k n\log n)$ low-cost operations (Theorems 5 and 6 provide general trade-offs and also deal with the case of $n_1$ not being known, respectively).
In the remainder of this section, we shall discuss further (possible) extensions of our results, including different query models and different error models that can be considered. We conclude with a discussion on interesting open questions and their relation to the above aspects.

5.1. Query Model

Some of our algorithms use only certain types of queries. In particular, we have the following:
  • While Phase 1 of the algorithm seems already quite expensive in terms of high-cost operations, we note that it makes use of much simpler queries “ d ( x , y ) > d ( z , w ) ?” like in [14]. This is also the case for the Single-Linkage algorithm (Phase 2 naive), which uses very few such high-cost queries (Table 1). While our implementation combines exact (high-cost) and noisy (low-cost) queries, the algorithms in [14] use only noisy queries, though under a different noisy model related to certain planted instances.
  • The first exact implementation of Single-Linkage++ (Phase 2) requires computing/estimating the cost of the clusters given their centroids (3). Instead, the dynamic programming version makes use of the full power of the model. Indeed, our approximation (Phase 2 APX in Table 1) is based on the idea that queries/algorithms that approximately compute the centers (medoids) are enough. Definition 4 and Theorem 4 suggest that a query model/algorithm which allows for approximating the centers, and their costs (3), yields an approximate implementation of Phase 2 of Single-Linkage++.
  • The Coupon–Collector Clustering algorithm using same-cluster queries (Section 4) uses very simple comparison-queries, namely “ d ( x , y ) < d ( x , z ) ” (second and third steps), in addition to the same-cluster queries (first step). The latter can either be available or can be simulated by a richer class of “scalar” queries in (6). As already observed, these are slightly more complex than those in [8,9].
  • The model can be refined by introducing a cost that depends on the set of distances involved, e.g., on how many distances are compared. Whether comparing just two pairwise distances is easier than comparing groups of distances seems to depend on the application at hand; sometimes comparing groups is easier because they provide a richer "context" which helps.

5.2. Error Model

  • Our error model assumes constant (and independent) error probability across all comparisons. Other error models are also possible and they typically incorporate the “distance values” in the error probabilities (very different values are easier to detect correctly than very close ones). Examples of such models can be found in, e.g., [6,9].
  • Different error models may result in different (perhaps better) dislocation bounds for the approximate sorting problem (Lemma 2). This may directly improve some of our bounds, where the maximum dislocation is the bottleneck for finding the minimum (or the top-k elements) with high probability (if the maximum dislocation becomes D ≪ log n, then we need only O(D) ≪ log n high-cost operations for the latter problem, which is used as a subroutine in most of our algorithms).

5.3. Open Questions

The results and the above considerations suggest a number of natural (still open) questions.
  • Reduce the high-cost complexity of Phase 1 of Single-Linkage++. We notice that, in the analysis, the same cluster is considered several times when trying all $N = \binom{n-1}{k-1}$ edge-removals. A possible direction might be to estimate the number of partitions of a given (spanning) tree, though this problem does not seem to have a general closed-form solution [35].
  • Our counterexample shows that a direct improvement of Phase 2 solely based on approximate minimum-spanning tree is not possible. We feel that a finer analysis based on some combinatorial structure of the problem might help.
  • Extend the results to other center-based clustering problems for which the Single-Linkage++ algorithm is optimal [3]. Our scheme based on approximate centers (Definition 4 and Theorem 4) suggests a natural approach in which we simply need queries that approximate the costs.

Author Contributions

Conceptualization, E.B. and P.P.; Formal analysis, E.B. and P.P.; Investigation, E.B. and P.P. Both authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Balcan, M.F.; Blum, A.; Vempala, S. A Discriminative Framework for Clustering via Similarity Functions. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, Victoria, BC, Canada, 17–20 May 2008; Association for Computing Machinery: New York, NY, USA, 2008. STOC ’08. pp. 671–680. [Google Scholar] [CrossRef] [Green Version]
  2. Awasthi, P.; Blum, A.; Sheffet, O. Center-based clustering under perturbation stability. Inf. Process. Lett. 2012, 112, 49–54. [Google Scholar] [CrossRef] [Green Version]
  3. Angelidakis, H.; Makarychev, K.; Makarychev, Y. Algorithms for Stable and Perturbation-Resilient Problems. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), Montreal, PQ, Canada, 19 June–23 June 2017; pp. 438–451. [Google Scholar] [CrossRef] [Green Version]
  4. Roughgarden, T. CS264: Beyond Worst-Case Analysis Lecture# 6: Perturbation-Stable Clustering. 2017. Available online: http://theory.stanford.edu/~tim/w17/l/l6.pdf (accessed on 8 February 2021).
  5. Tamuz, O.; Liu, C.; Belongie, S.J.; Shamir, O.; Kalai, A. Adaptively Learning the Crowd Kernel. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, DC, USA, 28 June–2 July 2011; pp. 673–680. [Google Scholar]
  6. Jain, L.; Jamieson, K.G.; Nowak, R. Finite Sample Prediction and Recovery Bounds for Ordinal Embedding. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 2711–2719. [Google Scholar]
  7. Emamjomeh-Zadeh, E.; Kempe, D. Adaptive hierarchical clustering using ordinal queries. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–10 January 2018; pp. 415–429. [Google Scholar]
  8. Perrot, M.; Esser, P.; Ghoshdastidar, D. Near-optimal comparison based clustering. arXiv 2020, arXiv:2010.03918. [Google Scholar]
  9. Addanki, R.; Galhotra, S.; Saha, B. How to Design Robust Algorithms Using Noisy Comparison Oracles. Available online: https://people.cs.umass.edu/_sainyam/comparison_fullversion.pdf (accessed on 8 February 2021).
  10. Gilyazev, R.; Turdakov, D.Y. Active Learning and Crowdsourcing: A Survey of Optimization Methods for Data Labeling. Program. Comput. Softw. 2018, 44, 476–491. [Google Scholar] [CrossRef]
  11. Geissmann, B.; Leucci, S.; Liu, C.; Penna, P.; Proietti, G. Dual-Mode Greedy Algorithms Can Save Energy. In Proceedings of the 30th International Symposium on Algorithms and Computation (ISAAC), Shanghai, China, 8–11 December 2019; Volume 149, pp. 1–18. [Google Scholar] [CrossRef]
  12. Xu, Y.; Zhang, H.; Miller, K.; Singh, A.; Dubrawski, A. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 2431–2440. [Google Scholar]
  13. Hopkins, M.; Kane, D.; Lovett, S.; Mahajan, G. Noise-tolerant, reliable active classification with comparison queries. arXiv 2020, arXiv:2001.05497. [Google Scholar]
  14. Ghoshdastidar, D.; Perrot, M.; von Luxburg, U. Foundations of Comparison-Based Hierarchical Clustering. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32, pp. 7456–7466. [Google Scholar]
  15. Cohen-addad, V.; Kanade, V.; Mallmann-trenn, F.; Mathieu, C. Hierarchical Clustering: Objective Functions and Algorithms. J. ACM 2019, 66. [Google Scholar] [CrossRef]
  16. Ng, R.; Han, J. CLARANS: A method for clustering objects for spatial data mining. Knowl. Data Eng. IEEE Trans. 2002, 14, 1003–1016. [Google Scholar] [CrossRef] [Green Version]
  17. Park, H.S.; Jun, C.H. A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 2009, 36, 3336–3341. [Google Scholar] [CrossRef]
  18. Schubert, E.; Rousseeuw, P.J. Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications; Springer: Cham, Switzerland, 2019; pp. 171–187. [Google Scholar]
  19. Wang, X.; Wang, X.; Wilkes, D.M. An Efficient K-Medoids Clustering Algorithm for Large Scale Data. In Machine Learning-based Natural Scene Recognition for Mobile Robot Localization in An Unknown Environment; Springer: Cham, Switzerland, 2020; pp. 85–108. [Google Scholar]
  20. Jin, X.; Han, J. K-Medoids Clustering. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2010; pp. 564–565. [Google Scholar] [CrossRef]
  21. Geissmann, B.; Leucci, S.; Liu, C.; Penna, P. Optimal Sorting with Persistent Comparison Errors. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA), Munich/Garching, Germany, 9–11 September 2019. [Google Scholar] [CrossRef]
  22. Kane, D.M.; Lovett, S.; Moran, S.; Zhang, J. Active Classification with Comparison Queries. In Proceedings of the 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 15–17 October 2017; pp. 355–366. [Google Scholar] [CrossRef] [Green Version]
  23. Ukkonen, A. Crowdsourced Correlation Clustering with Relative Distance Comparisons. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 1117–1122. [Google Scholar] [CrossRef] [Green Version]
  24. Ashtiani, H.; Kushagra, S.; Ben-David, S. Clustering with Same-Cluster Queries. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 3216–3224. [Google Scholar]
  25. Ailon, N.; Bhattacharya, A.; Jaiswal, R. Approximate correlation clustering using same-cluster queries. In Latin American Symposium on Theoretical Informatics (LATIN); Springer: Berlin/Heidelberg, Germany, 2018; pp. 14–27. [Google Scholar]
  26. Saha, B.; Subramanian, S. Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA), Munich/Garching, Germany, 9–11 September 2019; Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik: Wadern, Germany, 2019. [Google Scholar]
  27. Bressan, M.; Cesa-Bianchi, N.; Lattanzi, S.; Paudice, A. Exact Recovery of Mangled Clusters with Same-Cluster Queries. arXiv 2020, arXiv:2006.04675. [Google Scholar]
  28. Mazumdar, A.; Saha, B. Clustering with Noisy Queries. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5788–5799. [Google Scholar]
  29. Sanyal, D.; Das, S. On semi-supervised active clustering of stable instances with oracles. Inf. Process. Lett. 2019, 151, 105833. [Google Scholar] [CrossRef]
  30. Chien, I.; Pan, C.; Milenkovic, O. Query k-means clustering and the double dixie cup problem. Adv. Neural Inf. Process. Syst. 2018, 31, 6649–6658. [Google Scholar]
  31. Geissmann, B. Longest Increasing Subsequence Under Persistent Comparison Errors. In Approximation and Online Algorithms; Epstein, L., Erlebach, T., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 259–276. [Google Scholar]
  32. Berenbrink, P.; Sauerwald, T. The Weighted Coupon Collector’s Problem and Applications. In Computing and Combinatorics; Ngo, H.Q., Ed.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 449–458. [Google Scholar]
  33. Doumas, A.V.; Papanicolaou, V.G. The coupon collector’s problem revisited: Generalizing the double Dixie cup problem of Newman and Shepp. ESAIM Probab. Stat. 2016, 20, 367–399. [Google Scholar] [CrossRef] [Green Version]
  34. Newman, D.J.; Shepp, L. The Double Dixie Cup Problem. Am. Math. Mon. 1960, 67, 58–61. [Google Scholar] [CrossRef]
  35. Székely, L.; Wang, H. On subtrees of trees. Adv. Appl. Math. 2005, 34, 138–155. [Google Scholar] [CrossRef] [Green Version]
Table 1. Trade-offs for the various algorithms for γ-perturbation-stable instances (γ ≥ 2) in this work. We distinguish the two phases of the Single-Linkage++ algorithm (and a simpler naive version) and show the corresponding number of high-cost and low-cost comparisons depending on the size n of the dataset X and the number k of clusters. Parameters $d_{\min}(X)$ and $d_{\max}(X)$ denote the minimum and the maximum distance between points in X, respectively.
Algorithm | Approx. | High-Cost | Low-Cost
Single-Linkage (Phase 2 naive) | – | $O(k + \log k \log n)$ | $O(n \log n)$
Single-Linkage++ (Phase 1) | 1 | $O\!\left(\frac{d_{\max}(X)}{(\gamma-2)\, d_{\min}(X)}\, n \log^2 n\right)$ | $O(n^2 \log n)$
Single-Linkage++ (Phase 2) | 1 | $O\!\left(\binom{n-1}{k-1} k \log\frac{n}{k}\right)$ | $O\!\left(\binom{n-1}{k-1} \log n\right)$
Single-Linkage++ (Phase 2 APX) | 2 | $O(k \log n)$ | $O\!\left(\binom{n-1}{k-1} \log n\right)$
Single-Linkage++ (Phase 2 DP) | 1 | $O(k n^2 \log n)$ | $O(k^2 n^4 \log n)$