Optimal Clustering in Stable Instances Using Combinations of Exact and Noisy Ordinal Queries
Abstract
1. Introduction
- Using only ordinal information;
- Dealing with noisy data;
- Allowing expensive operations to remove errors.
Which guarantees are still possible under such a model?
What trade-offs between expensive and non-expensive (noisy) operations still allow for finding optimal solutions?
1.1. Our Contribution
1.1.1. Our Model
- Using only ordinal information. Distances or dissimilarities can only be compared and not directly evaluated. One example of such queries is [5,6,7]: “Which one between y and z is more dissimilar to x?” Concretely, this means that we can compare d(x, y) with d(x, z) for some dissimilarity function d. In our model, we allow for comparing arbitrary groups (sums) of pairwise distances.
- Dealing with noisy results. Comparisons are in general noisy, and the resulting errors are persistent. Due to measurement noise, a comparison may report the wrong answer with probability bounded by some error probability p, and repeating the same measurement would lead to the same answer [7,8,9]. (In order to import some results from the literature, we shall typically assume that p is a sufficiently small constant. All our results automatically extend to larger p if these prior results in the literature also do.)
- Allowing expensive operations to remove errors. Errors can be fixed by performing some kind of expensive comparison. In this case, we know the answer is correct, but we would like to limit the total number of such expensive operations (see, e.g., [10] for a survey on various practical methods). A minimal sketch of this query model is given after this list.
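The following sketch makes the query model concrete under stated assumptions: a low-cost comparison that errs with probability at most p and caches its answer (so that errors are persistent), and a high-cost comparison that is always correct and counted separately. The dissimilarity function `d`, the error bound `p`, and all names are illustrative, not part of the paper's formal model.

```python
import random

class ComparisonOracle:
    """Sketch: noisy low-cost comparisons with persistent errors,
    plus exact high-cost comparisons (counted separately)."""

    def __init__(self, d, p=0.05, seed=0):
        self.d = d            # dissimilarity function on points
        self.p = p            # error probability of a low-cost query
        self.rng = random.Random(seed)
        self.cache = {}       # persistent answers: repeating a query
                              # returns the same (possibly wrong) result
        self.low_cost = 0
        self.high_cost = 0

    def low(self, pairs_a, pairs_b):
        """Noisy query: is the sum of d over pairs_a <= the sum over pairs_b?
        Groups (sums) of pairwise distances may be compared."""
        key = (tuple(sorted(pairs_a)), tuple(sorted(pairs_b)))
        if key not in self.cache:
            truth = sum(self.d(x, y) for x, y in pairs_a) \
                 <= sum(self.d(x, y) for x, y in pairs_b)
            wrong = self.rng.random() < self.p
            self.cache[key] = (truth != wrong)   # flip the answer on error
        self.low_cost += 1
        return self.cache[key]

    def high(self, pairs_a, pairs_b):
        """Exact (expensive) version of the same query."""
        self.high_cost += 1
        return sum(self.d(x, y) for x, y in pairs_a) \
            <= sum(self.d(x, y) for x, y in pairs_b)
```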
1.1.2. Algorithms and Bounds for k-Medoids
- In Section 3, we investigate variants of the popular Single-Linkage algorithm and its enhanced version Single-Linkage++, analyzed in [1,2,3]. (Following [4], we call Single-Linkage the simpler algorithm, which is often used in practice, and Single-Linkage++ the enhanced version with provable guarantees [1,2,3].) This algorithm consists of two distinct phases: computing a minimum spanning tree and removing a suitable set of edges to obtain a clustering; a schematic implementation of the two phases is sketched at the end of this list. A naive implementation, using only high-cost comparisons, would require a large number of such operations in each of the two phases. The trade-offs are summarized in Table 1, where we also consider a naive (simpler) version of the algorithm with no approximation guarantee (this serves to illustrate some key ideas and develop some necessary tools). All other variants are either exact or guarantee a 2-approximation in the worst case. At this point, some comments are in order:
  - The overall algorithm consists of a combination of Phase 1 and Phase 2, and thus the overall complexity is given by the sum of these two. The total number of operations (also accounting for internal computations) is comparable to the sum of the high-cost and low-cost operations.
  - Phase 1 improves over the naive implementation (Remark 3) under some additional assumptions on the data (in particular, a larger stability parameter helps, as well as a small radius, i.e., the ratio between the largest and the smallest distance between two points).
  - The naive algorithm (Single-Linkage) assumes that Phase 1 has already been solved and implements a simpler Phase 2 (the complexity bounds refer only to the latter). This algorithm is in fact a widely used heuristic with additional theoretical guarantees for hierarchical clustering (see, e.g., [14,15]).
  - Phase 2 can be implemented using very few high-cost operations if we are content with a 2-approximate solution. Exact solutions use a larger number of high-cost operations. Though the dynamic programming (DP) approach has a better theoretical performance for large k, the other algorithm is much simpler to implement (for example, it does not require the memory used to store the DP table).
We remark that the best combination of Phase 1 and Phase 2 depends on k and, in some cases, on additional properties of the input data X.
- In Section 4, we show that, under additional assumptions on the input data and a slightly more powerful comparison model, it is possible to implement exact or approximate same-cluster queries (see Section 4.1 and Lemma 10 therein).
- Since same-cluster queries may require some additional power, in Section 4.2 we provide algorithms that use few same-cluster queries in combination with our original (low-cost and high-cost) comparison operations. The obtained bounds are summarized in Theorems 5 and 6. Intuitively speaking, the ability to perform “few” exact same-cluster queries allows us to reduce the number of high-cost operations significantly, at least for certain instances:
  - When the optimal solution has approximately balanced clusters, few same-cluster queries are enough, and few additional high-cost comparisons are needed. Both bounds scale with the cluster “unbalance”, defined in terms of the smallest cluster size (slightly better bounds for the number of same-cluster queries hold).
  - The additional assumption on the input data is only required to implement these few same-cluster queries directly within our model. The result still applies in general if these same-cluster queries can be implemented in a different, perhaps quite expensive, fashion.
  - The aforementioned condition to simulate same-cluster queries essentially requires that, in the optimal solution, points in the same cluster have a distance of at most some factor times the minimum distance. For a larger factor, this condition becomes less stringent, though of course we then require a larger stability coefficient.
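To fix ideas, here is a schematic sketch of the two phases of Single-Linkage++ as described above, assuming direct access to the dissimilarity d: Phase 1 computes a minimum spanning tree, and Phase 2 removes k - 1 edges so that the resulting k components minimize the k-medoids cost. The brute-force edge removal and all helper names are illustrative only; the algorithms in Section 3 implement these phases through (noisy or exact) comparison queries instead.

```python
from itertools import combinations

def mst_edges(points, d):
    """Phase 1 (sketch): Prim's algorithm on the complete graph."""
    in_tree, edges = {points[0]}, []
    while len(in_tree) < len(points):
        u, v = min(((a, b) for a in in_tree for b in points if b not in in_tree),
                   key=lambda e: d(*e))
        in_tree.add(v)
        edges.append((u, v))
    return edges

def components(points, kept_edges):
    """Connected components of the forest (simple union-find)."""
    parent = {p: p for p in points}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v in kept_edges:
        parent[find(u)] = find(v)
    groups = {}
    for p in points:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())

def kmedoids_cost(clusters, d):
    """Cost of a clustering: for each cluster, the best medoid's total distance."""
    return sum(min(sum(d(c, x) for x in cl) for c in cl) for cl in clusters)

def single_linkage_pp(points, d, k):
    """Phase 2 (sketch): try all ways to remove k - 1 MST edges and
    keep the clustering of minimum k-medoids cost."""
    edges = mst_edges(points, d)
    best, best_cost = None, float("inf")
    for removed in combinations(edges, k - 1):
        kept = [e for e in edges if e not in removed]
        clusters = components(points, kept)
        cost = kmedoids_cost(clusters, d)
        if cost < best_cost:
            best, best_cost = clusters, cost
    return best
```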
1.1.3. Techniques and Relation to Prior Work
2. Model and Preliminary Definitions
2.1. Stable Instances
2.2. Comparisons and Errors
2.3. Performance Evaluation
2.4. Two Algorithmic Tools
- The algorithm uses low-cost queries only (and no high-cost query). Each query compares a pair of elements, and these low-cost queries have an error probability p. These comparison errors are persistent.
- The algorithm returns an almost sorted sequence, where the dislocation of each element is at most O(log n) with high probability [21] (the probability depends only on the outcomes of the comparisons); the sketch below shows how such a guarantee is typically exploited.
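For instance, if every element of the returned sequence has dislocation at most D, then the true minimum must sit among the first D + 1 positions, so it can be recovered exactly with D high-cost comparisons. The following sketch, with an assumed dislocation bound `D` and an assumed exact oracle `high`, illustrates this standard use of the tool.

```python
def exact_min(almost_sorted, D, high):
    """Recover the true minimum from an almost sorted sequence.

    If every element's dislocation is at most D, the minimum lies among
    the first D + 1 positions; a linear scan with exact (high-cost)
    comparisons finds it using D high-cost operations.
    `high(a, b)` is an assumed exact oracle returning True iff a <= b.
    """
    window = almost_sorted[:D + 1]
    best = window[0]
    for x in window[1:]:
        if high(x, best):   # exact comparison: x <= best?
            best = x
    return best
```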
3. Clustering in Stable Instances
3.1. Warm-Up: Phase 2 of Single-Linkage Algorithm
- There is a naive approach using high-cost operations only (and no low-cost operation).
- Approximate sorting (Lemma 2) can significantly reduce the total number of high-cost operations (a sketch of this improvement follows the list).
- Approximating the matroid (Lemma 3) would directly improve the above bound further for some values of k. Unfortunately, this leads to a solution which is far more costly than the one returned by the algorithm with exact comparisons (or with the previous methods).
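A hedged sketch of the first improvement: first sort the MST edges with low-cost (noisy) comparisons, obtaining an almost sorted sequence with some dislocation bound D; the true k - 1 heaviest edges then lie within the last (k - 1) + D positions, so an exact selection inside that small window needs high-cost comparisons on only O(k + D) elements. The interfaces `noisy_sort` and `high_weight_leq` below are assumptions, not the paper's exact procedures.

```python
def phase2_first_improvement(edges, k, noisy_sort, D, high_weight_leq):
    """Remove the k-1 heaviest MST edges using few high-cost operations.

    noisy_sort: low-cost approximate sorting (max dislocation <= D).
    high_weight_leq(e, f): exact high-cost test, weight(e) <= weight(f).
    """
    almost = noisy_sort(edges)            # low-cost queries only
    window = list(almost[-(k - 1 + D):])  # the true k-1 heaviest are here
    # Exactly sort the small window in ascending order with high-cost
    # comparisons (exchange sort: O((k + D)^2) high-cost operations).
    for i in range(len(window)):
        for j in range(i + 1, len(window)):
            if not high_weight_leq(window[i], window[j]):
                window[i], window[j] = window[j], window[i]
    return window[-(k - 1):]              # the k-1 heaviest edges, exactly
```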
3.1.1. Naive Approach (All at High Cost)
3.1.2. First Improvement (Combining Low- and High-Cost Operations)
3.1.3. Matroid Approximations Fail
- Distances inside the same group are 0, i.e., for any group and any two points x and y in that group, we have d(x, y) = 0.
- Distances between groups are L, i.e., for any two different groups G and G′, and for x ∈ G and y ∈ G′, we have d(x, y) = L.
- Point v is at distance 1 from the points of one group and at distance L from all other points, i.e., d(v, x) = 1 for x in that group and d(v, y) = L for every other point y.
3.2. Phase 1: Compute a “Good” Spanning Tree
3.3. Phase 2 of Single-Linkage++ (Removing the Edges)
- In order to evaluate the cost of a candidate solution (a set K of edges to remove), we need to compute the centroids of each cluster;
- The number of elements in the associated matroid is large, though we are interested in extracting only one (the optimal set K).
- Consider the corresponding clustering. This can be done by simply inspecting the nodes of each connected component obtained when removing the edges in K from T.
- Run the procedure to compute (some) centroids for this candidate clustering. This gives, for each cluster, a set of edges from its centroid to the other elements of the cluster (a sketch of this centroid computation follows this list).
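A minimal sketch of the centroid (medoid) computation for a candidate clustering, assuming direct access to a dissimilarity d; in the paper's model, the minimization would instead be carried out with (noisy or exact) comparisons of sums of distances, as allowed by the query model.

```python
def medoid(cluster, d):
    """The medoid: the point minimizing the sum of distances
    to the other points of its cluster."""
    return min(cluster, key=lambda c: sum(d(c, x) for x in cluster))

def clustering_cost(clusters, d):
    """k-medoids cost of a candidate clustering: sum, over clusters,
    of the distances from each point to its cluster's medoid."""
    total = 0.0
    for cl in clusters:
        m = medoid(cl, d)
        total += sum(d(m, x) for x in cl)
    return total
```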
3.3.1. Exact Centroids and Exact Solutions
All at High Cost
Using Low-Cost Operations
- Success probability. For each cluster, the probability of finding the optimal centroid is sufficiently high; by the union bound over all k clusters, the whole procedure succeeds with the claimed probability.
- High-cost operations. We bound the total number of high-cost operations needed for computing all the centroids of any given partition of X.
- Low-cost operations. We bound the total number of low-cost operations analogously.
3.3.2. Approximate Centroids and Approximate Solutions
3.4. Dynamic Programming
3.4.1. Basic Notation and Adaptations
3.4.2. The Actual Algorithm
Initialization
The Algorithm
Analysis of High-Low Cost
4. Same Cluster Queries
4.1. Small-Radius and Same-Cluster Queries
Implementing a Same-Cluster Query
4.2. An Algorithm Using Few SCQs
- The algorithm uses a number of exact same-cluster queries that is independent of n for approximately balanced clusters.
- The number of additional high-low cost operations used by the algorithm is bounded accordingly (a sketch of the first step of the algorithm follows this list).
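A minimal sketch of the coupon-collector first step under stated assumptions: points are drawn uniformly at random (with replacement), and an assumed exact `same_cluster(x, y)` oracle groups them, stopping once all k clusters have a representative. The number of draws is governed by the coupon collector bounds discussed in this section.

```python
import random

def collect_representatives(points, k, same_cluster, seed=0):
    """Step 1 (sketch): sample random points and use exact same-cluster
    queries (SCQs) to find one representative per cluster.
    Sampling with replacement is fine: re-drawing a known point
    does not affect the stopping condition."""
    rng = random.Random(seed)
    reps, scq_count = [], 0
    while len(reps) < k:
        x = rng.choice(points)
        is_new = True
        for r in reps:                 # one SCQ per known cluster
            scq_count += 1
            if same_cluster(x, r):
                is_new = False
                break
        if is_new:
            reps.append(x)
    return reps, scq_count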
4.2.1. The Algorithm
Analysis of the First Step
Analysis of the Second Step
Analysis of the Third Step
Putting Things Together (Proof of Theorem 5)
4.2.2. Extensions of Theorem 5
- Lemma 12 still holds since in Step 1 we simply continue extracting random points until every cluster has at least one point. As already observed, we can safely extract points with replacement as in the CCP problem since extracting the same point more than once has no effect on the stopping condition. Therefore, all upper bounds on the CCP problem also apply (see Lemma 11).
- In Step 2, we can use high-cost comparisons to assign every point to its correct cluster. Since these operations are always correct, we no longer need as many points per cluster, and can in fact reduce them to the number needed in Step 3 for a high probability of success. Selecting enough points guarantees that the probability that not all clusters have the required number of points is at most a constant. Therefore, in expectation, we have to wait a constant number of “rounds” of extractions, and the expected number of extractions needed before reaching the required number of points in each cluster remains of the same order. Since for every point we have to check all clusters, the expected high-low cost of Step 2 follows.
- Step 3 is as before. Its success probability is high, while the two previous steps are always successful since we use high-cost operations.
- The algorithm uses a number of exact same-cluster queries in expectation that depends on the size of the optimal clusters (Lemma 11) and on the corresponding coupon collector bound (Lemma 12 and Observation 6).
- The number of additional high-low cost operations used by the algorithm is bounded analogously (a sketch of the assignment step used in Steps 2 and 3 is given below). If the size of the smallest cluster is known, then the high-low cost upper bound in (7) also applies.
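In Steps 2 and 3, each remaining point is assigned to the cluster of its nearest representative via comparison queries of the form “is d(x, y) ≤ d(x, z)?” (cf. Section 5.1). A minimal sketch with an assumed exact (high-cost) comparison oracle `leq`; the noisy variant instead uses several points per cluster to tolerate errors, as described in this section.

```python
def assign_to_nearest_rep(x, reps, leq):
    """Assign x to the cluster of its nearest representative.
    leq(a, b, c) is an assumed comparison oracle answering
    "is d(a, b) <= d(a, c)?"; with exact answers this takes
    exactly len(reps) - 1 comparisons."""
    best = reps[0]
    for r in reps[1:]:
        if leq(x, r, best):   # is r closer to x than the current best?
            best = r
    return best
```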
5. Conclusions, Extensions, and Open Questions
5.1. Query Model
- While Phase 1 of the algorithm seems already quite expensive in terms of high-cost operations, we note that it makes use of much simpler queries (comparing just two pairwise distances), like those in [14]. This is also the case for the Single-Linkage algorithm (Phase 2 naive), which uses very few such high-cost queries (Table 1). While our implementation combines exact (high-cost) and noisy (low-cost) queries, the algorithms in [14] use only noisy queries, though under a different noise model related to certain planted instances.
- The first exact implementation of Single-Linkage++ (Phase 2) requires computing/estimating the cost of the clusters given their centroids (3). Instead, the dynamic programming version makes use of the full power of the model. Indeed, our approximation (Phase 2 APX in Table 1) is based on the idea that queries/algorithms that approximately compute the centers (medoids) are enough. Definition 4 and Theorem 4 suggest that a query model/algorithm which allows for approximating the centers, and their costs (3), yields an approximate implementation of Phase 2 of Single-Linkage++.
- The Coupon-Collector Clustering algorithm using same-cluster queries (Section 4) uses very simple comparison queries, namely “is d(x, y) ≤ d(x, z)?” (second and third steps), in addition to the same-cluster queries (first step). The latter can either be available directly or can be simulated by a richer class of “scalar” queries in (6). As already observed, these are slightly more complex than those in [8,9].
- The model can be refined by introducing a cost that depends on the set of distances involved, e.g., on how many there are. Whether comparing just two pairwise distances is easier than comparing groups of distances seems to depend on the application at hand; sometimes comparing groups is easier because they provide a richer “context” which helps.
5.2. Error Model
- Our error model assumes constant (and independent) error probability across all comparisons. Other error models are also possible and they typically incorporate the “distance values” in the error probabilities (very different values are easier to detect correctly than very close ones). Examples of such models can be found in, e.g., [6,9].
- Different error models may result in different (perhaps better) dislocation bounds for the approximate sorting problem (Lemma 2). This may directly improve some of our bounds, where the maximum dislocation is the bottleneck for finding the minimum (or the top-k elements) with high probability: if the maximum dislocation becomes smaller, then we need fewer high-cost operations for the latter problem, which is used as a subroutine in most of our algorithms.
5.3. Open Questions
- Reduce the high-cost complexity of Phase 1 of Single-Linkage++. We notice that, in the analysis, the same cluster is considered several times when trying all edge removals. A possible direction might be to estimate the number of partitions of a given (spanning) tree, though this problem does not seem to have a general closed-form solution [35].
- Our counterexample shows that a direct improvement of Phase 2 solely based on approximate minimum-spanning tree is not possible. We feel that a finer analysis based on some combinatorial structure of the problem might help.
- Extend the results to other center-based clustering problems for which the Single-Linkage++ algorithm is optimal [3]. Our scheme based on approximate centers (Definition 4 and Theorem 4) suggests a natural approach in which we simply need queries that approximate the costs.
Author Contributions
Funding
Conflicts of Interest
References
- Balcan, M.F.; Blum, A.; Vempala, S. A Discriminative Framework for Clustering via Similarity Functions. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing (STOC '08), Victoria, BC, Canada, 17–20 May 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 671–680.
- Awasthi, P.; Blum, A.; Sheffet, O. Center-based clustering under perturbation stability. Inf. Process. Lett. 2012, 112, 49–54.
- Angelidakis, H.; Makarychev, K.; Makarychev, Y. Algorithms for Stable and Perturbation-Resilient Problems. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), Montreal, QC, Canada, 19–23 June 2017; pp. 438–451.
- Roughgarden, T. CS264: Beyond Worst-Case Analysis, Lecture #6: Perturbation-Stable Clustering. 2017. Available online: http://theory.stanford.edu/~tim/w17/l/l6.pdf (accessed on 8 February 2021).
- Tamuz, O.; Liu, C.; Belongie, S.J.; Shamir, O.; Kalai, A. Adaptively Learning the Crowd Kernel. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, 28 June–2 July 2011; pp. 673–680.
- Jain, L.; Jamieson, K.G.; Nowak, R. Finite Sample Prediction and Recovery Bounds for Ordinal Embedding. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 2711–2719.
- Emamjomeh-Zadeh, E.; Kempe, D. Adaptive hierarchical clustering using ordinal queries. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, LA, USA, 7–10 January 2018; pp. 415–429.
- Perrot, M.; Esser, P.; Ghoshdastidar, D. Near-optimal comparison based clustering. arXiv 2020, arXiv:2010.03918.
- Addanki, R.; Galhotra, S.; Saha, B. How to Design Robust Algorithms Using Noisy Comparison Oracles. Available online: https://people.cs.umass.edu/_sainyam/comparison_fullversion.pdf (accessed on 8 February 2021).
- Gilyazev, R.; Turdakov, D.Y. Active Learning and Crowdsourcing: A Survey of Optimization Methods for Data Labeling. Program. Comput. Softw. 2018, 44, 476–491.
- Geissmann, B.; Leucci, S.; Liu, C.; Penna, P.; Proietti, G. Dual-Mode Greedy Algorithms Can Save Energy. In Proceedings of the 30th International Symposium on Algorithms and Computation (ISAAC), Shanghai, China, 8–11 December 2019; Volume 149, pp. 1–18.
- Xu, Y.; Zhang, H.; Miller, K.; Singh, A.; Dubrawski, A. Noise-tolerant interactive learning using pairwise comparisons. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 2431–2440.
- Hopkins, M.; Kane, D.; Lovett, S.; Mahajan, G. Noise-tolerant, reliable active classification with comparison queries. arXiv 2020, arXiv:2001.05497.
- Ghoshdastidar, D.; Perrot, M.; von Luxburg, U. Foundations of Comparison-Based Hierarchical Clustering. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32, pp. 7456–7466.
- Cohen-Addad, V.; Kanade, V.; Mallmann-Trenn, F.; Mathieu, C. Hierarchical Clustering: Objective Functions and Algorithms. J. ACM 2019, 66.
- Ng, R.; Han, J. CLARANS: A method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 2002, 14, 1003–1016.
- Park, H.S.; Jun, C.H. A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 2009, 36, 3336–3341.
- Schubert, E.; Rousseeuw, P.J. Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. In International Conference on Similarity Search and Applications; Springer: Cham, Switzerland, 2019; pp. 171–187.
- Wang, X.; Wang, X.; Wilkes, D.M. An Efficient K-Medoids Clustering Algorithm for Large Scale Data. In Machine Learning-Based Natural Scene Recognition for Mobile Robot Localization in an Unknown Environment; Springer: Cham, Switzerland, 2020; pp. 85–108.
- Jin, X.; Han, J. K-Medoids Clustering. In Encyclopedia of Machine Learning; Sammut, C., Webb, G.I., Eds.; Springer: Boston, MA, USA, 2010; pp. 564–565.
- Geissmann, B.; Leucci, S.; Liu, C.; Penna, P. Optimal Sorting with Persistent Comparison Errors. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA), Munich/Garching, Germany, 9–11 September 2019.
- Kane, D.M.; Lovett, S.; Moran, S.; Zhang, J. Active Classification with Comparison Queries. In Proceedings of the 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 15–17 October 2017; pp. 355–366.
- Ukkonen, A. Crowdsourced Correlation Clustering with Relative Distance Comparisons. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 1117–1122.
- Ashtiani, H.; Kushagra, S.; Ben-David, S. Clustering with Same-Cluster Queries. In Advances in Neural Information Processing Systems; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 3216–3224.
- Ailon, N.; Bhattacharya, A.; Jaiswal, R. Approximate correlation clustering using same-cluster queries. In Latin American Symposium on Theoretical Informatics (LATIN); Springer: Berlin/Heidelberg, Germany, 2018; pp. 14–27.
- Saha, B.; Subramanian, S. Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA), Munich/Garching, Germany, 9–11 September 2019; Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik: Wadern, Germany, 2019.
- Bressan, M.; Cesa-Bianchi, N.; Lattanzi, S.; Paudice, A. Exact Recovery of Mangled Clusters with Same-Cluster Queries. arXiv 2020, arXiv:2006.04675.
- Mazumdar, A.; Saha, B. Clustering with Noisy Queries. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5788–5799.
- Sanyal, D.; Das, S. On semi-supervised active clustering of stable instances with oracles. Inf. Process. Lett. 2019, 151, 105833.
- Chien, I.; Pan, C.; Milenkovic, O. Query k-means clustering and the double dixie cup problem. Adv. Neural Inf. Process. Syst. 2018, 31, 6649–6658.
- Geissmann, B. Longest Increasing Subsequence Under Persistent Comparison Errors. In Approximation and Online Algorithms; Epstein, L., Erlebach, T., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 259–276.
- Berenbrink, P.; Sauerwald, T. The Weighted Coupon Collector’s Problem and Applications. In Computing and Combinatorics; Ngo, H.Q., Ed.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 449–458.
- Doumas, A.V.; Papanicolaou, V.G. The coupon collector’s problem revisited: Generalizing the double Dixie cup problem of Newman and Shepp. ESAIM Probab. Stat. 2016, 20, 367–399.
- Newman, D.J.; Shepp, L. The Double Dixie Cup Problem. Am. Math. Mon. 1960, 67, 58–61.
- Székely, L.; Wang, H. On subtrees of trees. Adv. Appl. Math. 2005, 34, 138–155.
Table 1. Approximation guarantees and high-/low-cost operations of the considered algorithms.

| Algorithm | Approx. | High-Cost | Low-Cost |
|---|---|---|---|
| Single-Linkage (Phase 2 naive) | ∞ | | |
| Single-Linkage++ (Phase 1) | 1 | | |
| Single-Linkage++ (Phase 2) | 1 | | |
| Single-Linkage++ (Phase 2 APX) | 2 | | |
| Single-Linkage++ (Phase 2 DP) | 1 | | |