Article
Peer-Review Record

Deterministic Coresets for k-Means of Big Sparse Data†

Algorithms 2020, 13(4), 92; https://doi.org/10.3390/a13040092
by Artem Barger ‡ and Dan Feldman *,‡
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 3 January 2020 / Revised: 15 March 2020 / Accepted: 1 April 2020 / Published: 14 April 2020
(This article belongs to the Special Issue Big Data Algorithmics)

Round 1

Reviewer 1 Report

The paper gives a new coreset for k-means clustering in Euclidean space. The main novelty of the result is that the construction is sensitive to sparse inputs (i.e., the input points are high-dimensional but each contains only a small number of non-zero coordinates), and that its size is independent of both the size and the dimension of the dataset. Streaming and distributed constructions of coresets are also given, via a standard merge-and-reduce approach.

The construction is based on a novel observation: an O(1)-approximate solution of m-means for m = k^{1/poly(epsilon)} is a (k, epsilon)-coreset. This means one could apply an existing efficient heuristic, such as k-means++, with a slightly larger m to find the coreset.
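To make the observation concrete, the following minimal sketch is my own illustration, not the authors' implementation: the helper name candidate_coreset, the form m = ceil(k^(1/eps)) standing in for k^{1/poly(epsilon)}, and the use of scikit-learn's randomized kmeans_plusplus seeding in place of the paper's deterministic O(1)-approximation are all assumptions made only for the demonstration.

```python
# Sketch of the reviewed idea (illustrative assumptions only, not the paper's algorithm):
# pick an enlarged set of m input points via k-means++ seeding and weight each chosen
# point by the number of input points it represents.
import numpy as np
from sklearn.cluster import kmeans_plusplus

def candidate_coreset(X, k, eps, seed=0):
    # Enlarged number of centers; cannot exceed the number of input points.
    m = min(len(X), int(np.ceil(k ** (1.0 / eps))))
    # k-means++ seeding selects actual input points, so sparsity would be preserved.
    centers, idx = kmeans_plusplus(X, n_clusters=m, random_state=seed)
    # Assign every input point to its nearest chosen point (dense X assumed here).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Weight of a chosen point = size of the cluster it represents.
    weights = np.bincount(labels, minlength=m)
    return X[idx], weights
```

For example, with k = 10 and eps = 0.5 this sketch keeps m = 100 weighted input points.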

This method has many advantages compared with existing ones. For instance, the new method may be implemented as a deterministic algorithm, and could be made to preserve the sparsity of the input. Finally, the practical efficiency of the algorithm is verified by experiments, and the (open-source) implementation works with a commercial cloud service (i.e., Amazon).

Detailed comments:

- The paper claims it gives the first coreset of size independent of both n and d. It seems to me that a recent result, "Oblivious Dimension Reduction for k-Means: Beyond Subspaces and the Johnson-Lindenstrauss Lemma", also gives a coreset of size independent of n and d. Please cite that paper and compare your result with it.
- Inconsistent subsection title style: "1.2 Sparse Big Data." vs. "1.3 Coresets" (there is a full stop after "Data"). Please make a pass over the whole paper to ensure consistency.
- Sometimes you use k-mean and sometimes k-means. Please consistently use only one of them.
- In many places you simply wrote that the size of the coreset is k^{O(1)}. I suggest not hiding the dependence on epsilon, because otherwise people may think the coreset has size independent of epsilon.
- Line 136: please rephrase - the word "embarrassingly" looks strange.
- Lines 176-177: what is "1.7em"?
- Line 179: I think i \in [m] should be i \in [k]?
- Line 186: "S' is small set R^d" -> "S' is a small subset of R^d".
- Line 199: in the statement of Theorem 1, please add the time complexity.
- The title of Section 7 seems confusing. It is not about why the coreset works; it is about in which sense it outperforms existing ones. Maybe "Comparison to Existing Approaches"? Also, you only discussed the advantages of your coreset in Section 7, but please also try to justify/discuss the exponential dependence on 1/eps.
- Lines 311-312: "seems to be suffices" -> "seems to be sufficient".
- Line 372: "we" -> "We".
- Line 425: "especially for small coresets" - what does this mean? Did you mean when the target size of the coreset is small? Please rephrase.
- Line 430: "can" -> "Can".

Author Response

Thank you very much for your detailed review and very fruitful comments. We have addressed and accepted the suggestions you mentioned, and we sincerely appreciate you pointing them out to us.

The paper you mentioned and asked us to compare with, "Oblivious Dimension Reduction for k-Means: Beyond Subspaces and the Johnson-Lindenstrauss Lemma", improves on results shown by Dan Feldman, Melanie Schmidt, and Christian Sohler, later also improved by M. Cohen et al. However, these are still techniques that use a random projection and therefore yield non-deterministic results. Moreover, since they rely on projections, the resulting coreset is not a subset of the input and does not preserve sparsity. Our approach provides a deterministic construction of the k-means coreset that preserves sparsity and yields a coreset that is a subset of the input. We have emphasized this difference in the revised version by citing the aforementioned article and providing a comparison that outlines the distinction.

 

 

Thanks,

            Danny Feldman & Artem Barger.

Reviewer 2 Report

This paper considers coresets for the k-means problem. State-of-the-art research asserts that coresets of size O(k * poly(eps^-1, log k)) exist. These algorithms typically feature heavy use of randomization and dimension reduction. As far as I can tell, this paper considers the problem of constructing coresets of size k^{O(1)} deterministically while retaining input sparsity.

 

First, I like the problem. Finding deterministic algorithms is well motivated, in my opinion, especially considering that randomized algorithms currently far outperform what we can do deterministically.

 

I found the algorithm itself simple, and the proof clear. I do take some issue with how related work is presented and how the results of this paper are sold. In my opinion, the fact that the coreset construction, while independent of n and d, is exponential in eps should be mentioned in the introduction. As it is, this almost amounts to overselling. The paper also asserts that dimension reduction methods do not retain sparsity. This is false if one uses, for instance, column selection methods. Furthermore, considering the size of the coreset, having dense O(k)-dimensional point vectors in the presence of coresets of size O(k^(1/eps^2)) is really a minor improvement.

Furthermore, the paper fails to cite recent papers by Makarychev, Makarychev, and Razenshteyn and by Becchetti, Bury, Cohen-Addad, Grandoni, and Schwiegelshohn that show that a dimension reduction onto log k dimensions exists. The reason these should be cited is that, using log k dimensions, the bound of O(k^(1/eps^2)) can be recovered by running a conventional algorithm with O(k*eps^-d) dimensions. Moreover, using this algorithm also yields the result for k-median, and not only for k-means. The only drawback is that the dimension reduction has to be randomized, and although derandomizations of the Johnson-Lindenstrauss lemma exist, it is not clear how to combine them with the MMR proof.

As such, I will focus on the deterministic aspect, rather than the sparsity. For the deterministic coreset constructions, the authors should highlight previous work more. In particular, the authors fail to cite the paper by Har-Peled and Kushal. Moreover, the authors also seem to be unaware of the fact that the paper by Feldman, Schmidt, and Sohler contains a deterministic construction for k-means of size O(k^(eps^-2 log 1/eps)), which is surprising, given that Dan Feldman is an author.

 

Lastly, let me examine the algorithm more. Fundamentally, the idea is to simply use additional centers, hoping that an O(m)-clustering has significantly less cost than an O(k)-clustering. In practice, one would always expect this to be the case, and this is furthermore supported by various previous papers that evaluate coreset constructions empirically. Again, this could have been covered by the authors as well. As an example, I will point the authors to the streamkmeans++ algorithm and BICO.

Summarizing, I believe this paper could be accepted with additional revision. In particular, the issues with related work should be addressed. I find the analysis, while simple, interesting enough in its own right, but this does not excuse the lacking treatment of related work. For a more glowing recommendation, I would encourage the authors to compare their work with a few more relevant competitors. As it stands, I would recommend acceptance, but not particularly enthusiastically.

Author Response

We really appreciate your comments and the review effort. We have significantly revised our previous submission and addressed your comments and suggestions with respect to relevant related work and a comparison with prior art on k-means coresets.

 

Additionally, we would like to add a few clarifications to the comments we received from you. Please see below:

 

> In my opinion, the fact that the coreset construction, while independent of n and d, is exponential in eps should be mentioned in the introduction.

 

We added a citation and comparison with the stated paper; thanks for the suggestion!

 

> As it is this almost amounts to overselling. The paper also asserts that dimension reduction methods do not retain sparsity. This is false, if one uses, for instance, column selection methods. 

Indeed, we revisited our submission and tried to improve that particular point. We meant that they do not necessarily preserve sparsity, which is indeed the case for most of them. The column selection algorithms that we know of are either randomized, depend on d, depend on n, do not support arbitrary k centers, or usually several of the above.
In any case, sparsity is only one of many applications that we mentioned for having a coreset that is a subset of the input. Dimension reduction techniques, including column selection, return small subsets of columns, not of rows (input points). Nevertheless, our technique might be applied on top of these projections.


> Furthermore, considering the size of the coreset, having dense O(k)-dimensional point vectors in the presence of coresets of size O(k^(1/eps^2)) is really a minor improvement.

For small epsilon this may indeed be correct, but our result is new even for eps=1/2, where k^{1/eps^2}=k^4. Moreover, dimension-reduction techniques such as that of Feldman, Schmidt, and Sohler project the points onto a k-dimensional subspace that may be contained in a high d-dimensional space. We do not see how one can compute the desired centers in the original space without also storing the d-by-k projection matrix, whose size depends on d.

Finally, since the coreset is independent of d, it can support points in "infinite dimension" in some sense (such as continuous functions, or Bregman divergences), where even a small dependency on d is prohibitive.


> Furthermore, the paper fails to cite recent papers by Makarychev, Makarychev, and Razenshteyn and by Becchetti, Bury, Cohen-Addad, Grandoni, and Schwiegelshohn that show that a dimension reduction onto log k dimensions exists. The reason these should be cited is that, using log k dimensions, the bound of O(k^(1/eps^2)) can be recovered by running a conventional algorithm with O(k*eps^-d) dimensions.

These papers suggest sketches for dimension reduction (which also do not preserve sparsity), and not coresets (subsets). Moreover, their sketches are only for the optimal clustering, while our ("strong") coreset is for any clustering, and their sketch size depends on n. Furthermore, our construction is deterministic, while the constructions mentioned in the suggested papers are randomized. In addition, the sizes of these sketches are much larger than 1/eps^2, and also depend on n and k, so d will not simply be replaced by 1/eps^2. And lastly, the result will not be a subset of the input, will not approximate every clustering, and may fail with some probability, as explained above.

 

> As such, I will focus on the deterministic aspect, rather than the sparsity. For the deterministic coreset constructions, the authors should highlight previous work more. In particular, the authors fail to cite the paper by Har-Peled and Kushal.

Thanks for the suggestion; we indeed overlooked mentioning the work of Har-Peled and Kushal, which we have now fixed. However, we would like to note that the paper by Har-Peled and Kushal, unlike the paper of Har-Peled and Mazumdar that we cite, does not return a subset of the input points.

> Lastly, let me examine the algorithm more. Fundamentally, the idea is to simply use additional centers, hoping that an O(m)-clustering has significantly less cost than an O(k)-clustering. In practice, one would always expect this to be the case, and this is furthermore supported by various previous papers that evaluate coreset constructions empirically. Again, this could have been covered by the authors as well. As an example, I will point the authors to the streamkmeans++ algorithm and BICO.

Thanks; based on your suggestion, we added an appropriate explanation to the Algorithm Overview section to explain the rationale behind the idea and its motivation. Indeed, our experimental results show that this idea works in practice much better than the theory predicts. However, unlike previous heuristics, our technique has provable upper bounds on the required number of centers.
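As a small self-contained illustration of this rationale (a sketch included here for the reviewer's convenience, not the experiment reported in the paper; the synthetic dataset and the values of k and m below are arbitrary assumptions), one can compare the cost of an m-means solution, for m much larger than k, with the cost of a k-means solution:

```python
# Illustrative check of the cost argument (assumed data and parameters, not the paper's experiment).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic, clusterable data; all values are arbitrary choices for the demonstration.
X, _ = make_blobs(n_samples=5000, centers=50, n_features=20, random_state=0)
k, m = 10, 500

cost_k = KMeans(n_clusters=k, n_init=3, random_state=0).fit(X).inertia_
cost_m = KMeans(n_clusters=m, n_init=3, random_state=0).fit(X).inertia_
# On clusterable data this ratio is typically far below 1; that slack is exactly
# what makes the additional centers useful as coreset representatives.
print(f"m-means cost / k-means cost = {cost_m / cost_k:.3f}")
```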


We sincerely appreciate the effort you put into reviewing our work and providing such deep and fruitful comments. We hope that in the new revision we have managed to address most, if not all, of your concerns.

 

With best regards,

                        Artem Barger and Danny Feldman.
