Peer-Review Record

Determining the Quality of a Dataset in Clustering Terms

Appl. Sci. 2023, 13(5), 2942; https://doi.org/10.3390/app13052942
by Alicja Rachwał 1,*, Emilia Popławska 2,*, Izolda Gorgol 2, Tomasz Cieplak 3, Damian Pliszczuk 4, Łukasz Skowron 3 and Tomasz Rymarczyk 4,5
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5: Anonymous
Submission received: 20 December 2022 / Revised: 9 February 2023 / Accepted: 23 February 2023 / Published: 24 February 2023
(This article belongs to the Special Issue Data Science, Statistics and Visualization)

Round 1

Reviewer 1 Report

The paper is well done and prepared. 

Author Response

Dear Reviewer, thank you very much for your positive opinion of the manuscript. We are very happy that you find our article suitable for publication.

Reviewer 2 Report

The authors should be clearer in the abstract, and especially in their conclusions: a reader slightly outside the area will have difficulty understanding what the paper is about and, in particular, what conclusions follow from the results. Other than that, the authors did their best to introduce the methods. Therefore, I suggest changes to the discussion section that emphasize the advantages and disadvantages of each method, even if they may be a bit redundant. The authors should also check for typing errors.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

This paper designs a framework for clustering mixed datasets containing categorical and continuous attributes. The framework preprocesses the input data by converting the numerical part into a categorical one. It then embeds all categorical variables using an autoencoder and passes the embeddings through PCA to obtain real-valued vectors. The framework then builds a similarity matrix using the cosine similarity metric. Finally, it uses several clustering algorithms for the segmentation task, including k-means, DBSCAN, Louvain Community, Greedy Modularity, and Label Propagation. The experiment was conducted on two reseller and customer datasets. It uses several evaluation metrics, such as Calinski-Harabasz, Davies-Bouldin, NMI, Fowlkes-Mallows, and Silhouette.
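
As an illustration only, two of the steps summarized above (converting numerical attributes to categories, and building a cosine similarity matrix) might look like the following minimal sketch. Everything here is an assumption for illustration: the function names, the equal-width binning rule, and the toy vectors are hypothetical and are not taken from the manuscript, and the autoencoder and PCA steps are omitted entirely.

```python
import math

def bin_equal_width(values, n_bins):
    """Map each numeric value to a categorical label such as 'bin_0'.

    Equal-width binning is an assumed discretization rule, not the paper's.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        labels.append(f"bin_{idx}")
    return labels

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities between all rows of `vectors`."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Hypothetical stand-ins for the low-dimensional vectors produced after PCA.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
toy_matrix = similarity_matrix(toy_vectors)
toy_labels = bin_equal_width([10, 20, 30, 40], n_bins=2)
```

In such a sketch the resulting matrix could serve either as input to a distance-based method or as edge weights for the graph-based algorithms; how the manuscript actually does this is described in the paper itself, not here.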

In general, the topic and methodology proposed in the paper are interesting to me. However, the writing style and the structure of the paper need to be improved to reach the level required for acceptance. Several questions need to be clarified to improve the paper's quality further.

- The Abstract needs to be revised to summarize the paper's objective and highlight what the paper has proposed and its contributions/achievements. The sentence "One of the example datasets is not subject to clustering" at the end of the Abstract does not make sense.

- The Introduction needs more information. The current version is not adequate for a scientific paper.

+ First, introduce unsupervised learning and clustering. Why did you focus on mixed data?

+ Give some real-life applications of the proposed framework.

+ The authors should give some information on several other clustering methods. I have used k-means, DBSCAN, and HDBSCAN, an improved version of DBSCAN that allows clusters of varying density instead of using a global epsilon distance as in DBSCAN. I observed that k-means, DBSCAN, and HDBSCAN could perform very well in the clustering task. In some cases, DBSCAN and HDBSCAN were even better than k-means since they can remove noise. I also used the Python version of HDBSCAN. It is an efficient algorithm in terms of runtime and can work well with high-dimensional data. Thus, the authors should also summarize and compare the strengths and weaknesses of HDBSCAN [https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html]. In addition, discuss possible methods that can perform clustering for an unknown number of clusters and provide high interpretability, such as hierarchical clustering; the authors can refer to [https://doi.org/10.1007/978-981-15-1209-4_1] in the discussion. From that, highlight the main advantages of the proposed method over other methods.

+ Highlight the major contributions of the paper.

- In section 2, put a table of main notations used in the paper.

- In section 2.2.11, theoretically discuss the complexity of the proposed framework.

- Make Figure 1 clearer, with copyable text, and make it zoomable.

- In section 3, the authors should use boldface fonts to highlight the best performance for each evaluation metric in Tables 1-4.

- In section 4, the authors should discuss some questions like:

+ How does the problem of missing values in mixed datasets affect clustering performance?

+ There are different ways to treat mixed datasets; one possibility is to treat continuous and categorical attributes separately. In addition, you can also convert categorical attributes into continuous form and build the distance metric on continuous variables.

For those questions, you can refer to the paper "Clustering mixed numerical and categorical data with missing values" in the discussion.
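
One common way to treat continuous and categorical attributes separately, while tolerating missing values, is a Gower-style distance. The sketch below is an illustration of that general idea under stated assumptions: it is not the method of the manuscript or of the paper cited above, and the function name, the use of `None` for missing values, and the toy records are hypothetical.

```python
def gower_distance(x, y, is_categorical, ranges):
    """Gower-style distance between two records; None marks a missing value.

    Continuous attributes contribute |x - y| / range, categorical attributes
    contribute 0 (match) or 1 (mismatch). Attributes missing in either record
    are skipped, and the result is averaged over the attributes actually used.
    """
    total, used = 0.0, 0
    for xi, yi, cat, rng in zip(x, y, is_categorical, ranges):
        if xi is None or yi is None:
            continue  # skip attributes missing in either record
        if cat:
            total += 0.0 if xi == yi else 1.0
        else:
            total += abs(xi - yi) / rng if rng else 0.0
        used += 1
    if used == 0:
        raise ValueError("no attribute is observed in both records")
    return total / used

# Hypothetical records: (age, income, colour); income is missing in record_b.
is_categorical = (False, False, True)
ranges = (50.0, 100000.0, None)  # assumed value ranges of the continuous attributes
record_a = (30, 50000.0, "red")
record_b = (40, None, "blue")
d_ab = gower_distance(record_a, record_b, is_categorical, ranges)
```

Here the age attribute contributes 10 / 50 and the colour attribute contributes 1, averaged over the two attributes observed in both records. Simply skipping missing attributes is only one possible choice; imputation is another.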

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

In this manuscript, the authors presented the clustering methodology and its quality evaluation. The authors comprehensively explained the theory of the k-means method, DBSCAN, three graph algorithms, and the methods used to determine the clustering quality. Clustering is performed on two datasets; one dataset is reasonably amenable to clustering, while the other is not. The results and the clustering quality are shown and compared.

Revision is required before publication. A list of specific comments is as follows:

1. The introduction needs to be improved to explain the importance and novelty of the research.

2. The results section can be improved. Some results are not reported in the tables, such as the results from the k-means method and the NMI and Fowlkes-Mallows index values. The clustering results, such as group numbers and customer numbers, would be more straightforward if presented in a table rather than explained in the text.

3. There should be a conclusion section at the end of the manuscript that provides a comprehensive summary of the work.

4. The order of the references needs to be corrected. Ensure the formats are consistent.

5. The resolution of the figures needs to be improved.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report

Please see the attached document. 

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 3 Report

I have checked this revision. The authors have improved the paper's contents significantly. Thus, I vote for acceptance of the paper for publication.

Reviewer 4 Report

The authors have addressed the comments, and this revision has significantly improved the manuscript. I would recommend the publication of this manuscript in its present form. 

Reviewer 5 Report

N/A
