Open Access Article

Peer-Review Record

Applying Text Mining, Clustering Analysis, and Latent Dirichlet Allocation Techniques for Topic Classification of Environmental Education Journals

Sustainability 2021, 13(19), 10856; https://doi.org/10.3390/su131910856

by I-Cheng Chang¹, Tai-Kuei Yu², Yu-Jie Chang³ and Tai-Yi Yu⁴,*

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 16 August 2021 / Revised: 19 September 2021 / Accepted: 24 September 2021 / Published: 29 September 2021

(This article belongs to the Section Sustainable Education and Approaches)

Round 1

Reviewer 1 Report

The article covers an interesting topic. Standard methods and tools are used, such as the k-means clustering algorithm and LDA.
The flow chart on page 6 and its description on page 7, lines 270-278 also include methods that are not described in their definition and implementation in the paper. For example, it would be very interesting to understand how the taxonomy was generated, or whether the different activities in the top right block are done sequentially or independently of each other.
It is therefore recommended to go into technical details, with possible problems and solutions identified and adopted.

There are also small errors, such as the numbering of tables and figures, to be corrected.

Author Response

Q1: The flow chart on page 6 and its description on page 7, lines 270-278 also include methods that are not described in their definition and implementation in the paper. For example, it would be very interesting to understand how the taxonomy was generated, or whether the different activities in the top right block are done sequentially or independently of each other. It is therefore recommended to go into technical details, with possible problems and solutions identified and adopted.

A1: Thank you for the comment. Figure 1 has been revised to be clearer than the original.

We added several sentences on lines 270-277.

The upper right corner of Fig. 1 illustrates the tasks performed in this study, which cover four aspects: document clustering, document categorization, association analysis and taxonomy generation. A comprehensive comparison is then made and presented through the post-analysis process. The technical tools and principles applied to each of the four aspects are determined by the input data entering the tasks. The four aspects are not interdependent, but they follow the subordinate dependency shown in Fig. 1.

Q2: There are also small errors, such as the numbering of tables and figures, to be corrected.

A2: Thank you for the comment.

Revised on line 304.

Deleted the redundant words in Table 2.

Revised on line 392.

Adjusted the format of Table 3.

Revised on line 427.

Table 4 presents 13-16 representative keywords.

Reviewer 2 Report

This is a very capable, straightforward application of statistical and unsupervised machine-learning methods to a small to medium-sized topic modeling problem within a narrow linguistic corpus. The connection to Sustainability subsists in the subject of that corpus, which contains articles from academic journals on environmental education.

There are a few things that the authors can do to improve the paper. None of these suggestions stands as an obstacle to publication. I offer these suggestions strictly in the spirit of trying to improve the paper.

I have used both hierarchical and k-means clustering for topic modeling. Indeed, there's a reason these clustering methods are extremely popular in natural language processing (NLP). The literature review can always be a little deeper. Even as it stands, though, the review is adequate.

I would advise, however, a little attention to the exact clustering method. It is not obvious what the combination of hierarchical and k-means clustering consists of, let alone the way the authors implemented this method. Did the authors manually code this method? Or did they find a convenient package that did the heavy lifting for them? A paragraph with a few appropriate references to the computer science literature would be enough.

Even more important than explicit references to the literature might be a few words on the authors' coding methods and environment. Details such as the use (or not) of stop words, vectorizing documents (presumably) as 1-grams, would help others to replicate this research or extend these methods to other corpora.

A brief discussion of the determination of k (number of k-means clusters) and the truncation of hierarchical clustering dendrograms would be very helpful. The typical readership of Sustainability may not appreciate these details. But those versed in NLP and machine learning in general would understand. Some reassurance of a principled approach to these clustering methods would be helpful.

Speaking of dendrograms, it is a little surprising (and I must admit, disappointing) that the authors did not provide these easy and intuitive visualizations of hierarchical clustering. Another visualization method that the authors failed to implement is a two- or three-dimensional representation of their topic space. Viable options include multidimensional scaling, t-SNE, and PCA. All of these methods are readily attainable in Python.

The authors' command of English is impressive, and stronger than my choice to recommend moderate changes might otherwise suggest. There are just enough words and expressions in this highly technical approach to the subject to warrant an appointment with MDPI's English-language editing service. Given the thoughtfulness and efficient execution of this project, a session with a professional language editor should be worth the trouble and (modest) expense.

Congratulations on a fine work of scholarship.

Author Response

Q1: I have used both hierarchical and k-means clustering for topic modeling. Indeed, there's a reason these clustering methods are extremely popular in natural language processing (NLP). The literature review can always be a little deeper. Even as it stands, though, the review is adequate.

A1: Thank you for the comment. K-means is a well-known, classical, effective, and commonly used partitioning clustering algorithm in natural language processing (Jain et al., 2012; Tunali et al., 2016; Chen et al., 2018). This paper applied the hierarchical K-means clustering technique instead of classical K-means clustering, and therefore did not include a detailed literature review of K-means.

Chen, X., Xie, H., Wang, F. L., Liu, Z., Xu, J., & Hao, T. (2018). A bibliometric analysis of natural language processing in medical research. BMC medical informatics and decision making, 18(1), 1-14.

Jain, H. J., Bewoor, M. S., & Patil, S. H. (2012). Context sensitive text summarization using K means clustering algorithm. International Journal of Soft Computing and Engineering, 2(2), 301-304.

Tunali, V., Bilgin, T., & Camurcu, A. (2016). An improved clustering algorithm for text mining: Multi-cluster spherical K-Means. International Arab Journal of Information Technology (IAJIT), 13(1).

Q2: I would advise, however, a little attention to the exact clustering method. It is not obvious what the combination of hierarchical and k-means clustering consists of, let alone the way the authors implemented this method. Did the authors manually code this method? Or did they find a convenient package that did the heavy lifting for them? A paragraph with a few appropriate references to the computer science literature would be enough.

A2: Thank you for the comment. Revised on lines 328-329, section 3-2.

The hierarchical clustering and K-means clustering were carried out with IBM SPSS software at the current stage.

Q3: Even more important than explicit references to the literature might be a few words on the authors' coding methods and environment. Details such as the use (or not) of stop words, vectorizing documents (presumably) as 1-grams, would help others to replicate this research or extend these methods to other corpora.

A3: Thank you for the comment. Fig. 1 shows the detailed text pre-processing, and we revised the sentences on lines 216-217.

Most text data are in unstructured or semi-structured forms, and these words could not be classified or indexed manually, so this study applied a data pre-processing procedure. Data pre-processing (NLTK package in Python, Fig. 1) first extracts, converts, and cleans up the text data.
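
The kind of pre-processing described in A3 (and asked about in Q3) might look roughly like the following minimal Python sketch. It uses only the standard library rather than NLTK, and the stop-word list, the `preprocess` helper, and the sample sentence are illustrative assumptions, not the authors' actual pipeline.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; NLTK ships a much longer one).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def preprocess(text):
    """Lowercase the text, split it into alphabetic 1-gram tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# Hypothetical sample sentence standing in for a journal abstract.
abstract = "The students are learning about the environment and the effects of climate change."
tokens = preprocess(abstract)

# Raw term counts, the starting point for a document-term matrix.
counts = Counter(tokens)
```

Stating whether stop words were removed and which n-gram size was used, as in this sketch, is the level of detail that would let other researchers replicate the vectorization step.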

Q4: A brief discussion of the determination of k (number of k-means clusters) and the truncation of hierarchical clustering dendrograms would be very helpful. The typical readership of Sustainability may not appreciate these details. But those versed in NLP and machine learning in general would understand. Some reassurance of a principled approach to these clustering methods would be helpful.

A4: Thank you for the comment. Revised on lines 329-333.

The optimum number of topics could be determined with statistical indicators (performance index, perplexity and coherence), the elbow method (the relation between the number of clusters and the cost function), subjective judgment, topic interpretability and topic separation (Bholowalia & Kumar, 2014; Kodinariya & Makwana, 2013; Maier et al., 2018; Shahbazi & Byun, 2020) in topic modeling, and topic interpretability was the holistic factor.

Shahbazi, Z., & Byun, Y. C. (2020). Analysis of domain-independent unsupervised text segmentation using lda topic modeling over social media contents. Int. J. Adv. Sci. Technol, 29(6), 5993-6014.

Bholowalia, P., & Kumar, A. (2014). EBK-means: A clustering technique based on elbow method and k-means in WSN. International Journal of Computer Applications, 105(9).

Kodinariya, T. M., & Makwana, P. R. (2013). Review on determining number of Cluster in K-Means Clustering. International Journal, 1(6), 90-95.

Maier, D., Waldherr, A., Miltner, P., Wiedemann, G., Niekler, A., Keinert, A., Pfetsch, B., Heyer, G., Reber, U., Häussler, T., Schmid-Petri, H., & Adam, S. (2018). Applying LDA topic modeling in communication research: Toward a valid and reliable methodology. Communication Methods and Measures, 12(2-3), 93-118.
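
As a rough illustration of the elbow method mentioned in A4, the sketch below runs a small pure-Python k-means for several values of k and records the cost function (within-cluster sum of squared distances). Everything in it is an assumption made for illustration: the toy 2-D points, the deterministic farthest-point initialization, and the "largest relative drop" rule for locating the elbow. The authors report using IBM SPSS, so this is not their implementation.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centers(points, k):
    """Farthest-point initialization: deterministic, chosen here for reproducibility."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_cost(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = init_centers(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Toy data: three tight groups of four points each (hypothetical, not the 877 papers).
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (0, 10), (0, 11), (1, 10), (1, 11),
]

costs = {k: kmeans_cost(points, k) for k in range(1, 6)}

# Simple elbow heuristic: the k whose step gives the largest relative cost reduction.
best_k = max(range(2, 6), key=lambda k: costs[k - 1] / costs[k])
```

On this toy data the cost falls sharply up to k = 3 and only marginally afterwards, which is the "elbow" pattern. On real text data the curve is usually much smoother, which is why the response also lists interpretability and subjective judgment as criteria.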

Q5: Speaking of dendrograms, it is a little surprising (and I must admit, disappointing) that the authors did not provide these easy and intuitive visualizations of hierarchical clustering. Another visualization method that the authors failed to implement is a two- or three-dimensional representation of their topic space. Viable options include multidimensional scaling, t-SNE, and PCA. All of these methods are readily attainable in Python.

A5: Thank you for the comment. Revised on lines 329-330, section 3-2.

Visualizations of hierarchical clustering could be an alternative way to present the analytical results; however, this paper collected 877 journal papers, and the hierarchical clustering visualizations that we tried may confuse readers.

The hierarchical clustering and K-means clustering were carried out with IBM SPSS software at this stage.
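
For readers unfamiliar with the dendrogram truncation raised in Q4 and Q5, the following minimal sketch shows the idea: clusters are merged bottom-up, and merging stops once the closest pair of clusters is farther apart than a chosen cut height. The average-linkage criterion, the `cluster_until` helper, the toy points and the cut heights are all assumptions for illustration. The linkage settings used in SPSS are not stated in this record.

```python
import math
from itertools import combinations


def average_linkage(a, b):
    """Mean Euclidean distance over all cross-cluster point pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def cluster_until(points, max_height):
    """Agglomerative clustering that stops at a cut height.

    Returns the remaining clusters and the heights of the merges performed,
    which are the values a dendrogram would draw on its vertical axis.
    """
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        (i, j), height = min(
            (((i, j), average_linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if height > max_height:
            break  # truncate the dendrogram here
        heights.append(height)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters, heights


# Toy data: three pairs of nearby points (hypothetical).
points = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 10), (10, 11)]

clusters_low, heights_low = cluster_until(points, max_height=3.0)
clusters_high, heights_high = cluster_until(points, max_height=8.0)
```

Cutting low keeps the three pairs separate, while cutting higher merges the two nearer pairs. Reporting the cut height or the resulting number of clusters is the kind of "principled approach" detail the reviewer asks for.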

Q6: The authors' command of English is impressive, and stronger than my choice to recommend moderate changes might otherwise suggest. There are just enough words and expressions in this highly technical approach to the subject to warrant an appointment with MDPI's English-language editing service. Given the thoughtfulness and efficient execution of this project, a session with a professional language editor should be worth the trouble and (modest) expense.

A6: Thank you for the comment. The full paper was sent to a native English speaker for revision.

Q7: Congratulations on a fine work of scholarship.

A7: Thank you.

Reviewer 3 Report

The following technologies are applied in the article: text mining, clustering analysis, and Latent Dirichlet allocation techniques. The content and structure of the article are good. First a conceptual model is presented, and then its realization is shown. The methods are applied correctly. It is proven that TM can be effectively applied to problems with the classification of topics.

My recommendation to the authors is to show more clearly what their contributions are, whether in the proposed methodology or in its application.

Author Response

Q1: My recommendation to the authors is to show more clearly what their contributions are, whether in the proposed methodology or in its application.

A1: Thank you for the comment. Revised on lines 575-579.

The contribution of this research is to provide an alternative framework for automatic document classification in the field of environmental education, to compare different topic classification methods, and to emphasize the importance of involving domain experts in environmental education at the current stage.

Round 2

Reviewer 1 Report

The revised paper met the requests for improvements outlined in my previous review.
There are still minor writing errors.

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

Round 1

Reviewer 1 Report

The paper presents a cluster analysis of journal abstracts from the environmental education domain. The authors use the k-means algorithm to cluster data crawled from two selected journals and investigate the discovered clusters and their characteristic terms, including through co-word analysis. Although the approach is interesting, the paper is seriously deficient in several respects:

The overall structure of the paper is rather confusing and hurts readability. The introduction also describes basic theoretical knowledge about the TM area and the state of the art in the selected domain. Then, many concepts (e.g. TF-IDF) are described partially in two sections, which is very inconsistent. More importantly, the introduction (and the paper in general) lacks a clear statement of which research questions are being addressed and why. The overall motivation of the research is not sufficient in my opinion.
The other weak point of the paper is its use of text mining terminology. The authors repeatedly confuse the reader by mixing up the terms "classification" and "clustering". In many places, the authors describe their work as classification, which is not true, as they perform clustering of the text documents. Classification, as a predictive task using supervised learning methods, is a completely different task from clustering, which is a descriptive task using unsupervised methods. This is prevalent throughout the entire document, including the related work, where the authors describe many works using C4.5 and Bayesian learners (classification tasks). On the other hand, multiple studies dedicated to clustering of text (and also of paper abstracts) are missing.
I do not see any real reason to perform a text frequency analysis in section 3.1 and explore the dependence between the word count and the TF-IDF ratio. As the TF or TF-IDF score is itself computed from the word count, the mentioned dependency is natural and expected. Needless to say, the explanation in the text (lines 290-295) is not necessary.
Fig. 3 - I would prefer something different, more appropriate to a journal paper (and also carrying more information) than a word cloud, e.g. a table with counts (or any metric in general), or a graph visualizing the trends in the popularity of particular keywords over time.
A detailed description of the dataset used is missing. The authors state in section 2 how it was crawled, but a proper detailed description is necessary.
I also miss details about the tools used. The authors use just the generic term “python programming” multiple times. As the Python language is probably a standard for any analytical task nowadays, I would expect more precise information in a journal manuscript.
Section 3.2 is missing essential information - what kind of K-Means algorithm was used, which implementation (scikit-learn?), and most importantly - which parameters were used and how (metric, K-value)? How were they set? Was any parameter optimization method used? Why 4 clusters? What was the metric behind setting the optimal number of clusters? Were any other models considered?
Also, the justification of the methods used is missing. I would also recommend using more advanced methods suitable for text clustering (e.g. Self-Organizing Maps). In terms of data preprocessing, the text preprocessing should be explored in more detail (was any tokenization, lemmatization, stemming, etc. used?), and more current, state-of-the-art approaches (e.g. word embeddings) could perhaps also be explored.
From the results, it is difficult to say what the authors claim. The clusters appear to be overlapping (in terms of the characteristic terms for each cluster). The results make more sense once the co-word analysis is incorporated.
In general, the research suffers heavily from not following any standard knowledge discovery methodology (CRISP-DM, SEMMA, or some other). Following such a methodology could improve the focus of the research and, if used to structure the paper, its overall readability. It would help to sharpen the overall goal of the research, set the data mining goals, select the proper methods and their evaluation, understand the data, and describe the modeling and evaluation of a given task.
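
The point above about TF-IDF depending on the word count by construction can be seen from a minimal sketch of the score. It assumes the common textbook form, term frequency multiplied by log(N / document frequency). The manuscript's exact weighting variant is not stated in this record, and the `tf_idf` helper and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = (count of term in document) / (document length)
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical tokenized documents.
docs = [
    ["environmental", "education", "climate", "climate"],
    ["environmental", "education", "school"],
    ["environmental", "policy", "climate"],
]

scores = tf_idf(docs)
```

Because the raw count enters the score directly through the tf factor, a term that occurs more often in a document scores higher there, while a term present in every document (here "environmental") scores zero.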

Multiple formal issues are also present:

Text styling and formatting rules are not followed in many places
Figure referencing is also not consistent throughout the text
Figures are low-res, sometimes hard to read in my copy of the paper
Language - the manuscript contains many stylistic and grammatical errors; I would strongly recommend a language check by a native speaker or a professional service

I’m afraid, that although the idea is interesting, the manuscript in its current state is not suitable for publishing in the Sustainability journal.

Reviewer 2 Report

The article presents a study on recent trends and characteristics of environmental education journals. The study applies rather classic text mining approaches for this purpose.

The paper's structure is clear; however, the research methodology must be improved. First, the approach mostly applies old-fashioned text analysis techniques like TF-IDF and K-means. Here, many newer and better approaches would have helped to extract meaningful relations, such as LDA, Word2Vec and the graph-based document centroid calculation technique by Unger and Kubek.

The paper does not differentiate e.g. in Fig. 1 between the general methodology and implementation details (Python, libs ...) of the analyses. Fig. 2 is meaningless. The subfigures in Fig. 3 (word clouds) are nice to look at, but they bring no gain in understanding the main results. Fig. 4 is very hard to read and understand but is maybe important as a result. Fig. 4 and 5 are hard to read, even when printing the article in color on A4 letter.

The result presentation must be improved; are there any unexpected findings? You mentioned several times "taxonomy generation", however, you do NOT generate taxonomies, you extract clusters at best.

The paper contains an incredibly large number of grammatical and spelling mistakes. It already starts at the first sentence of the abstract. What are "wights", you mean "weights". Did you proofread your paper even once?

Sentences like "The K-means clustering techniques are one of the most commonly used clustering techniques for document clustering [34]," are horrible to read. You cannot explain a term (clustering in your case) with the same term.

In its current form, the paper cannot be accepted and needs to be improved extensively.

Reviewer 3 Report

The proposal is attractive. To improve it, you could address the following points: updated references in the bibliography; a qualitative methodological technique (for example, Delphi) to improve the interpretation of the results; and more discussion that engages with the conclusions of other authors.
