Enhanced TextNetTopics for Text Classification Using the G-S-M Approach with Filtered fastText-Based LDA Topics and RF-Based Topic Scoring: fasTNT

Voskergian, Daniel; Jayousi, Rashid; Yousef, Malik

doi:10.3390/app14198914

by Daniel Voskergian ¹,*, Rashid Jayousi ² and Malik Yousef ³

¹ Computer Engineering Department, Al-Quds University, Jerusalem 20002, Palestine

² Computer Science Department, Al-Quds University, Jerusalem 20002, Palestine

³ Information System Department, Zefat Academic College, Zefat 1320611, Israel

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(19), 8914; https://doi.org/10.3390/app14198914

Submission received: 18 June 2024 / Revised: 6 July 2024 / Accepted: 15 July 2024 / Published: 3 October 2024

(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

TextNetTopics is a novel topic modeling-based topic selection approach that finds highly ranked discriminative topics for training text classification models, where a topic is a set of semantically related words. However, it suffers from several limitations, including the retention of redundant or irrelevant features within topics, a computationally intensive topic-scoring mechanism, and a lack of explicit semantic modeling. In order to address these shortcomings, this paper proposes fasTNT, an enhanced version of TextNetTopics grounded in the Grouping–Scoring–Modeling approach. FasTNT aims to improve the topic selection process by preserving only informative features within topics, reforming LDA topics using fastText word embeddings, and introducing an efficient scoring method that considers topic interactions using Random Forest feature importance. Experimental results on four diverse datasets demonstrate that fasTNT outperforms the original TextNetTopics method in classification performance and feature reduction.

Keywords:

topic selection; topic modeling; feature selection; machine learning; text classification; semantics; fasTNT

1. Introduction

Text classification is an important research direction in natural language processing and text mining. It focuses on categorizing vast amounts of unstructured documents by assigning predefined categories to unlabeled documents based on their content. One of the critical challenges in text datasets is their sparsity and high dimensionality. For instance, a single corpus can contain tens or even hundreds of thousands of features, leading to the well-known curse of dimensionality [1].

In this respect, high-dimensional datasets with excessive features often contain irrelevant, noisy, and redundant data, which degrades classification performance in machine learning tasks (e.g., through overfitting). In addition, they strain memory, require high computing power to fit a model, and limit the model’s interpretability. Therefore, selecting a representative subset of features by utilizing an efficient feature selection method is necessary for constructing a robust and reliable classification model in a feasible amount of time [2].

Feature selection (FS) involves extracting a subset of relevant features with high discriminative ability while discarding noisy and redundant ones from the original feature set. Based on their interaction with a machine learning model, FS methods are categorized into filter, wrapper, and embedded methods [3].

The filter method selects features based on statistical measures that reflect their importance, such as the chi-square test, odds ratio, mutual information, and correlation coefficient. This method is independent of any learning algorithm and requires less computational time [4].

The wrapper method uses search strategies to find candidate feature subsets and selects the best subset based on the predictive performance of a pre-chosen classifier. Due to their iterative learning steps and cross-validation, wrapper methods are more computationally expensive than filter methods but are considered more accurate. Examples of wrapper methods include genetic algorithms, sequential feature selection algorithms, and recursive feature elimination algorithms [4].

The embedded method embeds feature selection within the training process of a machine learning model. Random forest is an example of an embedded method. Unlike wrapper methods, embedded methods are computationally less intensive because they do not need to assess feature sets iteratively. They also outperform filter methods by considering interactions with the learning algorithm. However, a limitation is their model-specific nature, which restricts their applicability to other learning models [5].

A recent trend along this line is to employ topic modeling (TM) as an innovative feature reduction approach to avoid the well-known “curse of dimensionality”. Topic modeling is an unsupervised machine learning procedure for extracting latent variables or abstract topics from large training datasets. Topic modeling reveals concealed patterns in data and provides a meaningful, latent representation of documents. This automated approach facilitates the organization, understanding, and summarization of vast textual data, thereby enriching our understanding of the underlying themes within the dataset [6].

Topic modeling has been widely used in various domains, particularly in text classification, as a method for feature projection. It works by calculating topics as distributions of words and then representing each document as a distribution across these topics. This topic-to-document distribution matrix serves as an input for training document classifiers [7]. Despite its widespread use in document representation, topic modeling in the context of feature selection has only been explored in the existing literature to a limited extent.

A pioneering algorithm that fills this gap, drawing inspiration from [8], is TextNetTopics [9], a topic modeling-based feature selection method that is specifically crafted for text classification. It relies on Latent Dirichlet Allocation (LDA) as its core topic modeling technique, which identifies latent topics. These latent topics are composed of words that are semantically related. The algorithm assigns a score to each topic, reflecting its discriminative power based on a machine learning algorithm, and selects the top topics with the highest scores. These selected topics form a subset of distinctive words for binary classification scenarios. Subsequently, this subset of topics is used to train the classifier. A more refined version, TextNetTopics Pro [10], has also been developed explicitly for short-text classification. Both algorithms are based on the Grouping–Scoring–Modeling (G-S-M) approach [11]. For an extensive review of feature selection approaches based on the grouping of features, readers are encouraged to refer to [12,13].

However, TextNetTopics comes with three main shortcomings:

Inclusion of Redundant or Irrelevant Features. TextNetTopics selects r top-ranked topics extracted by Latent Dirichlet Allocation (LDA) to train a machine learning model, where a topic consists of a set of words. However, a significant drawback of this approach is that the selected topics may include redundant or irrelevant features, which increase the size of the final feature set while negatively impacting classification performance.
Computational Burden in Topic Scoring. TextNetTopics employs a topic-scoring mechanism that can become a computational burden as the number of topics increases. This mechanism involves training a Random Forest model on each topic independently using the Monte Carlo cross-validation (MCCV) approach and assigning a score based on a specific mean performance metric (e.g., accuracy or F1-score). Moreover, the score does not account for interactions between the features residing in different topics.
Lack of Explicit Semantic Modeling. LDA does not explicitly model semantic relationships between words; it indirectly captures semantic information by grouping words that frequently co-occur in similar contexts within the dataset based on statistical patterns.

The main objective of this study is to propose fasTNT, an enhancement of TextNetTopics based on the G-S-M approach, to overcome the mentioned limitations. The main contributions of this research paper can be summarized as follows:

This study addresses the issue of feature redundancy and relevancy within topics by introducing a robust filtering component that leverages Random Forest feature importance values. By systematically removing the least informative features from LDA topics, the proposed method retains only the most relevant features for the classification task. This selective feature retention ensures that the resulting feature set is both concise and highly discriminative, leading to more efficient computations and improving overall classification performance and generalization capabilities.
Our study introduces a more efficient approach to the scoring component that allows for the simultaneous training of all topics and scoring via leveraging the feature importance-based Random Forest. This novel approach considers all topic interactions while scoring, providing a significant advancement over the one used in TextNetTopics.
In this study, we enhance LDA topics by explicitly leveraging semantic relationships through pre-trained word embeddings. This enhancement is integrated into the grouping component, where the embeddings provide a dense, continuous representation of words that preserves their semantic similarities. By incorporating word embeddings, we refine the LDA topics to form more semantically coherent groups of words, thus improving the overall quality of the topics.

The paper is structured into the following sections: Section 2 covers related work, including relevant studies and concepts. Section 3 outlines the proposed methodology in detail. Section 4 describes the experimental work. Section 5 presents the experimental results. Finally, Section 6 discusses the conclusion and future work.

2. Related Work

2.1. Relevant Studies

In the context of text classification, topic modeling can serve as both a feature projection (FP) and a feature selection (FS) technique. Extensive studies have used TM as a feature projection method [14,15,16,17]; however, in this study, we focus on its application as a feature selection method, where the topic–word matrix output is utilized to identify a subset of relevant terms to train the text classifier.

Zhang et al. [18] conducted a study that utilized LDA with Gibbs sampling as an FS method for text classification. The authors leveraged the topic-term matrix to identify words with lower entropy to train a state-of-the-art classifier. Instead of using Gibbs sampling for estimation, Taşcı et al. [19] conducted a similar study using variational expectation maximization.

In contrast to using LDA, Al-Salemi et al. [20] employed labeled LDA (LLDA), a supervised variant of LDA, as an FS method to enhance the weak learning of Ada-Boost. LLDA takes into account the categorical structure of documents to generate topics corresponding to the number of categories. Each word is assigned a probability for each topic. According to their probabilities, a specific number or percentage of training features are selected and subsequently utilized to train the classification model.

Instead of selecting individual words, Yousef and Voskergian [9] introduced TextNetTopics, an LDA-based topic selection approach for text classification. After scoring LDA topics using a cross-validated accuracy topic-scoring mechanism, TextNetTopics selects the top r-ranked topics to train the text classifier, where each topic represents a set of semantically related terms. The aim of this approach is to maintain the benefits of feature interaction offered by topics while eliminating topics that could introduce noise and potentially degrade the model’s performance.

Building on TextNetTopics, Voskergian et al. [10] introduced TextNetTopics Pro, an innovative method designed specifically for short text classification. This approach combines topics of words and topic distribution extracted by a short text topic model (e.g., GSDMM, BTM), aiming to overcome the challenge of data sparsity in short text classification. The foundation of TextNetTopics and TextNetTopics Pro is the G-S-M approach [11].

However, TextNetTopics has three main shortcomings that limit its effectiveness in text classification tasks: First, TextNetTopics selects topics extracted by LDA with all the features they contain to train a machine learning model. This method may include redundant or irrelevant features, which can negatively impact classification performance in addition to increasing the size of the feature subset. Second, TextNetTopics employs a topic-scoring mechanism that becomes more computationally intensive as the number of topics increases due to the independent training of each topic and the use of cross-validated accuracy as the scoring metric. Finally, LDA does not explicitly model semantic relationships between words: it captures semantic information indirectly through statistical co-occurrence patterns.

This study aims to overcome these limitations by proposing fasTNT, an advancement of TextNetTopics which incorporates a filtering component to remove the least informative features using random forest feature importance values, employs a more efficient scoring mechanism that considers topic interactions, and leverages pre-trained word embeddings to model semantic relationships between words explicitly.

The following subsections introduce essential concepts relevant to our study.

2.2. Latent Dirichlet Allocation

Latent Dirichlet Allocation is a widely used and extensively studied method in topic modeling literature. It is a fully generative, probabilistic, unsupervised model that helps discover hidden thematic structures in a dataset. The core concept of LDA is to represent each document as a combination of topics, with each topic being a probability distribution over a fixed vocabulary. The distribution of topics in each document effectively summarizes the document’s content. LDA extends probabilistic latent semantic analysis (PLSA) by incorporating a Dirichlet prior in both the topic–word and document–topic distributions, and it uses Bayesian inference rather than maximum likelihood estimation to determine these distributions [21,22].

2.3. G-S-M Approach

The G-S-M (Grouping–Scoring–Modeling) approach offers a novel perspective on feature selection compared to traditional methods. While traditional techniques assess individual feature importance in isolation, G-S-M organizes features into groups based on prior knowledge (such as microRNA–target and protein–protein interactions), considering their interdependence. This approach provides a more meaningful and holistic evaluation of feature relevance and utility. By combining machine learning with domain knowledge, G-S-M effectively groups and scores features based on their association with a binary-labeled target, making it a unique and valuable tool in feature selection. Embedded feature selection is then performed using Monte Carlo cross-validation (MCCV) to select the most discriminative feature groups (the highest-ranked groups) to train the model [11].

The G-S-M approach formed the basis for developing tools like maTE [23], PriPath [24], GediNET [25], miRcorrNet [26], 3Mint [27], GeNetOntology [28], TextNetTopics [9], TextNetTopics Pro [10], microBiomeGSM [29], miRGediNET [30], miRdisNET [31], miRModuleNet [32], and CogNet [33], which integrate biological networks and prior knowledge to provide a comprehensive understanding of genetic interactions.

2.4. Random Forest

Random Forest (RF) is a widely used machine learning algorithm introduced by Breiman in 2001 that uses randomization to create an ensemble of unpruned decision trees [34]. Each decision tree is constructed during model training using a different bootstrap sample of the training data, consisting of approximately two-thirds of the total observations and a randomly selected subset of variables. Samples that are not used in the training process are referred to as “Out-of-Bag” (OOB) samples and are utilized to evaluate the generalization performance of the classifiers. The final prediction for an input sample is determined by majority voting across the trees [35].
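As a small worked check of the bootstrap proportion mentioned above: when n observations are drawn n times with replacement, each observation is left out with probability (1 − 1/n)^n, which approaches 1/e ≈ 0.368, so roughly 63% of the distinct observations are in-bag. The sketch below is illustrative only; the function names are our own and it is not part of the authors' workflow.

```python
import random


def expected_in_bag_fraction(n: int) -> float:
    """Expected share of distinct observations in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulate_bootstrap(n: int, seed: int = 0) -> tuple[set[int], set[int]]:
    """Draw n indices with replacement; return (in-bag, out-of-bag) index sets."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}
    out_of_bag = set(range(n)) - in_bag
    return in_bag, out_of_bag


if __name__ == "__main__":
    n = 10_000
    in_bag, oob = simulate_bootstrap(n)
    print(f"expected in-bag share:  {expected_in_bag_fraction(n):.3f}")
    print(f"simulated in-bag share: {len(in_bag) / n:.3f}")
    print(f"simulated OOB share:    {len(oob) / n:.3f}")
```

The out-of-bag set returned here plays the role of the OOB samples on which a tree can be evaluated without a separate hold-out split.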

2.5. Feature-Importance-Based RF

In addition to its classification capabilities, Random Forest provides an internal measure of feature importance by scoring the features during classification. These scores are based on the features’ contribution to the classification process within the Random Forest model. They can be utilized to select the most relevant feature subset that highly impacts the model and most strongly influences the predictions.

The RF algorithm calculates feature importance from the training model using the mean decrease in accuracy (MDA) and the mean decrease in Gini (MDG). MDA quantifies how much model accuracy decreases when the values of a feature are randomly shuffled (feature values are changed). MDG is the mean of a feature’s total reduction in node impurity, weighted by the proportion of observations reaching that node in every decision tree. The larger the decrease in both methods, the more important the feature is considered [36].
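To make the MDG definition concrete, the following minimal sketch computes the Gini impurity of a node and the impurity decrease of a single split, weighted by the share of observations that reach the node. Summing this quantity over all nodes that split on a feature and averaging over the trees gives that feature's MDG. The helper names and the toy labels are assumptions made for illustration; this is not the Random Forest implementation used in the study.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini_decrease(
    parent: Sequence[int],
    left: Sequence[int],
    right: Sequence[int],
    n_total: int,
) -> float:
    """Impurity decrease of one split, weighted by the share of samples at the node."""
    n_parent = len(parent)
    children = (len(left) / n_parent) * gini(left) + (len(right) / n_parent) * gini(right)
    return (n_parent / n_total) * (gini(parent) - children)


if __name__ == "__main__":
    # A perfectly separating split at the root of a tree trained on 4 samples.
    print(weighted_gini_decrease([0, 0, 1, 1], [0, 0], [1, 1], n_total=4))
    # The same split at a node reached by only half of 8 training samples.
    print(weighted_gini_decrease([0, 0, 1, 1], [0, 0], [1, 1], n_total=8))
```

The second call yields half the value of the first, which shows why splits deep in a tree, reached by few observations, contribute less to a feature's importance.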

Random Forest’s feature importance measures offer a distinct advantage over univariate screening methods because they evaluate the impact of each feature both individually and in interaction with other features [37].

2.6. fastText

fastText is an extension of the Skip-Gram model utilized in Word2vec. Unlike Word2vec, which processes whole words in the neural network, fastText, developed by the Facebook research team, decomposes words into multiple character n-grams or sub-words. After training the neural network, we acquire embeddings for all n-grams observed in the training dataset. For instance, the trigrams for the word “medical” are “med”, “edi”, “dic”, “ica”, and “cal”. The word embedding vector for “medical” is then calculated as the sum of the embeddings of these n-grams.

fastText is effective in representing rare or out-of-vocabulary words (words not seen in a training corpus) because there is a higher chance that some of their n-grams also occur in other words. It is important to note that Word2vec does not offer vector representations for words that are absent from the training corpus [38,39].
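The sub-word decomposition described above can be sketched in a few lines. The example reproduces the trigrams of “medical” and composes a word vector by summing n-gram vectors. The two-dimensional vectors are made up for illustration, and the function names are our own. The released fastText library additionally wraps words in boundary markers (`<` and `>`) and uses a range of n-gram lengths, which the optional `boundaries` flag only hints at.

```python
def char_ngrams(word: str, n: int = 3, boundaries: bool = False) -> list[str]:
    """Return the character n-grams of a word, optionally with '<' and '>' markers."""
    token = f"<{word}>" if boundaries else word
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def compose_vector(word: str, ngram_vectors: dict[str, list[float]], dim: int) -> list[float]:
    """Sum the vectors of all known n-grams of a word (unknown n-grams are skipped)."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vector = ngram_vectors.get(gram)
        if vector is not None:
            total = [a + b for a, b in zip(total, vector)]
    return total


if __name__ == "__main__":
    print(char_ngrams("medical"))  # ['med', 'edi', 'dic', 'ica', 'cal']
    # Toy vectors (hypothetical values): an unseen word still gets a vector
    # because it shares the n-gram "med" with words seen during training.
    toy = {"med": [1.0, 0.0], "cal": [0.0, 2.0]}
    print(compose_vector("medic", toy, dim=2))
```

Because “medic” shares “med” with “medical”, it receives a non-zero vector even if it never occurred in the training corpus, which is the out-of-vocabulary behavior the paragraph describes.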

3. The Proposed Methodology

Let D represent the document–word dataset. We split this dataset into two sub-datasets: D_train and D_test. We utilize D_train to reduce the features that constitute the LDA topics, score the topic clusters, and train the model, while D_test is primarily used to test and document the final performance.

The proposed methodology is illustrated in Figure 1 and Figure 2. In general, fasTNT consists of five main components (T, F, G, S, and M). These components collaboratively aim to perform feature reduction while preserving semantic relations between words through topic clusters. Additionally, they work to minimize the number of irrelevant topic clusters, selecting only the significant ones before the training phase. This process ultimately enhances classification performance.

3.1. T Component: Extracting LDA Topics

At first, fasTNT employs Latent Dirichlet Allocation (LDA), an unsupervised generative probabilistic topic modeling algorithm, to extract hidden semantic themes, known as topics, from a preprocessed collection of documents (refer to Figure 3). This algorithm requires two main parameters from the user: the number of topics (k) and the number of words per topic (m). After LDA training, it produces two byproducts: the topic–word matrix (TW) and the document–topic matrix (DT). The document–topic matrix indicates the extent to which each document is associated with each topic. In contrast, the topic–word matrix indicates the extent to which each word is associated with each topic. fasTNT uses only the topic–word matrix for further analysis.
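As a rough illustration of how a topic–word matrix yields topics of m words each, the sketch below picks the m highest-weighted words per row. The matrix values, the vocabulary, and the function name are hypothetical; in the study itself the topics come from KNIME's LDA implementation rather than from code like this.

```python
def top_words_per_topic(
    tw: list[list[float]], vocabulary: list[str], m: int
) -> list[list[str]]:
    """For each topic (row of TW), return its m highest-weighted words."""
    topics = []
    for row in tw:
        # Column indices sorted by descending word weight within this topic.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        topics.append([vocabulary[j] for j in ranked[:m]])
    return topics


if __name__ == "__main__":
    vocabulary = ["gene", "cell", "data"]      # toy vocabulary
    tw = [
        [0.1, 0.6, 0.3],                       # topic 1 word weights
        [0.5, 0.2, 0.3],                       # topic 2 word weights
    ]
    print(top_words_per_topic(tw, vocabulary, m=2))
    # [['cell', 'data'], ['gene', 'data']]
```

Note that “data” appears in both toy topics; overlaps of this kind are what the next component removes when it builds a list of distinct words.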

3.2. F Component: Filtering Features from LDA Topics

fasTNT performs the following steps in order to remove redundant and irrelevant features from the LDA topics (refer to Figure 4):

Initially, fasTNT utilizes the topic–word matrix (TW) to generate a word list (W). The W will consist of n distinct words present across all topics. In other words, fasTNT eliminates duplicate words across different topics in TW, retaining unique ones.

Afterward, fasTNT leverages the training document–word dataset (D_train), containing d documents and z words, where z represents the total number of distinct words in the preprocessed dataset, along with the word list (W), comprising n distinct words, to generate a two-class d × (n + 1) dimensional sub-dataset (denoted as D_Wtrain). This sub-dataset consists of the n distinct words drawn from all topics, together with the corresponding class label.

The D_Wtrain sub-dataset is then used by fasTNT to train a Random Forest (RF) classifier using the stratified Monte Carlo cross-validation (MCCV) approach. This approach randomly chooses q percent (e.g., 70%) of the D_Wtrain dataset to train the RF and repeats this j (e.g., 10) times. The primary objective of this training phase is to determine the feature importance values for each word in W. The importance of each feature (word) is estimated by averaging the results from the j executed iterations.

The MCCV process ensures robust evaluation and results in a mean feature weights list that indicates the importance of each feature (word) in W, denoted as FI = {FI_w1, FI_w2, …, FI_wn}, where n is the total number of features in W, and FI_wi represents the mean importance of feature w_i in W.

According to the feature importance values obtained (FI), words in W are ranked accordingly. The newly ranked word list is denoted as W_R.

To this end, a user-specified v percent of words with the lowest feature importance values are discarded from the ranked word list (W_R). This refined list, now referred to as the final word list (W_F), consists of f words (where f < n). The feature importance values for the words in W_F are retained for subsequent analysis, forming a new feature weight list referred to as FI_F.

The W_F will be used in the next G component, while the FI_F will be utilized in the S component for topic scoring purposes.
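The filtering steps above can be sketched as three small functions: building W, averaging importances over the MCCV runs to obtain FI, and discarding the lowest v percent to obtain W_F and FI_F. The per-run importance dictionaries are made-up stand-ins for what a Random Forest would return, all names are illustrative, and truncating the number of dropped words with `int()` is our assumption since the source does not state a rounding rule.

```python
from statistics import fmean


def build_word_list(topics: list[list[str]]) -> list[str]:
    """Distinct words across all topics, in first-occurrence order (the list W)."""
    seen: dict[str, None] = {}
    for topic in topics:
        for word in topic:
            seen.setdefault(word, None)
    return list(seen)


def mean_importance(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average each word's importance over the j MCCV runs (the list FI)."""
    return {word: fmean(run[word] for run in runs) for word in runs[0]}


def filter_words(fi: dict[str, float], v: float) -> tuple[list[str], dict[str, float]]:
    """Rank words by importance (W_R), drop the lowest v percent, return (W_F, FI_F)."""
    ranked = sorted(fi, key=fi.get, reverse=True)
    n_drop = int(len(ranked) * v / 100)  # assumption: round down
    kept = ranked[: len(ranked) - n_drop]
    return kept, {word: fi[word] for word in kept}


if __name__ == "__main__":
    topics = [["gene", "cell", "data"], ["data", "model", "cell"]]
    print(build_word_list(topics))  # ['gene', 'cell', 'data', 'model']
    runs = [  # hypothetical importances from two MCCV iterations
        {"gene": 0.4, "cell": 0.3, "data": 0.2, "model": 0.1},
        {"gene": 0.6, "cell": 0.2, "data": 0.1, "model": 0.1},
    ]
    w_f, fi_f = filter_words(mean_importance(runs), v=25)
    print(w_f)  # ['gene', 'cell', 'data']
```

In a real run, each dictionary in `runs` would hold the importances of one Random Forest fitted on a stratified 70% sample of D_Wtrain.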

3.3. G Component: Reforming LDA Topics as Topic Clusters

Reconstructing topics as extracted by LDA with the remaining features in W_F may result in topics of unequal sizes (refer to Figure 4). This imbalance can pose challenges in the scoring component, potentially introducing bias. Smaller topics with a few highly important features might score higher than larger topics with more averaged importance values, potentially overshadowing larger topics in the model training process.

To tackle this issue, fasTNT employs an extended approach called same-size k-means [40], which strives to create k compact clusters of equal size b, where b = f/k. Here, k corresponds to the number of LDA topics, and f represents the total number of words in the final word list (W_F). It is important to note that the cluster size is less than the LDA topic size (b < m).

In this aspect, fasTNT performs word clustering using word embedding (WE) to preserve the semantic relation between words in each cluster since the distances between word embedding vector representations can reflect the semantic similarities between words.

In this approach, fasTNT leverages a pre-trained fastText model (refer to Figure 5), a semantic information source about features, to generate word embeddings for all words in W_F. Specifically, fastText provides embeddings for n-grams, and the word embedding vector for a complete word is calculated as the sum of these n-gram embeddings. After transforming W_F into a word embedding matrix, it is used as input for the same-size k-means to create multiple preliminary clusters of words that share semantic similarities or relatedness.
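To illustrate what equal-size clustering of word vectors involves, here is a deliberately simple greedy variant: in each iteration, (word, cluster) pairs are visited from closest to farthest, and a word joins the nearest cluster that still has free capacity. This is a sketch under our own assumptions, not the same-size k-means implementation cited as [40], and the two-dimensional points stand in for 300-dimensional fastText embeddings.

```python
import math
import random


def sq_dist(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: list[list[float]]) -> list[float]:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(p[d] for p in points) / len(points) for d in range(len(points[0]))]


def same_size_clusters(
    vectors: list[list[float]], k: int, iterations: int = 10, seed: int = 0
) -> list[int]:
    """Greedy capacity-constrained k-means; each cluster holds at most ceil(f / k) points."""
    f = len(vectors)
    capacity = math.ceil(f / k)
    rng = random.Random(seed)
    centers = [list(vectors[i]) for i in rng.sample(range(f), k)]
    assignment = [-1] * f
    for _ in range(iterations):
        # Every (distance, point, cluster) candidate, closest pairs first.
        candidates = sorted(
            (sq_dist(vectors[i], centers[c]), i, c) for i in range(f) for c in range(k)
        )
        new_assignment = [-1] * f
        sizes = [0] * k
        for _, i, c in candidates:
            if new_assignment[i] == -1 and sizes[c] < capacity:
                new_assignment[i] = c
                sizes[c] += 1
        if new_assignment == assignment:
            break  # assignments are stable
        assignment = new_assignment
        for c in range(k):
            members = [vectors[i] for i in range(f) if assignment[i] == c]
            if members:
                centers[c] = centroid(members)
    return assignment


if __name__ == "__main__":
    points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    print(same_size_clusters(points, k=2))
```

When f is divisible by k, every cluster ends up with exactly b = f/k members, which is the property the scoring component relies on.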

By explicitly leveraging semantic information through external knowledge sources (e.g., word embeddings pre-trained over millions of documents), fasTNT aims to enhance the semantic cohesion of LDA topics. This enhancement can be logically explained since LDA does not explicitly model semantic relationships between words; it indirectly captures semantic information by grouping words that frequently appear together in similar contexts over the dataset based on statistical patterns.

We refer to the produced clusters as topic clusters (TC) to differentiate them from the original topics (T) extracted by LDA. The TC will form an input for the S component.

3.4. S Component: Scoring Topic Clusters

At this phase, fasTNT evaluates each topic cluster’s predictive power and importance for the classification tasks. This process is achieved using a novel topic-scoring approach called feature importance-based topic scoring (FITS).

fasTNT uses two main data inputs for scoring: the topic clusters (TC) and the final mean feature weight list (FI_F). In order to score a specific topic cluster (TC_i) comprising a set of words, TC_i = {w_i1, w_i2, …, w_ib}, where b is the number of words in the cluster, and i is the topic cluster number ranging from 1 to k, we first extract from the final mean feature weight list (FI_F) the feature importance values associated with those words exclusively. Let FI_F(TCi) = {FI_F(wi1), FI_F(wi2), …, FI_F(wib)} be the subset of feature weights corresponding to the words in topic cluster TC_i. Afterward, we calculate the topic cluster score S_TCi using an aggregation function S_TCi = Agg(FI_F(TCi)). The aggregation function may be the arithmetic mean, median, geometric mean, or another suitable measure. Figure 6 illustrates this scoring method, visually representing how the scoring process works.

For this study, the geometric mean is chosen as the topic cluster weighting function (refer to Equation (1)) to dampen the impact of extreme values in the set of feature importance values FI_F(TCi). This approach ensures that the topic cluster score is robust and reflects the topic’s overall importance in the context of the dataset.

S_{TC_i} = \left( \prod_{j=1}^{b} FI_F(w_{ij}) \right)^{\frac{1}{b}} \qquad (1)

These topic clusters are then ranked in descending order according to their computed scores.
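The scoring and ranking step can be sketched as follows. The geometric mean is computed through logarithms to avoid underflow when many small importance values are multiplied. The cluster names, the importance values, and the convention of returning 0 when a cluster contains a non-positive importance are our assumptions for illustration.

```python
import math


def geometric_mean(values: list[float]) -> float:
    """Geometric mean of positive values; 0.0 if any value is non-positive (assumption)."""
    if min(values) <= 0:
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def score_topic_clusters(
    clusters: dict[str, list[str]], fi_f: dict[str, float]
) -> tuple[dict[str, float], list[str]]:
    """Score each topic cluster by the geometric mean of its words' importances
    and return (scores, cluster names ranked by descending score)."""
    scores = {
        name: geometric_mean([fi_f[word] for word in words])
        for name, words in clusters.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return scores, ranking


if __name__ == "__main__":
    fi_f = {"gene": 0.04, "cell": 0.01, "data": 0.09, "model": 0.01}  # toy FI_F
    clusters = {"TC1": ["gene", "cell"], "TC2": ["data", "model"]}
    scores, ranking = score_topic_clusters(clusters, fi_f)
    print(scores)   # TC1 is about 0.02, TC2 about 0.03
    print(ranking)  # ['TC2', 'TC1']
```

Since every cluster has the same size b, the scores are directly comparable across clusters.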

3.5. M Component: Stepwise Topic Cluster Aggregation and Model Training

In the final phase, fasTNT cumulatively assesses the ranked topic clusters, beginning with the top-ranked topic cluster (TC_R1) and incorporating the remaining ones in a stepwise manner (i.e., TC_R2, TC_R3, and so on) until all topic clusters up to the kth rank are merged. This process creates an aggregated set of words for each accumulation, from which two-class sub-datasets are extracted from the training and testing datasets (D_train and D_test). Subsequently, fasTNT manages the training and testing of the machine learning model, employing Random Forest. Ultimately, it identifies the optimal topic cluster subset from all potential subsets, selecting the subset with the highest performance, which indicates the most discriminative power for the text classification task. This subset consists of r topic clusters {TC_R1, TC_R2, …, TC_Rr}, where r ≤ k. Figure 7 and Figure 8 depict the working mechanism of the M component over the first two iterations.

For instance, in the first iteration (refer to Figure 7), fasTNT utilizes only the top-ranked topic cluster. The words within this topic cluster are used to create training and testing sub-datasets from D_train and D_test, respectively. Each sub-dataset has dimensions of d_train × (b + 1) and d_test × (b + 1), where d_train and d_test denote the number of documents in the training and testing datasets, respectively, and b denotes the number of words in the topic cluster. Subsequently, fasTNT trains a Random Forest model using the training sub-dataset and evaluates its performance using the testing sub-dataset.

In the second iteration (refer to Figure 8), fasTNT combines the top-ranked and second-ranked topic clusters. The words from these two clusters are aggregated and used to generate new training and testing sub-datasets from D_train and D_test, with each sub-dataset having dimensions of d_train × (2b + 1) and d_test × (2b + 1), respectively. The Random Forest model is again trained and evaluated with these extracted sub-datasets.

This iterative process continues, progressively accumulating topic clusters until all k topic clusters have been incorporated. Each iteration refines the model by incrementally adding topic clusters, aiming to identify the optimal subset of topic clusters from all potential subsets that result in the highest performance (i.e., accuracy). This optimal subset is then used to train the final model, ensuring that the most relevant topic clusters are included for superior classification accuracy.
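The stepwise accumulation can be sketched as a loop over the ranked topic clusters. The `evaluate` callback is a hypothetical stand-in for "build the sub-datasets from the given words, train a Random Forest, and return its accuracy"; the toy scoring function in the demo is invented purely so that the example runs without a dataset.

```python
from typing import Callable


def stepwise_selection(
    ranked_clusters: list[list[str]],
    evaluate: Callable[[list[str]], float],
) -> tuple[int, float, list[float]]:
    """Accumulate clusters in rank order and return (best r, best score, score history)."""
    best_r, best_score = 0, float("-inf")
    words: list[str] = []
    history: list[float] = []
    for r, cluster in enumerate(ranked_clusters, start=1):
        words.extend(cluster)              # aggregate the top-r clusters
        score = evaluate(list(words))      # e.g., accuracy on the testing sub-dataset
        history.append(score)
        if score > best_score:             # ties keep the smaller subset
            best_r, best_score = r, score
    return best_r, best_score, history


if __name__ == "__main__":
    useful = {"gene", "cell"}

    def toy_evaluate(words: list[str]) -> float:
        # Made-up score: reward useful words, penalize feature-set size.
        return len(useful & set(words)) - 0.1 * len(words)

    ranked = [["gene", "x"], ["cell", "y"], ["z", "w"]]
    best_r, best_score, history = stepwise_selection(ranked, toy_evaluate)
    print(best_r, history)
```

Using a strict comparison means that when two subsets perform equally well, the one with fewer topic clusters, and therefore fewer features, is retained.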

4. Experimental Work

4.1. Datasets

In this study, four distinct datasets were used for the assessment.

The WOS-5736 dataset [41] consists of three higher-level classes containing 5736 documents. We converted the dataset into two balanced classes for the empirical evaluation of our proposed approach. The class with the highest number of documents, comprising 2847 abstracts, was chosen as the positive class. The remaining two classes, with 1597 and 1292 abstracts, were combined to create the negative class.
The LitCovid dataset [42] consists of 16,127 multi-label records across five categories. For our study, we focused on single-label records, resulting in the following distribution: forecasting (119), case-report (1334), prevention (6513), treatment (1632), and mechanism (429). To apply fasTNT, we transformed the dataset into a binary class format, considering prevention as the positive class and the remaining categories as the negative class. After that, we balanced the dataset by downsampling the positive class to 3500 records, resulting in a final dataset size of 7014.
The arXiv paper abstract dataset [43], a multi-label text dataset from Kaggle, was used to create a binary balanced class dataset. The positive class comprised papers in the Computer Vision and Pattern Recognition category, totaling 8822 samples. The negative class was formed by merging the remaining fields, resulting in 8341 samples. To preserve the distribution, we applied stratified sampling for the non-relevant fields.
The Multi-Label dataset [44] includes 20,972 documents (abstracts and titles) categorized into Quantitative Finance, Quantitative Biology, Physics, Mathematics, Computer Science, and Statistics. For evaluation purposes, we chose 3500 documents labeled only as Computer Science to represent the positive class. We randomly selected 3500 documents without the Computer Science label to form the negative class.

4.2. Text Preprocessing

Text preprocessing is a fundamental task that improves a dataset’s quality and reduces the dimensionality of the vector space model representation used for text classification, given the presence of noisy and irrelevant data in raw documents. This process involves several systematic Natural Language Processing (NLP) tasks executed using KNIME workflows [45,46]. These tasks include removing all punctuation marks, filtering out numerical digits, eliminating words with fewer than three characters, converting all words to lowercase, removing stop words using the Stop Word Filter node, stemming words using the Snowball stemming library, discarding rare terms with a 1% minimum document frequency, and retaining only English texts, determined with the help of a language detector from the Tika collection.

We employed the relative term frequency method to transform the text into the vector space model (VSM), yielding a document–term matrix output. In the relative term frequency scheme, the weight of a specific feature t_i within the feature space is determined by the frequency of that feature in a particular document d_j, divided by the total number of words in that document. Subsequently, we utilized the Random Forest algorithm both to extract the importance of topics and to construct the model.
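The weighting scheme amounts to w(t_i, d_j) = count(t_i in d_j) / |d_j|. A minimal sketch of the resulting document–term matrix follows; the function name `relative_tf_matrix` and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def relative_tf_matrix(tokenized_docs):
    """Build a document-term matrix whose entries are relative term frequencies."""
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    matrix = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        # Each weight is the term count divided by the document length.
        matrix.append([counts[t] / total if total else 0.0 for t in vocabulary])
    return vocabulary, matrix

toy_docs = [["topic", "model", "topic", "word"], ["word", "vector"]]
vocabulary, matrix = relative_tf_matrix(toy_docs)
print(vocabulary)  # ['model', 'topic', 'vector', 'word']
print(matrix)      # [[0.25, 0.5, 0.0, 0.25], [0.0, 0.0, 0.5, 0.5]]
```

Each row sums to 1 for a non-empty document, so documents of different lengths become directly comparable.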

4.3. Experimental Setup

In order to perform comprehensive natural language preprocessing and to compare the performance of TextNetTopics with that of our novel methodology, fasTNT, we utilized KNIME workflows available on KNIME Hub [46] and GitHub [45]. We employed KNIME’s parallel thread implementation of Latent Dirichlet Allocation to extract topics [47]. To facilitate a fair comparison with the results reported by TextNetTopics [9], we opted for a configuration of twenty topics, each containing twenty words. This setup was selected as it had shown optimal performance in the context of TextNetTopics. Furthermore, we utilized pre-trained word embedding vectors for the English language, trained using Continuous Bag of Words (CBOW) with position-weights on the Common Crawl and Wikipedia corpus using fastText, with vectors of dimensionality 300 [48]. We used the “Same-size k-means” algorithm for word clustering, which was implemented as a KNIME node [40].

We utilized stratified Monte Carlo cross-validation with ten iterations in all the experiments conducted for this study. In each iteration, this technique randomly partitioned each dataset into a 90% training set and a 10% testing set. The process was repeated ten times, and the results were averaged to provide a more robust evaluation of the model’s performance. Stratification preserved the class distribution in every split and prevented class bias in the evaluation process.
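The splitting scheme can be sketched with the standard library alone. The generator below is an illustrative reimplementation, not the KNIME node used in the study; its name, the seed, and the rounding of the per-class test size are assumptions.

```python
import random
from collections import defaultdict

def stratified_monte_carlo_splits(labels, n_iterations=10, test_fraction=0.1, seed=0):
    """Yield (train_indices, test_indices) for repeated stratified random splits."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for _ in range(n_iterations):
        train, test = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Take the same fraction from every class to preserve class proportions.
            n_test = round(len(shuffled) * test_fraction)
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield train, test

toy_labels = [1] * 50 + [0] * 50
splits = list(stratified_monte_carlo_splits(toy_labels))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10
```

Unlike k-fold cross-validation, the ten test sets are drawn independently and may overlap. In scikit-learn, `StratifiedShuffleSplit(n_splits=10, test_size=0.1)` provides comparable behavior.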

4.4. Evaluation Metrics

In order to evaluate the effectiveness of fasTNT compared to TextNetTopics, we employed various performance metrics, including Accuracy, Precision, Recall, F1-score, and Area Under the Curve (AUC).
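As a reminder of how the threshold-based metrics relate to each other, the sketch below derives them from the four cells of a binary confusion matrix, together with Specificity and Cohen's Kappa, which also appear in the result tables. AUC is left out because it requires ranked prediction scores rather than counts. The function name and the example counts are hypothetical, and the code assumes that no denominator is zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((tn + fn) / total) * ((tn + fp) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1, "kappa": kappa}

# Hypothetical counts for a balanced test set of 200 documents.
metrics = binary_metrics(tp=90, fp=10, tn=90, fn=10)
print(metrics)
```

For a balanced test set, chance agreement is 0.5, so Kappa is approximately 2 × Accuracy − 1, which is consistent with the relationship between the Accuracy and Cohen's Kappa columns in the tables.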

5. Experimental Results

5.1. Exploring the Impact of Feature Discarding Percentages on fasTNT Performance

In this section, we evaluate the performance of fasTNT by varying the percentage of features discarded (v%) with the lowest feature importance values from the ranked word list (W_R). The v% values tested are 0% (no reduction), 10%, 20%, 30%, 40%, 50%, 60% and 70%. As shown in Figure 9, Figure 10, Figure 11 and Figure 12, varying the v% values results in F1-score performance improvements across the accumulated top-ranked topic clusters (i.e., the top-ranked topic cluster, top two ranked topic clusters, up to the top twenty ranked topic clusters) over the case in which no feature reduction is performed at all.
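The discarding step itself is simple to express in code. The following sketch assumes that feature importance values are already available as a word-to-score mapping; the words, scores, topic contents, function names, and the choice to round the number of discarded words down are all illustrative assumptions rather than details taken from the fasTNT implementation.

```python
def discard_lowest(importance, v_percent):
    """Drop the v_percent% of words with the lowest importance; return the rest, best first."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    n_discard = len(ranked) * v_percent // 100  # rounding down is an assumption
    return ranked[: len(ranked) - n_discard]

def filter_topics(topics, retained_words):
    """Keep, inside every topic, only the words that survived the discarding step."""
    retained = set(retained_words)
    return {name: [w for w in words if w in retained] for name, words in topics.items()}

# Hypothetical importance scores and topics.
importance = {"virus": 0.20, "mask": 0.18, "vaccine": 0.15, "dose": 0.12, "trial": 0.10,
              "patient": 0.08, "study": 0.07, "result": 0.05, "paper": 0.03, "also": 0.02}
topics = {"topic_1": ["virus", "mask", "paper", "also"],
          "topic_2": ["vaccine", "dose", "trial", "study", "result"]}

retained = discard_lowest(importance, v_percent=60)
filtered = filter_topics(topics, retained)
print(retained)  # ['virus', 'mask', 'vaccine', 'dose']
print(filtered)  # {'topic_1': ['virus', 'mask'], 'topic_2': ['vaccine', 'dose']}
```

Setting `v_percent=0` leaves every topic unchanged, which corresponds to the no-reduction baseline in the experiments.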

One interesting point is that when the feature-discarding rate (v%) is set to 70%, the F1-scores across the aggregated top-ranked topic clusters are the highest within the utilized feature set size (W_F). This trend is consistently observed across all four datasets, indicating the effectiveness of discarding lower-importance features in enhancing the predictive power of topic clusters for better classification performance.

However, the maximum F1-score attained by each feature-discarding rate varies among datasets. Specifically, the maximum performance is preserved at ~97% for the WOS-5736 dataset and ~91% for the LitCovid dataset, even as the feature-discarding rate (v%) increases from 10% to 60%. This indicates the ability of fasTNT to effectively remove redundant and irrelevant features from the feature pool, thereby enhancing model efficiency. However, after v% exceeds 60%, the maximum performance for both the WOS-5736 and LitCovid datasets starts to decrease, suggesting that overly aggressive feature reduction may discard useful features. In contrast, for the MultiLabel and arXiv datasets, the achieved performance remains at ~85% and ~93%, respectively, when setting v% from 10% to 40%. However, after v% exceeds 40%, the maximum performance attained starts to decrease gradually. This suggests that, for these datasets, a higher feature-discarding rate may start to remove features that are critical for maintaining optimal classification performance.

This analysis highlights the importance of tuning the feature-discarding rate to balance removing redundant information and preserving critical features for optimal classification performance. The results underscore the necessity of dataset-specific tuning to achieve the best results, demonstrating the robustness and adaptability of fasTNT in feature selection and classification tasks.

5.2. Comparative Evaluation of fasTNT and TextNetTopics Performance

This section compares the performance of fasTNT with a 60% feature-discarding rate against TextNetTopics for the WOS-5736 and LitCovid datasets. Additionally, it evaluates the performance of fasTNT with a 40% feature-discarding rate against TextNetTopics for the MultiLabel and arXiv datasets. The chosen v% values for fasTNT (60% and 40%) are based on their ability to yield maximum performance comparable to TextNetTopics, as demonstrated in the previous section.

As shown in Table 1, Table 2 and Table 3, fasTNT demonstrates superior performance across various performance metrics (i.e., Accuracy, F1-score, Precision, Recall, AUC) across different aggregated top-ranked topic clusters derived from four datasets (WOS-5736, LitCovid, MultiLabel, and arXiv). Additionally, fasTNT achieves similar or higher performance than TextNetTopics with significantly fewer features.

For instance, with the WOS-5736 dataset (refer to Figure 13), fasTNT (v = 60%) attains an F1-score of 96.9% with just 72 words, compared to 212 words for TextNetTopics, marking a 66% reduction (140 words). Similarly, for the LitCovid dataset (refer to Figure 14), fasTNT (v = 60%) reaches an F1-score of 91.0% using 108 words, while TextNetTopics requires 262 words, resulting in a 59% reduction (154 words). For the MultiLabel dataset (refer to Figure 15), fasTNT (v = 40%) achieves an 85.1% F1-score with 113 words, compared to 197 words for TextNetTopics, indicating a 43% reduction (84 words). For the arXiv dataset (refer to Figure 16), fasTNT (v = 40%) secures a 92.8% F1-score with 90 words, whereas TextNetTopics needs 170 words, achieving a 47% reduction (80 words). These results underscore fasTNT’s efficiency in maintaining high classification performance while significantly reducing the feature set size, making it a robust solution for text classification tasks.
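The reduction figures quoted above follow from simple arithmetic on the word counts given in the text. The short check below reproduces them; the helper name `feature_reduction` is an assumption, and the percentages are rounded to the nearest integer.

```python
# (fasTNT words, TextNetTopics words) as quoted in the paragraph above.
comparisons = {
    "WOS-5736": (72, 212),
    "LitCovid": (108, 262),
    "MultiLabel": (113, 197),
    "arXiv": (90, 170),
}

def feature_reduction(fastnt_words, textnettopics_words):
    """Return (words saved, percentage reduction relative to TextNetTopics)."""
    saved = textnettopics_words - fastnt_words
    return saved, round(100 * saved / textnettopics_words)

reductions = {name: feature_reduction(*pair) for name, pair in comparisons.items()}
print(reductions)
# {'WOS-5736': (140, 66), 'LitCovid': (154, 59), 'MultiLabel': (84, 43), 'arXiv': (80, 47)}
```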

Figure 17 and Figure 18 summarize the percentages of feature reduction achieved by fasTNT compared to TextNetTopics across various feature-discarding percentages (v%). The figures highlight the ability of fasTNT to maintain F1-score performance comparable to the maximum score obtained by TextNetTopics while discarding a significant proportion of lower-importance features. For instance, for the arXiv dataset, setting v% to 10%, 20%, 30%, and 40% yields steadily increasing feature reductions of 12%, 18%, 31%, and 47%, respectively, while maintaining an F1-score of 92.8%.

The superior performance of fasTNT can be attributed to several key factors. First, the use of Random Forest-based feature importance values ensures that the most significant features are retained. By discarding the least informative features, fasTNT effectively reduces redundancy and noise, enhancing the relevance of the remaining features in LDA topics, which leads to better model training and performance. Second, the integration of fastText word embeddings helps capture semantic relationships between words more effectively than the traditional LDA approach used in TextNetTopics. This allows fasTNT to form more coherent and discriminative topic clusters. Lastly, fasTNT’s scoring method, which trains all topics simultaneously and considers all topic interactions, enhances the topic cluster selection process and minimizes computational overhead, making it more scalable and practical for large datasets. These enhancements collectively contribute to fasTNT’s superior performance and efficiency.
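To make the scoring idea concrete, the sketch below ranks topic clusters by aggregating the importance values of their words, as obtained from a single model trained on all words at once. It is a simplified illustration only: the importance values are hypothetical rather than produced by a real Random Forest, and the use of a plain sum as the aggregation function is an assumption, not a confirmed detail of fasTNT.

```python
def score_topic_clusters(clusters, importance, aggregate=sum):
    """Rank clusters by an aggregate of their words' importance values (highest first)."""
    scores = {name: aggregate(importance.get(w, 0.0) for w in words)
              for name, words in clusters.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical importance values, standing in for Random Forest feature importances.
importance = {"virus": 0.30, "mask": 0.25, "dose": 0.20,
              "trial": 0.18, "image": 0.05, "pixel": 0.02}
clusters = {"cluster_1": ["virus", "mask"],
            "cluster_2": ["image", "pixel"],
            "cluster_3": ["dose", "trial"]}

ranking = score_topic_clusters(clusters, importance)
print([name for name, _ in ranking])  # ['cluster_1', 'cluster_3', 'cluster_2']
```

Because all importance values come from one model fit, every cluster is scored in a single pass instead of training a separate classifier per topic, which is where the reduction in computational overhead comes from.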

According to Table 2 and Table 3, as we expand the feature set by accumulating top-ranked topic clusters, we observe an improvement in all performance measures. These results highlight the importance of the topic clusters discovered by fasTNT in differentiating classes for this classification task. However, in some instances, adding more topic clusters might introduce noise into the feature set, reducing model performance. In such cases, fasTNT effectively identifies the optimal number of topic clusters to include, thereby creating an optimal feature set that minimizes computational cost while maximizing performance.
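Selecting the best accumulation level from a results table can be sketched as follows, using the LitCovid rows of Table 2 (accumulated topic clusters, mean number of words, F-measure). The helper name and the tie-breaking rule, which prefers the smaller feature set when F-measures are equal, are assumptions made for illustration.

```python
# (accumulated topic clusters, mean words, F-measure %) for LitCovid, copied from Table 2.
litcovid_rows = [
    (1, 6, 76.0), (2, 12, 81.6), (3, 18, 84.4), (4, 24, 86.3),
    (6, 36, 87.8), (8, 48, 89.0), (10, 60, 89.5), (12, 72, 90.1),
    (14, 84, 90.5), (16, 96, 90.6), (18, 108, 91.0), (20, 120, 90.9),
]

def best_accumulation(rows):
    """Pick the row with the highest F-measure; ties go to the smaller feature set."""
    return max(rows, key=lambda row: (row[2], -row[1]))

best = best_accumulation(litcovid_rows)
print(best)  # (18, 108, 91.0)
```

Here the F-measure peaks at 18 accumulated topic clusters and drops slightly when the last two clusters are added, which illustrates how further clusters can introduce noise.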

6. Conclusions

In this study, we introduced fasTNT, an enhanced version of the TextNetTopics algorithm, designed to address its key limitations. TextNetTopics, while innovative in its topic modeling-based approach for training text classification models, suffers from several drawbacks, including the inclusion of redundant or irrelevant features, computationally intensive topic-scoring mechanisms, and a lack of explicit semantic modeling.

fasTNT effectively addresses these shortcomings by refining feature selection through Random Forest-based importance values, thereby discarding irrelevant features and enhancing the relevance of remaining features within LDA topics. Integration of fastText word embeddings enriches semantic relationships among words, leading to more cohesive and discriminative topic clusters compared to traditional LDA methods. Furthermore, fasTNT’s streamlined scoring method concurrently trains and evaluates topics, considering the interaction between topics’ features. This approach improves topic cluster selection, minimizes computational overhead, and enhances scalability, making fasTNT suitable for handling large-scale datasets with numerous topic clusters.

Our experimental results on four diverse datasets demonstrate that fasTNT outperforms the original TextNetTopics algorithm, achieving superior classification performance. These advancements underscore fasTNT as a robust solution for text classification tasks, promising improved model performance and practical application in real-world scenarios.

In future work, we aim to find the optimal feature-discarding percentage (v%) automatically using heuristic approaches, which will allow us to fine-tune fasTNT more efficiently and effectively. Moreover, we plan to experiment with different word embedding sources such as GloVe, Word2Vec, and others to further enhance the semantic representation of words within topic clusters. Finally, we will explore the impact of varying feature selection techniques within the F component and aggregation functions within the S component on the overall performance of fasTNT.

Author Contributions

Conceptualization, D.V.; Methodology, D.V.; Software, D.V.; Validation, D.V.; Formal analysis, D.V.; Investigation, D.V.; Writing—original draft, D.V.; Writing—review & editing, D.V. and R.J.; Visualization, D.V.; Supervision, R.J. and M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

Al-Quds University partially funded the Article Processing Charge (APC) for this research article, with the remaining amount covered by the author.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: Web of Science Dataset, available online: https://data.mendeley.com/datasets/9rw3vkcfy4/2; LitCovid Dataset, available online: https://drive.google.com/drive/folders/1mOmCy6mbBWXmfSzDyb6v4pG6pO-t_4At; arXiv Paper Abstracts, available online: https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts; Multi-Label Classification Dataset, available online: https://www.kaggle.com/datasets/shivanandmn/multilabel-classification-dataset.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gasparetto, A.; Marcuzzo, M.; Zangari, A.; Albarelli, A. A Survey on Text Classification Algorithms: From Text to Predictions. Information 2022, 13, 83.
2. Deng, X.; Li, Y.; Weng, J.; Zhang, J. Feature selection for text classification: A review. Multimed. Tools Appl. 2019, 78, 3797–3816.
3. Pintas, J.T.; Fernandes, L.A.F.; Garcia, A.C.B. Feature selection methods for text classification: A systematic literature review. Artif. Intell. Rev. 2021, 54, 6149–6200.
4. Venkatesh, B.; Anuradha, J. A Review of Feature Selection and Its Methods. Cybern. Inf. Technol. 2019, 19, 3–26.
5. Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature Selection: A Data Perspective. ACM Comput. Surv. 2018, 50, 1–45.
6. Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2023, 112, 102131.
7. Pfeifer, D.; Leidner, J.L. A Study on Topic Modeling for Feature Space Reduction in Text Classification. In Flexible Query Answering Systems; Cuzzocrea, A., Greco, S., Larsen, H.L., Saccà, D., Andreasen, T., Christiansen, H., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 403–412.
8. Yousef, M.; Qundus, J.A.; Peikert, S.; Paschke, A. TopicsRanksDC: Distance-Based Topic Ranking Applied on Two-Class Data. In Database and Expert Systems Applications; Kotsis, G., Tjoa, A.M., Khalil, I., Fischer, L., Moser, B., Mashkoor, A., Sametinger, J., Fensel, A., Martinez-Gil, J., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 11–21.
9. Yousef, M.; Voskergian, D. TextNetTopics: Text Classification Based Word Grouping as Topics and Topics’ Scoring. Front. Genet. 2022, 13, 893378.
10. Voskergian, D.; Bakir-Gungor, B.; Yousef, M. TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information. Front. Genet. 2023, 14, 1243874.
11. Yousef, M.; Allmer, J.; İnal, Y.; Gungor, B.B. G-S-M: A Comprehensive Framework for Integrative Feature Selection in Omics Data Analysis and Beyond. 2024. Available online: https://biorxiv.org/lookup/doi/10.1101/2024.03.30.585514 (accessed on 15 June 2024).
12. Kuzudisli, C.; Bakir-Gungor, B.; Bulut, N.; Qaqish, B.; Yousef, M. Review of feature selection approaches based on grouping of features. PeerJ 2023, 11, e15666.
13. Yousef, M.; Kumar, A.; Bakir-Gungor, B. Application of Biological Domain Knowledge Based Feature Selection on Gene Expression Data. Entropy 2020, 23, 2.
14. Al-Salemi, B.; Ab Aziz, M.J.; Noah, S.A. LDA-AdaBoost.MH: Accelerated AdaBoost.MH based on latent Dirichlet allocation for text categorization. J. Inf. Sci. 2015, 41, 27–40.
15. Alhaj, F.; Al-Haj, A.; Sharieh, A.; Jabri, R. Improving Arabic Cognitive Distortion Classification in Twitter using BERTopic. IJACSA 2022, 13, 854–860.
16. Glazkova, A. Using topic modeling to improve the quality of age-based text classification. In Proceedings of the CEUR Workshop Proceedings, Khabarovsk, Russia, 14–16 September 2021; pp. 92–97.
17. Rijcken, E.; Kaymak, U.; Scheepers, F.; Mosteiro, P.; Zervanou, K.; Spruit, M. Topic Modeling for Interpretable Text Classification From EHRs. Front. Big Data 2022, 5, 846930.
18. Zhang, Z.; Phan, X.-H.; Horiguchi, S. An Efficient Feature Selection Using Hidden Topic in Text Categorization. In Proceedings of the 22nd International Conference on Advanced Information Networking and Applications - Workshops (AINA Workshops 2008), Gino-wan, Japan, 25–28 March 2008; pp. 1223–1228.
19. Tasci, S.; Gungor, T. LDA-based keyword selection in text categorization. In Proceedings of the 2009 24th International Symposium on Computer and Information Sciences, Guzelyurt, Cyprus, 14–16 September 2009; pp. 230–235.
20. Al-Salemi, B.; Ayob, M.; Noah, S.A.M.; Ab Aziz, M.J. Feature selection based on supervised topic modeling for boosting-based multi-label text categorization. In Proceedings of the 2017 6th International Conference on Electrical Engineering and Informatics (ICEEI), Langkawi, Malaysia, 25–27 November 2017; pp. 1–6.
21. Mohammed, S.H.; Al-augby, S. Lsa & lda topic modeling classification: Comparison study on e-books. Indones. J. Electr. Eng. Comput. Sci. 2020, 19, 353–362.
22. Mifrah, S.; Benlahmar, E.H. Topic modeling coherence: A comparative study between LDA and NMF models using COVID’19 corpus. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 5756–5761.
23. Yousef, M.; Abdallah, L.; Allmer, J. maTE: Discovering expressed interactions between microRNAs and their targets. Bioinformatics 2019, 35, 4020–4028.
24. Yousef, M.; Ozdemir, F.; Jaber, A.; Allmer, J.; Bakir-Gungor, B. PriPath: Identifying dysregulated pathways from differential gene expression via grouping, scoring, and modeling with an embedded feature selection approach. BMC Bioinform. 2023, 24, 60.
25. Qumsiyeh, E.; Showe, L.; Yousef, M. GediNET for discovering gene associations across diseases using knowledge based machine learning approach. Sci. Rep. 2022, 12, 19955.
26. Yousef, M.; Goy, G.; Mitra, R.; Eischen, C.M.; Jabeer, A.; Bakir-Gungor, B. miRcorrNet: Machine learning-based integration of miRNA and mRNA expression profiles, combined with feature grouping and ranking. PeerJ 2021, 9, e11458.
27. Unlu Yazici, M.; Marron, J.S.; Bakir-Gungor, B.; Zou, F.; Yousef, M. Invention of 3Mint for feature grouping and scoring in multi-omics. Front. Genet. 2023, 14, 1093326.
28. Ersoz, N.S.; Bakir-Gungor, B.; Yousef, M. GeNetOntology: Identifying Affected Gene Ontology Groups via Grouping, Scoring and Modelling from Gene Expression Data utilizing Biological Knowledge Based Machine Learning. Front. Genet. 2023, 14, 1139082.
29. Bakir-Gungor, B.; Temiz, M.; Jabeer, A.; Wu, D.; Yousef, M. microBiomeGSM: The identification of taxonomic biomarkers from metagenomic data using grouping, scoring and modeling (G-S-M) approach. Front. Microbiol. 2023, 14, 1264941.
30. Qumsiyeh, E.; Salah, Z.; Yousef, M. miRGediNET: A comprehensive examination of common genes in miRNA-Target interactions and disease associations: Insights from a grouping-scoring-modeling approach. Heliyon 2023, 9, e22666.
31. Jabeer, A.; Temiz, M.; Bakir-Gungor, B.; Yousef, M. miRdisNET: Discovering microRNA biomarkers that are associated with diseases utilizing biological knowledge-based machine learning. Front. Genet. 2023, 13, 1076554.
32. Yousef, M.; Goy, G.; Bakir-Gungor, B. miRModuleNet: Detecting miRNA-mRNA Regulatory Modules. Front. Genet. 2022, 13, 767455.
33. Yousef, M.; Ülgen, E.; Uğur Sezerman, O. CogNet: Classification of gene expression data based on ranked active-subnetwork-oriented KEGG pathway enrichment analysis. PeerJ Comput. Sci. 2021, 7, e336.
34. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
35. Zhang, C.; Li, Y.; Yu, Z.; Tian, F. Feature selection of power system transient stability assessment based on random forest and recursive feature elimination. In Proceedings of the 2016 IEEE PES Asia-Pacific Power and Energy Engineering Conference (APPEEC), Xi’an, China, 25–28 October 2016; pp. 1264–1268.
36. Han, H.; Guo, X.; Yu, H. Variable selection using Mean Decrease Accuracy and Mean Decrease Gini based on Random Forest. In Proceedings of the 2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 26–28 August 2016; pp. 219–224.
37. Strobl, C.; Boulesteix, A.-L.; Kneib, T.; Augustin, T.; Zeileis, A. Conditional variable importance for random forests. BMC Bioinform. 2008, 9, 307.
38. Selva Birunda, S.; Kanniga Devi, R. A Review on Word Embedding Techniques for Text Classification. In Innovative Data Communication Technologies and Application; Raj, J.S., Iliyasu, A.M., Bestak, R., Baig, Z.A., Eds.; Springer: Singapore, 2021; pp. 267–281.
39. Bhatia, K.; Mishra, S.; Sharma, A. Clustering Glossary Terms Extracted from Large-Sized Software Requirements using FastText. In Proceedings of the 13th Innovations in Software Engineering Conference (Formerly Known as India Software Engineering Conference), Jabalpur, India, 27–29 February 2020; pp. 1–11.
40. Same-Size k-Means - Adm. Available online: https://hub.knime.com/adm/spaces/Public/Components/Same-size%20k-Means~H_koFGbfWlgR5eXS/current-state (accessed on 15 June 2024).
41. Kowsari, K.; Brown, D.E.; Heidarysafa, M.; Meimandi, K.J.; Gerber, M.S.; Barnes, L.E. HDLTex: Hierarchical Deep Learning for Text Classification. arXiv 2017.
42. LitCovid Dataset. Available online: https://drive.google.com/drive/folders/1mOmCy6mbBWXmfSzDyb6v4pG6pO-t_4At (accessed on 15 June 2024).
43. arXiv Paper Abstracts. Available online: https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts (accessed on 15 June 2024).
44. Multi-Label Classification Dataset. Available online: https://www.kaggle.com/datasets/shivanandmn/multilabel-classification-dataset (accessed on 15 June 2024).
45. Daniel2vosk: Daniel2vosk/fastnt. 2024. Available online: https://github.com/Daniel2vosk/fastnt (accessed on 15 June 2024).
46. Daniel2vosk/fastnt. Available online: https://hub.knime.com/search?q=fastnt (accessed on 15 June 2024).
47. Newman, D.; Asuncion, A.; Smyth, P.; Welling, M. Distributed algorithms for topic models. J. Mach. Learn. Res. 2009, 10, 1801–1828.
48. Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning Word Vectors for 157 Languages. arXiv 2018, arXiv:1802.06893.

Figure 1. The general framework of the fasTNT approach, illustrating the working mechanisms of the T, F, and G components (Part 1).

Figure 2. The general framework of the fasTNT approach, illustrating the working mechanisms of the S and M components (Part 2). The bullet is intended to indicate the scoring of the remaining topic clusters.

Figure 3. The working mechanism of the T component.

Figure 4. The working mechanism of the F component.

Figure 5. The working mechanism of the G component.

Figure 6. The working mechanism of the S component (feature importance-based topic scoring).

Figure 7. The working mechanism of the M component (first iteration). The red border covers the topic clusters and their corresponding word lists utilized for the training and testing processes in the specified iteration.

Figure 8. The working mechanism of the M component (second iteration). The red border covers the topic clusters and their corresponding word lists utilized for the training and testing processes in the specified iteration.

Figure 9. F1-score performance of fasTNT across various feature-discarding percentages (v%) when utilizing the WOS-5736 dataset. The circles on the line represent the number of accumulated topic clusters.

Figure 10. F1-score performance of fasTNT across various feature-discarding percentages (v%) when utilizing the LitCovid dataset. The circles on the line represent the number of accumulated topic clusters.

Figure 11. F1-score performance of fasTNT across various feature-discarding percentages (v%) when utilizing the MultiLabel dataset. The circles on the line represent the number of accumulated topic clusters.

Figure 12. F1-score performance of fasTNT across various feature-discarding percentages (v%) when utilizing the arXiv dataset. The circles on the line represent the number of accumulated topic clusters.

Figure 13. F1-score performance comparison of fasTNT and TextNetTopics when utilizing the WOS-5736 dataset. The maximum and the minimum F1-scores attained by each algorithm are highlighted. The circles on the line represent the number of accumulated topic clusters.

Figure 14. F1-score performance comparison of fasTNT and TextNetTopics when utilizing the LitCovid dataset. The maximum and the minimum F1-scores attained by each algorithm are highlighted. The circles on the line represent the number of accumulated topic clusters.

Figure 15. F1-score performance comparison of fasTNT and TextNetTopics when utilizing the MultiLabel dataset. The maximum and the minimum F1-scores attained by each algorithm are highlighted. The circles on the line represent the number of accumulated topic clusters.

Figure 16. F1-score performance comparison of fasTNT and TextNetTopics when utilizing the arXiv dataset. The maximum and the minimum F1-scores attained by each algorithm are highlighted. The circles on the line represent the number of accumulated topic clusters.

Figure 17. Percentage of feature reduction achieved by fasTNT over TextNetTopics for the WOS-5736 and LitCovid datasets when reaching specific F1-score performance.

Figure 18. Percentage of feature reduction achieved by fasTNT over TextNetTopics for the MultiLabel and arXiv datasets when reaching specific F1-score performance.

Table 1. Performance metrics of TextNetTopics utilizing various datasets.

	Accumulated Topics	Words	Accuracy	Recall	Specificity	F-Measure	AUC	Precision	Cohen’s Kappa
WOS-5736 dataset	1	19.3	86.4	81.3	91.6	85.6	90.3	90.5	72.9
	2	28	86.9	82.9	90.8	86.3	91.3	89.9	73.7
	3	42.5	90.1	88.2	91.9	89.7	94.6	91.4	80.1
	4	56.2	93.2	93.7	92.7	93.2	97.3	92.7	86.3
	6	79	94.1	94.9	93.4	94.1	98.3	93.4	88.2
	8	93.4	94.3	95.0	93.5	94.3	98.6	93.5	88.5
	10	114.2	94.4	95.5	93.4	94.5	98.7	93.5	88.9
	12	137	95.2	96.2	94.2	95.2	99.0	94.2	90.4
	14	162	96.1	96.9	95.4	96.1	99.3	95.4	92.3
	16	182	96.3	97.1	95.5	96.3	99.4	95.5	92.6
	18	202.4	96.7	97.7	95.8	96.7	99.6	95.8	93.5
	20	219	96.7	97.8	95.6	96.7	99.5	95.6	93.4
LitCovid dataset	1	20	84.1	85.3	82.9	84.2	91.4	83.2	68.2
	2	35	86.4	87.8	85.1	86.6	93.2	85.5	72.9
	3	48	86.9	88.3	85.6	87.1	93.9	85.9	73.8
	4	62.2	87.9	88.6	87.1	87.9	94.5	87.3	75.8
	6	84.6	89.3	89.9	88.8	89.4	95.5	88.8	78.7
	8	116	89.9	90.8	89.0	90.0	95.9	89.2	79.8
	10	141.3	90.0	91.1	88.9	90.1	96.1	89.1	80.0
	12	173	90.0	91.0	89.1	90.1	96.2	89.2	80.0
	14	202.2	90.8	91.8	89.7	90.7	96.5	89.9	81.5
	16	231.6	90.7	91.3	90.1	90.8	96.6	90.2	81.5
	18	262	90.9	91.6	90.1	90.9	96.6	90.2	81.7
	20	299	91.0	92.1	89.9	91.1	96.9	90.1	82.0
MultiLabel dataset	1	20	74.5	75.0	73.9	74.6	80.4	74.3	48.9
	2	35	79.0	83.2	74.7	77.8	85.4	76.7	57.9
	3	45.8	79.4	83.6	75.1	80.2	86.3	77.1	58.7
	4	55.3	79.9	84.8	74.9	80.8	87.1	77.2	59.7
	6	72.2	80.5	86.0	75.0	81.5	88.3	77.5	61.1
	8	85.1	81.2	86.9	75.5	82.2	88.9	78.0	62.3
	10	105.5	82.5	88.4	76.7	83.5	90.2	79.2	65.1
	12	124.4	83.3	88.6	78.0	84.2	90.8	80.2	66.7
	14	142.7	83.8	88.8	78.7	84.5	91.4	80.7	67.5
	16	158.7	83.9	88.8	79.1	84.7	91.5	81.0	67.9
	18	179.7	84.2	89.1	79.2	84.9	91.5	81.1	68.3
	20	197	84.5	89.9	79.1	85.1	91.8	81.2	69.0
arXiv dataset	1	20	85.9	83.4	88.3	85.2	91.7	87.0	71.8
	2	32	88.0	86.1	89.7	87.4	94.6	88.8	75.9
	3	42.2	88.1	86.5	89.5	87.6	94.1	88.6	76.1
	4	54.4	89.1	88.4	89.7	88.7	95.3	89.0	78.1
	6	76	90.0	89.9	90.1	89.7	96.3	89.5	79.9
	8	94.7	91.8	92.3	91.3	91.6	97.3	90.9	83.6
	10	107.2	92.4	93.3	91.5	92.3	97.5	91.2	84.8
	12	116	92.5	93.1	91.9	92.3	97.7	91.6	85.0
	14	132	93.0	93.3	92.6	92.8	97.9	92.3	85.9
	16	150.3	93.0	93.1	92.9	92.8	97.8	92.5	85.9
	18	163.7	92.8	93.0	92.6	92.6	97.9	92.2	85.5
	20	175	92.6	92.9	92.2	92.4	97.9	91.9	85.1

Table 2. Performance metrics of fasTNT with 60% feature discarding utilizing WOS-5736 and LitCovid datasets.

	Accumulated Topic Clusters	Words	Accuracy	Recall	Specificity	F-Measure	AUC	Precision	Cohen’s Kappa
WOS-5736 dataset	1	4.9	79.3	61.9	96.3	74.6	80.1	94.5	58.4
	2	9.8	87.3	81.2	93.4	86.3	89.6	92.5	74.6
	3	14.5	90.5	87.5	93.5	90.1	93.2	93.0	81.0
	4	19.2	91.7	89.3	94.0	91.4	94.2	93.7	83.4
	6	28.7	94.2	93.8	94.6	94.2	97.2	94.5	88.4
	8	37.5	94.9	95.3	94.6	95.0	98.3	94.5	89.8
	10	46.2	95.7	96.1	95.4	95.7	98.8	95.4	91.5
	12	54.4	96.0	96.6	95.5	96.0	99.0	95.5	92.1
	14	63.1	96.6	97.1	96.1	96.6	99.2	96.1	93.2
	16	71.6	96.8	97.3	96.2	96.8	99.4	96.2	93.6
	18	80	97.0	97.8	96.3	97.0	99.5	96.4	94.1
	20	88	96.9	97.5	96.3	96.9	99.5	96.3	93.9
LitCovid dataset	1	6	76.3	75.2	77.4	76.0	80.1	77.0	52.7
	2	12	81.6	81.9	81.3	81.6	86.8	81.4	63.2
	3	18	84.3	85.5	83.0	84.4	91.0	83.4	68.5
	4	24	86.2	87.1	85.3	86.3	92.5	85.5	72.4
	6	36	87.7	88.9	86.6	87.8	94.1	86.8	75.4
	8	48	88.9	89.9	87.9	89.0	95.1	88.1	77.8
	10	60	89.5	89.9	89.1	89.5	95.8	89.1	79.0
	12	72	90.1	90.3	89.9	90.1	96.2	89.9	80.1
	14	84	90.4	90.9	89.9	90.5	96.5	90.0	80.9
	16	96	90.5	91.1	90.0	90.6	96.5	90.1	81.1
	18	108	90.9	91.6	90.3	91.0	96.7	90.4	81.9
	20	120	90.8	91.7	89.9	90.9	96.9	90.1	81.6

Table 3. Performance metrics of fasTNT with 40% feature discarding utilizing MultiLabel and arXiv datasets.

	Accumulated Topic Clusters	Words	Accuracy	Recall	Specificity	F-Measure	AUC	Precision	Cohen’s Kappa
MultiLabel dataset	1	6	68.9	62.0	75.8	66.5	71.3	72.0	37.8
	2	12	73.6	73.0	74.2	73.4	77.9	73.9	47.3
	3	18	76.3	77.9	74.8	76.7	81.1	75.6	52.7
	4	24	77.8	82.1	73.6	78.7	83.7	75.7	55.7
	6	35.9	79.9	86.2	73.6	81.1	86.0	76.6	59.8
	8	47.9	81.1	87.5	74.7	82.2	88.2	77.6	62.2
	10	59.8	81.9	88.3	75.6	83.0	89.2	78.3	63.9
	12	71.6	82.9	89.6	76.1	83.9	89.9	79.0	65.7
	14	83.5	83.2	89.3	77.0	84.1	90.5	79.6	66.3
	16	95.5	83.7	90.0	77.3	84.6	90.8	79.9	67.3
	18	107.4	83.8	89.9	77.7	84.9	90.8	80.2	67.6
	20	119	84.5	90.1	79.0	85.3	91.9	81.1	69.1
arXiv dataset	1	5.5	68.7	55.8	80.9	63.3	71.4	73.7	36.9
	2	11.2	79.6	75.3	83.6	78.1	85.6	81.5	59.0
	3	16.5	83.1	80.5	85.6	82.2	89.7	84.1	66.2
	4	21.9	86.0	84.9	87.1	85.5	92.7	86.1	72.0
	6	32.9	89.6	89.3	89.9	89.3	95.5	89.3	79.2
	8	43.5	90.8	90.8	90.8	90.5	96.4	90.3	81.6
	10	54.2	91.5	91.6	91.4	91.3	96.9	91.0	83.0
	12	64.5	92.2	92.3	92.1	92.0	97.2	91.7	84.4
	14	74.7	92.3	92.8	91.8	92.1	97.5	91.5	84.6
	16	85	92.7	93.1	92.2	92.5	97.7	91.9	85.3
	18	95	92.6	93.1	92.1	92.4	97.8	91.7	85.2
	20	105	92.8	93.5	92.0	92.6	97.8	91.7	85.5

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
