Article

A Study on Performance Enhancement by Integrating Neural Topic Attention with Transformer-Based Language Model

Department of Applied Statistics, Gachon University, 1342 Seongnam-daero, Sujung-gu, Seongnam 13120, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7898; https://doi.org/10.3390/app14177898
Submission received: 23 July 2024 / Revised: 27 August 2024 / Accepted: 3 September 2024 / Published: 5 September 2024
(This article belongs to the Special Issue Techniques and Applications of Natural Language Processing)

Abstract

As an extension of the transformer architecture, the BERT model has introduced a new paradigm for natural language processing, achieving impressive results in various downstream tasks. However, high-performance BERT-based models—such as ELECTRA, ALBERT, and RoBERTa—suffer from limitations such as poor continuous learning capability and insufficient understanding of domain-specific documents. To address these issues, we propose the use of an attention mechanism to combine BERT-based models with neural topic models. Unlike traditional stochastic topic modeling, neural topic modeling employs artificial neural networks to learn topic representations. Furthermore, neural topic models can be integrated with other neural models and trained to identify latent variables in documents, thereby enabling BERT-based models to sufficiently comprehend the contexts of specific fields. We conducted experiments on three datasets—the Movie Review Dataset (MRD), 20Newsgroups, and YELP—to evaluate our model's performance. Compared to the vanilla models, the proposed approach achieved an accuracy improvement of 1–2% for the ALBERT model in multi-class classification tasks across all three datasets, while the ELECTRA model showed an accuracy improvement of less than 1%.

1. Introduction

In the field of natural language processing (NLP), significant research has been conducted to enhance model performance by introducing and modifying deep learning architectures. To address the long-term dependency challenge of recurrent neural networks (RNNs), models such as long short-term memory (LSTM) [1] and the gated recurrent unit (GRU) [2] have introduced gate structures. The transformer [3] combines an encoder-decoder structure with an attention mechanism [4], eliminating the reliance on an RNN architecture. The BERT architecture [5], built on a transformer encoder, has demonstrated outstanding performance, surpassing existing models in various downstream tasks. Subsequent models based on BERT—such as ALBERT [6], ELECTRA [7], and RoBERTa [8]—exhibited further improvements in efficiency and performance.
Research in NLP continues to advance rapidly, with notable models such as GPT-4 [9], which employs a transformer-based architecture similar to BERT but operates in an auto-regressive manner for text generation. Additionally, DeBERTa-V3 [10] addresses the limitations of both BERT and RoBERTa to achieve even higher performance. Meanwhile, LLaMA [11] showcases a more parameter-efficient design than other large language models while still delivering superior performance.
Nonetheless, because pretraining these models requires extensive computation [12], continuous learning is often difficult, and the trained models may fail to accurately interpret domain-specific documents [13]. One plausible approach to addressing these issues is the use of topic models. Topic modeling is an unsupervised learning method that discovers latent topics in documents composed of words, yielding a deeper understanding of core information contained within the documents. Historically, probabilistic topic models, such as latent Dirichlet allocation (LDA) [14], have been widely used for this task. However, topic models based on Bayesian probabilities have three main drawbacks [15]. First, as model complexity increases, so does the complexity of inference, making it difficult to automate the inference process. Second, it is challenging to efficiently process large text sets because parallel computing technologies such as GPUs cannot be utilized. Finally, these models cannot be trained in conjunction with artificial neural networks, including deep neural networks.
The neural topic model (NTM) has emerged to address these limitations. Unlike conventional topic modeling methods such as LDA, NTM uses artificial neural networks to discover latent topics in documents without relying on probabilistic approaches. This allows various frameworks to be applied to configure neural topic modeling.
In this study, we propose a method to address the challenges of continuous learning and the limitations of pre-trained models, which often struggle to accurately comprehend domain-specific documents. Our approach simultaneously utilizes pre-trained models and NTMs to create a more effective solution. Instead of simply loading pre-trained models and applying them directly to the task, we incorporate a brief additional training step after loading the pre-trained models. This step occurs before executing the main task and is designed to refine the language model’s capabilities through a straightforward fine-tuning process, eliminating the need for extensive pre-training. Our experimental results, obtained from three benchmark datasets—MRD, 20Newsgroups, and YELP—demonstrate the enhanced performance of the proposed approach, validating our methodology.
The structure of the paper is as follows: Section 2 explores traditional topic modeling and subsequent developments in neural topic modeling, providing a foundational understanding of the field. Section 3 focuses on the fundamental BERT-based models and outlines the architecture of our proposed model, highlighting its unique features and advantages. Section 4 details the datasets and hyperparameters utilized in our experiments, presents the experimental results, and includes additional analyses based on variations in specific hyperparameters. Finally, Section 5 offers a comprehensive summary of the experiments and discusses potential avenues for future research, emphasizing the significance of our findings and their implications for ongoing developments in this area.

2. Related Works

2.1. Traditional Topic Modeling

LDA has become a commonly used topic modeling method, and various extensions of LDA have been developed to solve specific tasks efficiently. For example, supervised LDA (SLDA) [16] extends LDA by incorporating various response types assigned to each document. SLDA jointly models documents and responses to find latent topics that best predict the response variables for future unlabeled documents.
Another extension, labeled LDA (L-LDA) [17], learns from documents labeled with multiple tags. This model can perform appropriate word tagging for untagged documents, showing improved expressiveness compared with LDA. Both SLDA and L-LDA represent transitions from unsupervised to supervised learning, enhancing the applicability of topic modeling.
Interdependent LDA (ILDA) [18] models the conditional interdependency between latent aspects and ratings, enabling the identification of aspects and prediction of ratings from online product reviews without human supervision. This approach addresses the need to accurately interpret customer feedback.
Frequency-LDA (FLDA) and dependency–frequency-LDA (DFLDA) [19] extend LDA to address multi-label classification problems. FLDA uses the supervision of label frequencies to control document-label Dirichlet priors, whereas DFLDA considers dependencies between labels by introducing a label-shared topic level. Both models improve classification performance by incorporating label-related information.
TopicSpam [20] leverages generative LDA-based topic modeling to detect fake reviews, outperforming existing baselines. Similarly, one study [21] conducted multilingual topic modeling of Facebook data to track COVID-19 trends by country, demonstrating the versatility of LDA in different contexts.
Finally, time-decay LDA (T-LDA) [22] represents a novel collaborative filtering (CF) algorithm that incorporates time attributes to improve performance. This model addresses the limitations of traditional CF algorithms, which do not consider the temporal aspects of user behavior.

2.2. Neural Topic Modeling

As deep learning research has accelerated with the growth of computing resources, artificial neural networks have been employed for topic modeling, leading to the emergence of the NTM. For instance, [23,24] constructed neural topic models based on autoregressive models. Additionally, [25,26] proposed neural topic models based on generative adversarial networks (GANs), and [27,28] developed neural topic models based on the variational autoencoder (VAE).
Some methods achieve enhanced performance and efficiency by combining topic and language models. The topic attention model [29] is a supervised neural topic model that integrates an RNN through an attention mechanism. The model was subsequently applied to regression and classification tasks to demonstrate its versatility.
tBERT [30] combines topic modeling with BERT for semantic similarity prediction, using LDA and GSDMM [31] as topic models. tBERT has demonstrated a performance improvement compared with BERT alone, highlighting the benefit of integrating topic information. TopicBERT [32] combines BERT with the neural variational document model (NVDM) [33], wherein a neural topic model is used to optimize the computational cost of document classification during the fine-tuning process.
In [34], contextualized representations were combined with NTM to significantly increase topic coherence and discover more meaningful topics compared with those produced by traditional bag-of-words topic models. This approach demonstrated the advantage of using contextual information in topic modeling.
In this study, we achieved improved performance using BERT-based models, such as ALBERT and ELECTRA, as alternatives to BERT. Specifically, we combined a neural topic model with BERT-based models using attention mechanisms and evaluated the hybrid models’ performance against existing models that do not incorporate NTM. Our results demonstrate that integrating neural topic models with BERT-based architectures can enhance performance in various tasks.

3. Proposed Model

We introduce a new model that integrates neural topic models with BERT-based networks using an attention mechanism. The baseline model served as a benchmark when evaluating the performance of the proposed model configurations.

3.1. ALBERT (ELECTRA) Baseline Model

ALBERT and ELECTRA represent extensions of the BERT model. ALBERT aims to maximize the parameter efficiency of BERT through techniques such as factorized embedding parameterization and cross-layer parameter sharing, whereas ELECTRA enhances performance by modifying the pre-training objective via replaced token detection. Whereas ALBERT focuses on minimizing the model's size and computational burden, ELECTRA optimizes performance by training the model to distinguish between real and replaced tokens during pre-training. We chose to study these two models because they represent advancements over the original BERT, aiming to further enhance performance and facilitate understanding of documents in specific domains. Throughout our experiments, the two models serve as baselines for comparison. The structures of both models are shown in Figure 1.
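For illustration, both baselines can be instantiated through the Hugging Face Transformers library used in our experiments; the specific checkpoint names below are assumptions, since the exact pre-trained weights are not restated here.

```python
# Illustrative only: loading the two baseline encoders with Hugging Face
# Transformers. The checkpoint names are assumptions, not stated in the paper.
import torch
from transformers import AutoModel, AutoTokenizer

albert_tok = AutoTokenizer.from_pretrained("albert-base-v2")
albert = AutoModel.from_pretrained("albert-base-v2")

electra_tok = AutoTokenizer.from_pretrained("google/electra-base-discriminator")
electra = AutoModel.from_pretrained("google/electra-base-discriminator")

inputs = albert_tok("A short example document.", return_tensors="pt")
with torch.no_grad():
    encoder_states = albert(**inputs)[0]  # final encoder representations
print(encoder_states.shape)  # (1, seq_len, hidden_size)
```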

3.2. ALBERT (ELECTRA) Neural Topic Model

Our neural topic model is based on the VAE [27,29]. We first examine the training method for the proposed neural topic model. $D$, which expresses the input document $d$ in the form of a bag of words, passes through the encoder $f(D)$. The resulting vector subsequently passes through the hidden layers $h_1$ and $h_2$ to obtain the mean $\mu$ and variance vector $\sigma^2$. The encoder $f$ and hidden layers $h_1$ and $h_2$ are fully connected neural networks.

$$\mu = h_1(f(D)), \qquad \sigma^2 = h_2(f(D))$$
The obtained mean, variance vector, and noise $\epsilon$ generate a latent vector $h$ with the specified number of dimensions. A softmax activation function is applied to the generated vector to obtain $R$, which carries the topic ratio information of the document.

$$h = \mu + \sigma^2 + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$$
$$R = \operatorname{softmax}(W_h h + b_h)$$
After passing $R$ through a decoder composed of a hidden layer, the data are reconstructed to be as similar as possible to the input value $d$. The topic vectors $t_i\ (i = 1, 2, \ldots, k)$ and the topic–word distribution $A$ are trained with the model after their initial values are configured through Xavier initialization [35].
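A minimal PyTorch sketch of this VAE-based NTM is given below. The layer widths are illustrative assumptions, and the sketch treats $h_2$ as a log-variance head with the standard reparameterization, a common implementation reading of the equations above rather than the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTopicModel(nn.Module):
    """VAE-style NTM sketch following the equations above; layer widths are
    illustrative assumptions, and h2 is treated as a log-variance head with
    the standard reparameterization (an implementation choice)."""

    def __init__(self, vocab_size=2000, hidden=500, n_topics=50, emb=100):
        super().__init__()
        self.f = nn.Linear(vocab_size, hidden)    # encoder f(D)
        self.h1 = nn.Linear(hidden, n_topics)     # mean head: mu = h1(f(D))
        self.h2 = nn.Linear(hidden, n_topics)     # (log-)variance head
        self.W_h = nn.Linear(n_topics, n_topics)  # R = softmax(W_h h + b_h)
        # Topic vectors t_i and topic-word distribution A, Xavier-initialized.
        self.t = nn.Parameter(nn.init.xavier_uniform_(torch.empty(n_topics, emb)))
        self.A = nn.Parameter(nn.init.xavier_uniform_(torch.empty(n_topics, vocab_size)))

    def forward(self, bow):                        # bow: (batch, vocab_size)
        enc = torch.relu(self.f(bow))
        mu, log_var = self.h1(enc), self.h2(enc)
        eps = torch.randn_like(mu)                 # eps ~ N(0, 1)
        h = mu + log_var.exp() * eps               # latent vector h
        R = F.softmax(self.W_h(h), dim=-1)         # topic proportions R
        recon = F.softmax(R @ self.A, dim=-1)      # decoder: reconstruct d
        return R, recon, mu, log_var
```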
Next, we examine how the NTM and the BERT-based models are hybridized using the attention mechanism. Final encoder representations $(C, T_1, T_2, \ldots, T_{SEP})$ of the input tokens $x_n$ are obtained through a pre-trained BERT-based model. After the final encoder representation is passed through the hidden layer $o$ to equalize its size with that of the topic vectors, it is combined with the topic vectors by applying an attention mechanism with the topic vector $t$ as the query. The equations below relate the $j$-th topical attention weight, the word-step representation $b_j\ (j \in \{1, 2, \ldots, m\})$, and the topic vectors $t_i\ (i \in \{1, 2, \ldots, k\})$:

$$b_j = \tanh\big(o(C, T_1, T_2, \ldots, T_{SEP})\big)$$
$$c_{j,i} = \operatorname{softmax}\big(b_j^{T} t_i\big)$$
$$W_j = \operatorname{concat}(c_{j,i})$$
The final topical attention weight is computed by multiplying by the topic ratio $R$. Additionally, we apply a threshold value $\alpha\ (0 < \alpha < 1)$ to ensure that each topical attention learns independently of the other topics.

$$Y_j = W_j (R_i - \alpha) = \sum_{i=1}^{k} c_{j,i} (R_i - \alpha)$$
The final representation $G$ of the document is obtained by combining the topical attention weights with the representation vector of each token.

$$G = \sum_{j=1}^{m} v_j = \sum_{j=1}^{m} Y_j b_j$$
Finally, $G$ is applied to the desired task using an activation function $g$ to obtain the prediction $F$.

$$F = g(W_F G + b_F)$$
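The attention path can be sketched as follows; the tensor shapes and the interpretation of the threshold term are our assumptions, so this illustrates the mechanism rather than reproducing the exact implementation.

```python
import torch
import torch.nn.functional as F

def topical_attention(enc_states, t, R, o, alpha=0.1):
    """Sketch of the topical attention combination.

    enc_states: (batch, m, hidden)     final encoder representations (C, T_1, ..., T_SEP)
    t:          (k, emb)               topic vectors (queries)
    R:          (batch, k)             topic proportions from the NTM
    o:          nn.Linear(hidden, emb) projection equalizing the sizes
    """
    b = torch.tanh(o(enc_states))                # b_j = tanh(o(C, T_1, ..., T_SEP))
    c = F.softmax(b @ t.T, dim=-1)               # c_{j,i}: attention over k topics
    Y = (c * (R.unsqueeze(1) - alpha)).sum(-1)   # Y_j = sum_i c_{j,i}(R_i - alpha)
    G = (Y.unsqueeze(-1) * b).sum(dim=1)         # G = sum_j Y_j b_j
    return G                                     # (batch, emb) document representation
```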
A detailed structure of the model is shown in Figure 2. The two variants are designated as the ALBERT neural topic model (ANTM) and ELECTRA neural topic model (ENTM).

3.3. Model Loss

The design of our model loss was inspired by the topic attention model [29] and TopicBERT [32]. Three loss functions were employed to train the proposed model. The first is the topic KL divergence, which computes the difference between the actual topic distribution $p$ and the learned topic distribution $q$.

$$\operatorname{KL}\big(q(R \mid d)\,\|\,p(R)\big) = -\frac{1}{2}\big(1 + \log \sigma^2 - \mu^2 - \sigma^2\big)$$
Second, we implemented a data likelihood function that calculates the difference between the actual and reconstructed data. In this equation, $z_n$ represents the latent topic variable associated with the input token $x_n$.

$$\log p_\Theta(d \mid \hat{R}) = \sum_{n=1}^{m} \log p(x_n \mid \hat{R}) = \sum_{n=1}^{m} \log \sum_{z_n} p(x_n \mid A_{z_n})\, p(z_n \mid \hat{R}) = \sum_{n=1}^{m} \log\big(A_{x_n}^{T} \hat{R}\big)$$
Finally, we utilized cross-entropy to calculate the difference between the model's predicted values $P_{ic}$ $(i \in \{1, 2, \ldots, n\},\ c \in \{1, 2, \ldots, C\})$ and the true values $L_{ic}$. Here, $C$ represents the number of labels and $n$ the number of data points.

$$\text{Cross entropy} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} L_{ic} \log(P_{ic})$$
In summary, our model loss combines topic KL divergence to align topic distributions, data likelihood to ensure the accuracy of data reconstruction, and cross-entropy to optimize classification performance.
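Under the assumption that the three terms are summed with equal weights (the weighting scheme is not specified above), the combined objective can be sketched as:

```python
import torch
import torch.nn.functional as F

def model_loss(logits, labels, bow, recon, mu, log_var):
    """Combined loss sketch: topic KL + data likelihood + cross-entropy.
    Equal weighting of the three terms is an assumption."""
    # Topic KL divergence between q(R|d) and the prior p(R).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=-1).mean()
    # Data likelihood: negative log-likelihood of the reconstructed bag of words.
    nll = -(bow * torch.log(recon + 1e-10)).sum(dim=-1).mean()
    # Cross-entropy between predictions P_ic and labels L_ic.
    ce = F.cross_entropy(logits, labels)
    return kl + nll + ce
```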

4. Experiments

4.1. Datasets

To evaluate the proposed model, we used three datasets: MRD, 20Newsgroups, and YELP. This selection was made to ensure a diverse range of data sizes and label counts, allowing for a comprehensive assessment of model performance.
The MRD dataset consists of approximately 5000 movie reviews categorized into three sentiment classes: positive, negative, and neutral [36]. This dataset is commonly used for sentiment analysis and helps to evaluate the ability of a model to classify text based on sentiment. Previous research, such as [37], has compared the performance of proposed CNN models with traditional models using the MRD dataset, highlighting its relevance for benchmarking.
20Newsgroups is a collection of nearly 20,000 news documents categorized into 20 groups [38]. This dataset is widely used for text classification and topic modeling tasks, providing a diverse range of topics for evaluation. Each document is labeled with one of 20 categories, making the set suitable for supervised learning. Notably, [39] conducted research aimed at enhancing text classification performance using the 20Newsgroups dataset, further establishing its significance in the field.
The YELP dataset comprises approximately 6,900,000 restaurant reviews with ratings ranging from one to five stars [40]. Owing to the large size of the dataset, we randomly selected 36,000 reviews for analysis. This subset was sufficiently large to provide a representative sample for model evaluation while being computationally manageable. Research by [41] has demonstrated that the proposed neural network model outperforms state-of-the-art algorithms when evaluated on the YELP dataset.
By incorporating these three datasets, we aim to provide a robust evaluation of our proposed model across different contexts and classification challenges.
A summary of the data used in our experiments is presented in Table 1.
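As an illustration of the data preparation, 20Newsgroups can be fetched with scikit-learn and the YELP subset drawn with a fixed seed; the YELP file name, column names, and random seed below are assumptions.

```python
# Illustrative data preparation; the YELP file name, column names, and the
# random seed are assumptions.
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split

news = fetch_20newsgroups(subset="all")
news_train, news_test = train_test_split(
    list(zip(news.data, news.target)), test_size=3766, random_state=42)

yelp = pd.read_json("yelp_academic_dataset_review.json", lines=True)
yelp_sample = yelp.sample(n=36_000, random_state=42)[["text", "stars"]]
```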

4.2. Hyperparameters

We employed fixed hyperparameters rather than automated approaches to ensure that various models could be compared under the same conditions. The selected hyperparameters were chosen from a range of candidates through preliminary experiments, and the specific values used in our experiments are detailed in Table 2. While we recognize that hyperparameter tuning has the potential to enhance model performance, our focus was on establishing a consistent baseline for comparison.
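For reference, the fixed settings of Table 2 can be gathered into a single configuration object:

```python
# Fixed hyperparameters from Table 2, gathered into one configuration dict.
config = {
    "max_length": 256,
    "batch_size": 8,
    "learning_rate": 2e-5,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "dropout": 0.1,
    "epochs": 5,
    "num_topics": 50,
    "embedding_size": 100,
    "threshold": 0.1,
}
```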
All experiments were conducted using an AMD Ryzen 9 5950X 16-core CPU (AMD, Sunnyvale, CA, USA) and an NVIDIA GeForce RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA). We used Python 3.7.11 and the PyTorch 1.10.2 deep learning framework. Both BERT-based models were implemented with the Hugging Face Transformers library (v3.0.2). This setup allowed us to maintain a controlled environment for evaluating the models.

4.3. Results

To evaluate model performance, we utilized the F1 score, precision, recall, and accuracy as evaluation metrics, with each experiment conducted nine times with five epochs to ensure reproducible results. These metrics are commonly used to evaluate classification task performance. The accuracy is the ratio of correctly predicted instances to the total number of instances, whereas the F1 score is the harmonic mean of precision and recall. Here, precision is the ratio of true-positive instances to total instances predicted as positive, and recall is the ratio of true positive instances to total instances that should have been predicted as positive.
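These metrics follow their standard definitions and can be computed with scikit-learn; macro averaging across classes is our assumption for the multi-class setting.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_pred):
    """Standard classification metrics; macro averaging is an assumption."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}
```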
The experimental results are listed in Table 3, where AVM and EVM denote the ALBERT and ELECTRA baseline models, respectively. The better results comparing the baseline model and the proposed model are highlighted in bold. The “Time” column indicates the average time required for each model to execute five epochs.
In the case of ANTM, performance on all datasets improved by approximately 1–2% over the baseline model, suggesting that integrating a neural topic model with ALBERT can enhance model performance. In particular, ANTM exhibited consistent improvements in AUC, precision, recall, and F1 score across all datasets, indicating its robustness. Moreover, the relatively small MRD dataset showed the largest performance gain, indicating that the proposed model is particularly effective when the available data are limited and existing models alone struggle.
In contrast, although the ENTM model outperformed the baseline on the MRD and YELP datasets, it performed worse than EVM on the 20Newsgroups dataset. This discrepancy may be attributed to the specific characteristics of the 20Newsgroups dataset, which might not benefit as much from NTM integration. In other words, since the existing model already achieves a high AUC of around 90%, incorporating neural topics into an already well-performing model could slightly hinder its learning process.
In Table 3, our proposed model generally outperformed the baseline models. To determine whether these performance differences were statistically significant, we conducted t-tests at a significance level of 0.05, with results presented in Table 4. The accuracy, precision, recall, and F1 score of ANTM had a p-value of less than the significance level of 0.05. Therefore, the performance difference between the ANTM and its corresponding baseline was statistically significant.
For the ELECTRA model, statistically significant differences in accuracy, precision, recall, and F1 score were obtained for the MRD and YELP datasets, as the p-values were below the 0.05 significance level. The differences were also statistically significant for the 20Newsgroups dataset, but in the opposite direction: there, the baseline outperformed the hybrid model.
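The tests in Table 4 correspond to a two-sample t-test over the nine repeated runs; a sketch is shown below with placeholder scores, since the raw per-run results are not reproduced here (whether an equal-variance or Welch's test was used is also our assumption).

```python
from scipy import stats

# Placeholder scores for nine repeated runs (not the paper's raw numbers).
# A negative t-statistic indicates the hybrid model scored higher on average.
baseline_acc = [0.660, 0.658, 0.662, 0.659, 0.661, 0.657, 0.663, 0.660, 0.659]
hybrid_acc = [0.685, 0.687, 0.684, 0.688, 0.686, 0.683, 0.689, 0.685, 0.687]

t_stat, p_value = stats.ttest_ind(baseline_acc, hybrid_acc)
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")  # significant if p < 0.05
```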
Overall, our evaluation results indicate that the integration of NTM can significantly enhance performance across multiple datasets. The statistical significance of these improvements further supports the robustness of the proposed methodology.

4.4. Effects of the Number of Topics

In our previous experiment, we fixed the number of topics at 50. However, the performance of a neural topic model can be impacted by the number of topics. To investigate this, additional experiments were conducted by varying the number of topics from 20 to 100. The experimental results are shown in Figure 3.
For the MRD dataset, neither ANTM nor ENTM appeared to consistently exhibit improved performance as the number of topics increased. ANTM exhibited its maximal performance when the number of topics was 80, whereas ENTM achieved its highest performance when the number of topics was 60. This suggests that for the MRD dataset, there is no linear relationship between the number of topics and model performance, indicating the need for careful selection of the topic number.
For the 20Newsgroups dataset, the performance of ANTM decreased slightly with 80 topics and was optimal with 30, 50, and 70 topics. For ENTM, performance was relatively high when the number of topics was 30, 50, 70, 80, or 100. This indicates that ENTM is more robust to changes in the number of topics than ANTM, which shows more variability.
On the YELP dataset, ANTM achieved its best performance when the number of topics was 40. However, no significant differences in performance were observed for ENTM except when the number of topics was set at 70. This suggests that ENTM is less sensitive to the number of topics in the YELP dataset, whereas ANTM benefits from a specific number of topics.
From these observations, it is evident that each model’s performance changes as the number of topics varies across different datasets. For some datasets, significant performance fluctuations were observed with respect to changes in the number of topics, whereas for others, such variations were relatively negligible. Therefore, it is crucial to treat the number of topics as a hyperparameter and identify the optimal number of topics for each specific dataset.
Overall, our findings suggest that the number of topics significantly affects NTM performance; hence, careful tuning of this hyperparameter is essential for achieving optimal results across different datasets.

4.5. Dropout Effects

In addition to varying the number of topics, we conducted experiments on the dropout rate, which had been fixed at 0.1 in the previous experiments. This allowed us to evaluate potential differences in model performance associated with changes in the dropout rate. The dropout rate was varied from 0.1 to 0.5, with experimental results shown in Figure 4.
For the MRD dataset, ANTM achieved a performance peak at a dropout rate of 0.2, as well as relatively poor performance at dropout rates of 0.4 and 0.5. In contrast, ENTM exhibited its peak performance at a dropout rate of 0.3.
For the 20Newsgroups dataset, both ANTM and ENTM performed progressively worse as the dropout rate increased from 0.1 to 0.5. For the YELP dataset, performance was relatively low at dropout rates of 0.4 and 0.5, with ENTM performing worst at a dropout rate of 0.5.
As was the case with the number of topics, a significant degradation in performance was noted with certain datasets owing to changes in the dropout rate. Moreover, the lowest performance was often observed at a dropout rate of 0.5 for all three datasets, indicating that an increase in the proportion of dropouts to prevent overfitting may degrade learning performance.
Ultimately, a comprehensive evaluation of the dropout rate’s impact on model performance suggests that selecting an appropriate dropout rate is crucial for optimizing performance, with excessively high dropout rates having an especially detrimental effect on model accuracy.

4.6. Neural Topic Ensemble Model

Throughout the previous experiments, both the baseline and proposed models performed the experimental tasks using a single model. In contrast, we here propose an ensemble learning model that combines two pretrained models with the topic attention model (TAM) [29]. A pretrained BERT-based model produces a CLS token summarizing the information from all encoder representations. This CLS token and the final encoder representation $m$ obtained through the TAM are then combined to obtain $s$, which is applied to the desired task. The resulting architecture, depicted in Figure 5, is referred to as the ALBERT neural topic ensemble model (ANEM) or the ELECTRA neural topic ensemble model (ENEM).
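A rough sketch of this ensemble combination is given below; the projection sizes and module interfaces are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NeuralTopicEnsemble(nn.Module):
    """Sketch of ANEM/ENEM: the pretrained model's CLS token is concatenated
    with the TAM representation m to obtain s. Dimensions are assumptions."""

    def __init__(self, bert_model, tam_model, hidden=768, tam_dim=100, n_labels=3):
        super().__init__()
        self.bert = bert_model   # pretrained ALBERT or ELECTRA encoder
        self.tam = tam_model     # topic attention model producing m
        self.classifier = nn.Linear(hidden + tam_dim, n_labels)

    def forward(self, input_ids, attention_mask, bow):
        cls = self.bert(input_ids, attention_mask=attention_mask)[0][:, 0]  # CLS
        m = self.tam(bow)                        # final TAM representation m
        s = torch.cat([cls, m], dim=-1)          # combined representation s
        return self.classifier(s)                # prediction for the task
```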
As in the previous experiment, performance was evaluated by averaging the results of five epochs repeated nine times, with the dropout rate set to 0.1. The experimental results are listed in Table 5. The best results comparing the baseline model, the proposed model, and the ensemble model are highlighted in bold.
For the MRD and 20Newsgroups datasets, both ANEM and ENEM appeared to perform worse than the baseline models, with a significant difference for ANEM on the MRD dataset. For the YELP dataset, although the baseline model outperformed ANEM, ENEM somewhat outperformed its baseline in terms of precision, recall, and F1 score.
The underperformance of the ensemble model compared to both the baseline models and our previously proposed model may stem from its reliance on a single CLS token for integrating topic vectors. By using only one token to encapsulate all the information, the model may struggle to leverage the combined information effectively. Therefore, utilizing a topic vector for each token, which contains specific information, could facilitate more nuanced processing and lead to further performance improvements.
These results indicate that the proposed models outperformed both the baseline and ensemble models, with the exception of ELECTRA on the 20Newsgroups dataset.
In conclusion, while the proposed neural topic models demonstrated improved performance in most cases, further investigation is required to understand the performance drop observed with ENTM on the 20Newsgroups dataset.

5. Discussion and Conclusions

In this study, we proposed a combination of BERT-based models and a neural topic model using an attention mechanism. Our results demonstrated that this approach outperformed the baseline model. In particular, the proposed model performed better when the amount of data was relatively small. In other words, when the analytical field is challenging to learn and the data size is limited, the proposed method can efficiently increase the performance through fine-tuning.
However, for the 20Newsgroups dataset, performance improvement was not achieved. This suggests that the neural topic model may not significantly affect the cases in which the baseline model performs well. This indicates a need to explore the conditions under which the proposed model provides the greatest benefits.
In this study, we focused solely on classification tasks. Future research should include additional experiments on other tasks, such as regression, to further explore the potential for performance improvement. Furthermore, conducting comparative studies that include large language models (LLMs) like GPT-4 and LLaMA, as well as other types of models such as multimodal models, would be valuable. Additionally, we used the bag-of-words method to embed the input documents into the neural topic model, which has the disadvantage of ignoring the word order. Therefore, employing embedding methods that consider the order of sentences rather than the bag-of-words method could lead to more effective performance improvements. The experiments conducted in this study revealed that the performance of the proposed model was influenced by the characteristics of the datasets and hyperparameters. Therefore, future research should focus on developing automated methods for determining the optimal number of topics to further enhance model performance. Additionally, understanding the dataset-specific characteristics that influence the effectiveness of neural topic model integration could provide deeper insights into the behavior of the model, potentially leading to more tailored and effective model adaptations.

Author Contributions

Conceptualization, T.U. and N.K.; methodology, T.U. and N.K.; software, T.U.; validation, T.U. and N.K.; formal analysis, T.U.; investigation, T.U. and N.K.; resources, N.K.; data curation, T.U.; writing—original draft preparation, T.U. and N.K.; writing—review and editing, T.U. and N.K.; visualization, T.U.; supervision, N.K.; project administration, N.K.; funding acquisition, N.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT), grant number 2021R1F1A1050602.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in reference numbers [36,38,40]. The source code is available at https://github.com/Umdolphin/Neural-Topic-Attention-with-Transformer-Based-Language-Model (accessed on 2 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  2. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
  3. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  4. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  5. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  6. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
  7. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
  8. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  9. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  10. He, P.; Gao, J.; Chen, W. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv 2021, arXiv:2111.09543. [Google Scholar]
  11. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  12. Strubell, E.; Ganesh, A.; McCallum, A. Energy and policy considerations for deep learning in NLP. arXiv 2019, arXiv:1906.02243. [Google Scholar]
  13. Han, X.; Zhang, Z.; Ding, N.; Gu, Y.; Liu, X.; Huo, Y.; Qiu, J.; Yao, Y.; Zhang, A.; Zhang, L.; et al. Pre-trained models: Past, present and future. AI Open 2021, 2, 225–250. [Google Scholar] [CrossRef]
  14. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  15. Zhao, H.; Phung, D.; Huynh, V.; Jin, Y.; Du, L.; Buntine, W. Topic modelling meets deep neural networks: A survey. arXiv 2021, arXiv:2103.00498. [Google Scholar]
  16. Mcauliffe, J.; Blei, D. Supervised topic models. Adv. Neural Inf. Process. Syst. 2007, 20, 121–128. [Google Scholar]
  17. Ramage, D.; Hall, D.; Nallapati, R.; Manning, C.D. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–7 August 2009; pp. 248–256. [Google Scholar]
  18. Moghaddam, S.; Ester, M. ILDA: Interdependent LDA model for learning latent aspects and their ratings from online product reviews. In Proceedings of the 34th International ACM SIGIR Conference on RESEARCH and Development in Information Retrieval, Beijing, China, 24–28 July 2011; pp. 665–674. [Google Scholar]
  19. Li, X.; Ouyang, J.; Zhou, X. Supervised topic models for multi-label classification. Neurocomputing 2015, 149, 811–819. [Google Scholar] [CrossRef]
  20. Li, J.; Cardie, C.; Li, S. Topicspam: A topic-model based approach for spam detection. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 217–221. [Google Scholar]
  21. Amara, A.; Hadj Taieb, M.A.; Ben Aouicha, M. Multilingual topic modeling for tracking COVID-19 trends based on Facebook data analysis. Appl. Intell. 2021, 51, 3052–3073. [Google Scholar] [CrossRef]
  22. Na, L.; Ming-xia, L.; Hai-yang, Q.; Hao-long, S. A hybrid user-based collaborative filtering algorithm with topic model. Appl. Intell. 2021, 51, 7946–7959. [Google Scholar] [CrossRef]
  23. Gupta, P.; Chaudhary, Y.; Buettner, F.; Schütze, H. Document informed neural autoregressive topic models with distributional prior. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, No. 01. pp. 6505–6512. [Google Scholar]
  24. Larochelle, H.; Lauly, S. A neural autoregressive topic model. Adv. Neural Inf. Process. Syst. 2012, 25, 2708–2716. [Google Scholar]
  25. Wang, R.; Zhou, D.; He, Y. Atm: Adversarial-neural topic model. Inf. Process. Manag. 2019, 56, 102098. [Google Scholar] [CrossRef]
  26. Wang, R.; Hu, X.; Zhou, D.; He, Y.; Xiong, Y.; Ye, C.; Xu, H. Neural topic modeling with bidirectional adversarial training. arXiv 2020, arXiv:2004.12331. [Google Scholar]
  27. Miao, Y.; Grefenstette, E.; Blunsom, P. Discovering discrete latent topics with neural variational inference. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2410–2419. [Google Scholar]
  28. Wu, J.; Rao, Y.; Zhang, Z.; Xie, H.; Li, Q.; Wang, F.L.; Chen, Z. Neural mixed counting models for dispersed topic discovery. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6159–6169. [Google Scholar]
  29. Wang, X.; Yang, Y. Neural topic model with attention for supervised learning. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, Online, 26–28 August 2020; pp. 1147–1156. [Google Scholar]
  30. Peinelt, N.; Nguyen, D.; Liakata, M. tBERT: Topic models and BERT joining forces for semantic similarity detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7047–7055. [Google Scholar]
  31. Yin, J.; Wang, J. A dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 24–27 August 2014; pp. 233–242. [Google Scholar]
  32. Chaudhary, Y.; Gupta, P.; Saxena, K.; Kulkarni, V.; Runkler, T.; Schütze, H. TopicBERT for energy efficient document classification. arXiv 2020, arXiv:2010.16407. [Google Scholar]
  33. Miao, Y.; Yu, L.; Blunsom, P. Neural variational inference for text processing. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1727–1736. [Google Scholar]
  34. Bianchi, F.; Terragni, S.; Hovy, D. Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. arXiv 2020, arXiv:2004.03974. [Google Scholar]
  35. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  36. Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv 2005, arXiv:cs/0506075. [Google Scholar]
  37. Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882. [Google Scholar]
  38. Home Page for 20Newsgroups Data Set. Available online: http://qwone.com/~jason/20Newsgroups/ (accessed on 23 July 2024).
  39. Jiang, M.; Liang, Y.; Feng, X.; Fan, X.; Pei, Z.; Xue, Y.; Guan, R. Text classification based on deep belief network and softmax regression. Neural Comput. Appl. 2018, 29, 61–70. [Google Scholar] [CrossRef]
  40. Yelp Open Dataset. Available online: https://www.yelp.com/dataset (accessed on 23 July 2024).
  41. Tang, D.; Qin, B.; Liu, T. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; pp. 1422–1432. [Google Scholar]
Figure 1. Structure of the ALBERT (ELECTRA) baseline model.
Figure 2. Structure of the ALBERT (ELECTRA) neural topic model.
Figure 3. Visualizing the effect of the number of topics.
Figure 4. Visualizing the effect of dropout.
Figure 5. Structure of the ALBERT (ELECTRA) neural topic ensemble model.
Table 1. Summary of experimental data.

Dataset    MRD                      20News                    YELP
# Train    4506                     15,063                    30,000
# Test     500                      3766                      6000
Task       Multi-classification     Multi-classification      Multi-classification
           (3 labels)               (20 labels)               (5 labels)
Table 2. Model hyperparameters.

Max Length       256           Dropout            0.1
Batch Size       8             Epochs             5
Learning Rate    2 × 10⁻⁵      Number of Topics   50
Optimizer        AdamW         Embedding Size     100
Weight Decay     0.01          Threshold          0.1
Table 3. Model performance results.

Dataset: MRD
Model    AUC      Precision   Recall   F1       Time (s)
AVM      0.6604   0.6348      0.6233   0.5940   244
ANTM     0.6862   0.6593      0.6506   0.6213   262
EVM      0.7303   0.7005      0.6922   0.6677   249
ENTM     0.7390   0.7125      0.7021   0.6795   266

Dataset: 20Newsgroups
Model    AUC      Precision   Recall   F1       Time (s)
AVM      0.8694   0.8026      0.8016   0.7972   853
ANTM     0.8754   0.8121      0.8114   0.8070   929
EVM      0.8948   0.8381      0.8372   0.8333   862
ENTM     0.8883   0.8302      0.8295   0.8255   933

Dataset: YELP
Model    AUC      Precision   Recall   F1       Time (s)
AVM      0.6830   0.5656      0.5743   0.5478   1666
ANTM     0.6871   0.5751      0.5837   0.5567   1792
EVM      0.7115   0.5956      0.6047   0.5787   1694
ENTM     0.7145   0.6050      0.6124   0.5871   1809
Table 4. Results of t-tests.

ANTM
Dataset         Accuracy              Precision             Recall                F1
                t         p           t         p           t         p           t         p
MRD             −4.2576   0.0018      −3.4281   0.0048      −4.7978   0.0004      −4.1896   0.0011
20Newsgroups    −2.4514   0.0300      −2.9484   0.0112      −3.0977   0.0082      −2.9638   0.0108
YELP            −4.6724   0.0003      −5.7028   0.0001      −5.5028   0.0001      −5.9682   0.0001

ENTM
Dataset         Accuracy              Precision             Recall                F1
                t         p           t         p           t         p           t         p
MRD             −2.2591   0.0382      −2.5513   0.0226      −2.6032   0.0200      −2.7763   0.0139
20Newsgroups    4.3831    0.0011      3.8788    0.0023      4.1705    0.0015      3.9874    0.0020
YELP            −2.5017   0.0240      −4.4778   0.0004      −4.4527   0.0004      −4.5023   0.0004
Table 5. Ensemble model performance.

Dataset: MRD
Model    AUC      Precision   Recall   F1       Time (s)
AVM      0.6604   0.6348      0.6233   0.5940   244
ANTM     0.6862   0.6593      0.6506   0.6213   262
ANEM     0.6288   0.6093      0.5977   0.5684   296
EVM      0.7303   0.7005      0.6922   0.6677   249
ENTM     0.7390   0.7125      0.7021   0.6795   266
ENEM     0.7279   0.6927      0.6902   0.6697   302

Dataset: 20Newsgroups
Model    AUC      Precision   Recall   F1       Time (s)
AVM      0.8694   0.8026      0.8016   0.7972   853
ANTM     0.8754   0.8121      0.8114   0.8070   929
ANEM     0.8420   0.7678      0.7686   0.7631   1043
EVM      0.8948   0.8381      0.8372   0.8333   862
ENTM     0.8883   0.8302      0.8295   0.8255   933
ENEM     0.8888   0.8293      0.8287   0.8246   1049

Dataset: YELP
Model    AUC      Precision   Recall   F1       Time (s)
AVM      0.6830   0.5656      0.5743   0.5478   1666
ANTM     0.6871   0.5751      0.5837   0.5567   1792
ANEM     0.6760   0.5499      0.5626   0.5339   2039
EVM      0.7115   0.5956      0.6047   0.5787   1694
ENTM     0.7145   0.6050      0.6124   0.5871   1809
ENEM     0.7114   0.5970      0.6063   0.5802   2061
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
