1. Introduction
In the internet era, a vast amount of new unstructured textual data becomes available every day. Analyzing these data and dividing their content into categories requires an automated approach to text classification. From the point of view of known classification methods, the general text classification task can be divided into short and long text classification. The rationale for this division is that, since deep learning methods were introduced mostly in 2010–2020, these two subtasks have often required different language modeling approaches. Before 2010, little need for this distinction existed, because for classic methods—such as bag of words (BoW), term frequency-inverse document frequency and various lexicon-based methods, such as linguistic inquiry and word count [
1]—the computational cost of using each method did not substantially change with the length of the text instance (LTI), because all the methods were based on counting token occurrences (CTC).
However, when convolutional neural networks and recurrent neural networks were introduced to the domain of natural language processing (NLP), the situation changed. The computational costs of visiting each text fragment with convolutional filters in convolutional neural networks appeared to grow markedly faster with the increase in LTI, in contrast to CTC methods. Additionally, maintaining past tokens in memory of recurrent neural networks appeared to substantially increase the computational cost for longer token sequences.
Since 2017, when the attention mechanism was introduced [
2] and deployed in what are now called transformer models, the limit of LTI was proposed as 512 tokens to maintain reasonable computation times. This step was necessary because, for each token in the text instance, the original attention mechanism analyzed connections to all other tokens in the text instance, thus substantially increasing computational cost when larger LTI values were considered. To alleviate this issue, various modifications to the attention mechanism have been proposed in several studies [
3,
4,
5].
Consequently, when a long text instance is to be classified, a data scientist can choose among very efficient classic CTC methods or highly inefficient deep learning methods. Even if the latter are selected for their theoretically higher prediction quality, not everyone can afford a costly computational machine capable of performing the required deep learning operations. For example, a longformer [
6] model cannot always be deployed, and proper data set-dependent parameter optimization of the necessary fine-tuning procedure cannot always be performed. Without this step, the best results offered by the model cannot be acquired.
To overcome this limitation, several approaches have been proposed for decreasing the computational cost of deep learning text classification methods through introducing information loss at the beginning of the classification process. The general idea is simple: the original too long text instance must be truncated to fit the LTI limit of the language model selected for final analysis. If a model with an LTI limit of 512 tokens is selected, the most naive truncation method is to analyze only the first 512 tokens of the original text instance. Despite being naive, this truncation method is very common and is the default in, for example, the renowned NLP framework Flair [
7]. Another known approach [
8] proposed using the available analysis space of 512 tokens so that the new text instance consists of part of the beginning and part of the end of the original text instance. More complex deep learning approaches described by [
9,
10] have proposed an encoder-decoder model or another “judge” model to select relevant sentences for the final analysis. Very recently, [
11] proposed a simpler method called Text Guide, which is based on a CTC model that also allows important text fragments to be selected for text classification. However, the authors of Text Guide have indicated that the performance of their method could presumably be improved through additional experiments.
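The two simple truncation strategies mentioned above can be sketched as follows (a minimal illustration only; the token limit and the head/tail split are example values, not the settings used by any particular framework):

```python
def naive_truncate(tokens, limit=512):
    # Keep only the first `limit` tokens of the instance.
    return tokens[:limit]

def head_tail_truncate(tokens, limit=512, head=128):
    # Keep `head` tokens from the beginning and fill the rest
    # of the budget with tokens from the end of the instance.
    if len(tokens) <= limit:
        return tokens
    tail = limit - head
    return tokens[:head] + tokens[-tail:]
```

Both strategies discard information blindly; Text Guide instead tries to select the discarded fragments on the basis of token importance.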
In this study, we aimed to explore selected possibilities for improving the results obtained with the Text Guide method. In particular, we wanted to answer the following research questions (RQ):
- RQ1
What is the influence of CTC model quality on Text Guide performance?
- RQ2
Does the quality of the Text Guide depend on the adopted machine learning interpretability method?
- RQ3
Is considering more than one occurrence of an essential token found in the original text instance by Text Guide beneficial?
- RQ4
Is Text Guide useful for text instances only slightly exceeding the model limit?
- RQ5
Are the results obtained by Text Guide with a selected transformer model transferable to other models?
- RQ6
Is use of Text Guide beneficial for other text classification data sets?
The reasons for formulating the above research questions, and our approach to answering them, were the following:
Regarding RQ1, Ref. [
11] has used a BoW model for obtaining token features. Here, we explore the influence of the quality of the model used for obtaining token features on the final classification performance based on text instances created by Text Guide.
Considering RQ2, in [
11], only one tool for extracting the feature importance of selected important tokens was used. In this study, we tested whether a more recent approach called Shapley Additive Explanations (SHAP) [
12] might result in performance improvements.
Regarding RQ3, we observed that Text Guide searches the original text instance only for the first occurrence of a token previously identified as important; if found, Text Guide extracts the token together with its neighboring tokens and adds the resulting “pseudo sentence” to the new text instance. Here, we explored the performance of Text Guide when not only the first occurrence was considered.
Considering RQ4, based on [
11], we believed Text Guide’s performance for text instances exceeding the model limit to only a small extent required further exploration. For this purpose, we introduced a threshold parameter defining the instance length for which the Text Guide method should be used.
RQ5 was formulated to verify whether, if using Text Guide is beneficial for one model, it can be assumed to also be beneficial for other models. The answer to this RQ was obtained by testing four other transformer models.
Finally, regarding RQ6, testing a method on various data sets can confirm its utility. To demonstrate that Text Guide can be helpful for other data sets, we applied it to another well-known data set.
For all experiments, we used a publicly available Text Guide implementation [
13]. We believe the main contribution of our research is testing and optimizing various modifications and parameter values of the very recently introduced truncation method.
2. Methods
The methods of this study are described in the following ten subsections.
2.1. Revisiting Text Guide
The original Text Guide method was proposed and described in detail in [
11]. Here, we provide a brief review as a context for our current experiments. Text Guide allows for non-naive truncation of text instances that exceed the length limit of a selected deep learning model, which is later used to create a vector representation of the text instance. As presented in
Figure 1, Text Guide requires initial input of a list of important tokens extracted from the original text instance and sorted according to their feature importance (sITFL), the original text instance, the desired length of the new instance measured in tokens (desired LTI) and a set of additional parameters.
In a preprocessing step, to create the sITFL, a simple CTC language model operating on token features such as BoW can be used. After a machine learning classifier is trained on the extracted features, the features need to be sorted according to their importance obtained from the classifier via a method from the explainable artificial intelligence (XAI) domain.
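The preprocessing step above can be sketched in pure Python as follows (an illustrative stand-in only: the toy importance score below replaces the trained classifier and the XAI method, and all names are ours, not those of the Text Guide implementation):

```python
from collections import Counter

def build_sitfl(texts, labels):
    """Toy stand-in for the CTC-model + classifier + XAI pipeline:
    score each token by how unevenly it occurs across two classes,
    then sort the vocabulary by that importance score (descending)."""
    class_counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        class_counts[label].update(text.lower().split())
    vocab = set(class_counts[0]) | set(class_counts[1])
    # Importance: absolute difference of the per-class occurrence counts.
    importance = {t: abs(class_counts[0][t] - class_counts[1][t])
                  for t in vocab}
    return sorted(vocab, key=lambda t: importance[t], reverse=True)
```

In the actual pipeline, the importance scores come from the trained classifier (e.g., via feature importance or SHAP values), but the output is the same kind of object: a token list sorted by importance.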
The desired LTI can be an arbitrary number but optimally it mimics the limit of the language model later used to create vector representations of text instances. For the models from the transformer family, the value most often used is 512 tokens, but models offering a higher 4096 token limit also exist [
6].
To create a new truncated text instance from the original one, Text Guide extracts fragments of text located around important tokens found in the original text instance. Consequently, when a token from the sITFL is found in the text instance, Text Guide extracts it together with its neighbors, and such “pseudo sentences” are used to create the new text instance. Additional parameters define, for example, the number of neighbor tokens to be extracted or whether an arbitrary part from the beginning and/or end of a text instance is to be extracted. In [
11], these parameters were subject to model and data set-specific optimization.
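The extraction logic can be sketched as follows (a simplified, single-occurrence version; the function and parameter names are ours and do not mirror the public implementation):

```python
def text_guide_truncate(tokens, sitfl, desired_lti, n_neighbors=2, head=0):
    """Build a new instance from pseudo sentences centered on important
    tokens, optionally keeping `head` leading tokens of the original."""
    new_instance = tokens[:head]
    used = set(range(head))
    for important in sitfl:
        if len(new_instance) >= desired_lti:
            break
        if important in tokens:
            i = tokens.index(important)  # first occurrence only
            span = range(max(0, i - n_neighbors),
                         min(len(tokens), i + n_neighbors + 1))
            # Extract the pseudo sentence, skipping already-used positions.
            pseudo = [tokens[j] for j in span if j not in used]
            used.update(span)
            new_instance.extend(pseudo)
    return new_instance[:desired_lti]
```

The `used` set prevents overlapping pseudo sentences from duplicating tokens in the new instance.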
2.2. Experimenting with the Quality of the Model Responsible for Extracting Token Features
The token features extracted from the text instance in [
11] were obtained by training a simple BoW model on the text corpus. Here, we decided to assess the extent to which modifying the BoW model affected the overall text classification performance achieved on instances prepared by Text Guide. All BoW experiments were performed with the Python scikit-learn CountVectorizer function [
14], the number of features was reduced according to the computed mutual information [
15] score, and 200, 4000 and 10,000 features output from the model were tested. The rationale for these choices was that when the lowest value of 200 was considered, the BoW model would presumably perform poorly, owing to the excessively limited number of features, whereas in the other two cases, this limitation would be alleviated. We expected to observe poorer Text Guide performance in the first test case and improved performance in the other two cases. In the experiments throughout this study, unless stated otherwise, BoW with 4000 features was used.
2.3. Testing Output from Other Model Interpretability Tools
In [
11], the sITFL was obtained by training a gradient boosting machine learning classifier (XGBoost) on the token features obtained from the BoW model and then extracting the feature importance directly from XGBoost. We sought to test whether using a more recent SHAP method from the XAI domain might improve the order of features in the sITFL as well as the overall classification performance based on text instances created by Text Guide. We used both the Kernel SHAP [
16] and Tree SHAP [
17] methods. Additionally, we computed results for randomly sorted sITFL for baseline comparison. In later experiments throughout this study, unless stated otherwise, XGBoost was used.
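Regardless of which explainer produces the per-instance attributions, the sITFL ordering step reduces to ranking features by their mean absolute attribution. This can be sketched as follows (the attribution matrix here is a toy input standing in for Kernel or Tree SHAP output; the function name is ours):

```python
def rank_features_by_attribution(feature_names, attributions):
    """Sort features by their mean absolute per-instance attribution,
    as done with SHAP values (one row per instance, one column per
    feature)."""
    n = len(attributions)
    mean_abs = [sum(abs(row[j]) for row in attributions) / n
                for j in range(len(feature_names))]
    order = sorted(range(len(feature_names)),
                   key=lambda j: mean_abs[j], reverse=True)
    return [feature_names[j] for j in order]
```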
2.4. Using More than the First Token Occurrence
In the basic approach, when Text Guide is provided with the sITFL and the original text instance, it begins operation by searching in the original text instance for the first important token taken from the sITFL. If the important token is found, then a “pseudo sentence”—consisting of the important token and its neighbors—is extracted from the original text instance and used to create a fragment of the new text instance. Later, the next important token from the sITFL is analyzed. Consequently, if more occurrences of the first important token exist in the original text instance, they are neglected. In this study, we explored how the Text Guide performance is influenced by considering more occurrences. For our experiments, we selected 1, 2, 3, 4, 5 and all occurrences for testing.
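The generalization from the first occurrence to the first k occurrences can be sketched as follows (the function and parameter names are ours, not those of the public implementation):

```python
def find_occurrences(tokens, important_token, max_occurrences=None):
    """Return the positions of up to `max_occurrences` occurrences of
    `important_token` in the instance; None means all occurrences."""
    positions = [i for i, t in enumerate(tokens) if t == important_token]
    if max_occurrences is None:
        return positions
    return positions[:max_occurrences]
```

Each returned position is then expanded into a pseudo sentence of the token and its neighbors, exactly as in the single-occurrence case.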
2.5. Experiments with Text Instances That Are Not Extremely Long
The initial results obtained in [
11] with a fine-tuned Albert base v2 (AlBERT) model [
18] on the text instances truncated by Text Guide were slightly inferior to those obtained via the naive truncation method. An additional experiment was performed in which Text Guide was used only for instances that exceeded the model limit by a factor above 1.5. In this case, the fine-tuned AlBERT model yielded slightly improved results relative to those of the naive truncation method. However, the authors of the original Text Guide have stated that the threshold value of 1.5 was selected without any proper investigation and that this value should be explored in future research. Here, to respond to that call, we performed a more detailed analysis of the influence of this parameter, which we denote the over length threshold (OLT). We investigated OLT values of 1, 1.1, 1.2, 1.3, 1.4 and 1.5. For instances shorter than the Text Guide application threshold, we used a naive truncation method, which resulted in analysis of only the first tokens that fit the model limit. In later experiments throughout this study, unless stated otherwise, an OLT equal to 1 was used.
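The OLT gate can be sketched as follows (an illustrative helper; the function name is ours):

```python
def choose_truncation(tokens, model_limit=512, olt=1.0):
    """Apply Text Guide only when the instance exceeds the model limit
    by more than the over length threshold (OLT); otherwise fall back
    to naive truncation of the leading tokens."""
    if len(tokens) > olt * model_limit:
        return "text_guide"
    return "naive"
```

With OLT = 1, every instance exceeding the model limit is handled by Text Guide; with OLT = 1.5, only instances at least 1.5 times longer than the limit are.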
2.6. Data Sets
The choice of data sets used in experiments with Text Guide followed [
11], which used DMOZ [
19], a well-known data set also used by other researchers for benchmarking [
20] or web page classification with long text instances [
21,
22]. As in [
11], our study focused on the 30 most common classes in the DMOZ data set and divided the original data set into three data sets on the basis of text instance length. Additionally, in order to demonstrate the Text Guide performance on a different well-known data set, we utilized the Internet Movie Data Base Large Movie Review data set [
23], which defines a binary sentiment classification task. Specifically, this data set was analyzed in two settings: (1) ‘IMDB FULL’, i.e., the whole data set consisting of 49,582 unique data instances, and (2) ‘IMDB 510 1000’, i.e., the data set limited to 3609 unique instances with lengths over 510 and below 1000 tokens. All the used data sets are described in
Table 1.
2.7. Models, Training Procedure and Metrics
To create vector representations of the text instances prepared by Text Guide for most of our experiments, we used AlBERT, a small transformer model enabling less computationally expensive experiments. To test the extent to which the obtained conclusions were transferable to other pre-trained models, for final comparison of Text Guide with naive truncation, we also used the (1) RoBERTa base (RoBERTa) model [
24], (2) Squeezebert uncased (SqueezeBERT) model [
25], (3) Distilbert base uncased (DistilBERT) model [
26] and (4) BERT base uncased (BERT) model [
27]. All models were downloaded from the Huggingface Transformers server [
28]. The rationale for model selection was to demonstrate the performance of models representing the current state of transformer model architectures with a length limit of 512 tokens. Some of the selected models, such as AlBERT, DistilBERT and SqueezeBERT, have been demonstrated to provide high performance at limited computational cost. BERT was selected because it is the most renowned transformer baseline model, whereas RoBERTa was tested because it usually offers the best prediction quality.
All compared models were used in their pre-trained versions to avoid bias from unoptimized fine-tuning procedures.
The instance embeddings created by the models were fed to the XGBoost classifier [
29]. All machine learning classification experiments were performed with a five-fold stratified cross validation procedure. In each fold, the training, validation, and testing sets differed, and classification metrics were computed after gathering the ground truth and predicted labels for testing instances from all five folds.
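The evaluation procedure of gathering predictions across all folds can be sketched as follows (a minimal scikit-learn illustration; LogisticRegression stands in here for the XGBoost classifier, and the function name is ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def cross_validated_predictions(X, y, n_splits=5, seed=0):
    """Gather ground-truth and predicted labels for the test instances
    of all folds of a stratified cross validation, so that metrics can
    be computed once over the pooled predictions."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    truths, preds = [], []
    for train_idx, test_idx in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        truths.extend(y[test_idx])
        preds.extend(clf.predict(X[test_idx]))
    return np.array(truths), np.array(preds)
```

Computing metrics on the pooled predictions (rather than averaging per-fold metrics) ensures that every test instance contributes exactly once.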
Unless stated otherwise, experiments were performed with Text Guide parameters optimized for the pre-trained AlBERT model adopted from [
11], as follows: (1) for all data sets 0.1 × 510 = 51 tokens were selected from the very beginning of the original text instance to form the beginning of the new text instance, (2) TN was set to two for the DMOZ 510–1500+ data set and to three for the other data sets, and (3) the end of the original text instance was not specifically considered.
For all experiments, we used the Matthews correlation coefficient introduced in [
30] as the decisive quality measure of predictions. As demonstrated in [
31], the Matthews correlation coefficient is superior to other metrics when imbalanced data sets with numerous classes are analyzed, as in our DMOZ derived 30-class data sets. However, for the experiments carried out on IMDB data sets where a binary sentiment classification task was addressed, we also include more popular metrics such as F1 micro score and receiver operating characteristic area under curve (ROC AUC) score [
32].
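For the binary IMDB tasks, the Matthews correlation coefficient reduces to a simple function of the confusion-matrix counts, which can be sketched as follows (a pure-Python illustration for the binary case only; the multiclass generalization used for the DMOZ data sets is more involved):

```python
import math

def mcc_binary(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels,
    computed from the confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denom
```

The coefficient ranges from -1 (total disagreement) through 0 (random-level prediction) to 1 (perfect prediction).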
2.8. Computing Machine and Software
The in silico experiments were conducted on a single computing machine equipped with a 16-core CPU, 64 GB RAM, and Titan X 24 GB RAM GPU. The software for the experiments was implemented in Python3 with publicly available packages including Flair [
7], XGBoost [
29], Transformers [
28] and Scikit-learn [
33].
2.9. Experimental Procedure
To answer the research questions described in the introduction, the necessary experiments were carried out according to the order demonstrated in
Figure 2. Following this procedure first allowed the parameters of Text Guide to be defined and then allowed them to be used in the final experiments with various models and data sets.
2.10. Statistical Analysis of Results
A change in the truncation method does not necessarily produce large differences in data set-wide results, for instance when only a small fraction of the text instances in a data set exceeds the adopted length limit. Therefore, we conducted a statistical bootstrap analysis to resolve uncertainties regarding the significance of the differences in the final comparison on the IMDB data sets. The analysis focused on the decisive MCC metric and was conducted for each transformer model and each IMDB data set separately, comparing naive truncation with Text Guide. The following procedure was adopted: (1) we hypothesized that the model results obtained with Text Guide would not be significantly better than those obtained with naive truncation; (2) we bootstrapped the distribution of the metric value for the model trained with the naive truncation method, resampling and computing the metric value 10,000 times in all cases; and (3) we computed the statistical significance of the difference from the metric value of the model using the Text Guide truncation method. The resulting p-values were used to accept or reject the stated hypothesis as follows: (1) a p-value greater than 0.1 indicates no statistically significant difference between models, and only in this case was the hypothesis accepted; (2) a p-value between 0.1 and 0.01 (marked with ‘*’) indicates a weakly significant difference; (3) a p-value between 0.01 and 0.001 (‘**’) indicates a significant difference; and (4) a p-value lower than 0.001 (‘***’) denotes a highly significant difference between models.
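The bootstrap test can be sketched as follows (a simplified version for a generic metric; the 10,000 resamples follow the text, while the seed handling and function names are ours):

```python
import random

def bootstrap_p_value(y_true, y_pred_naive, observed_metric, metric_fn,
                      n_resamples=10_000, seed=0):
    """Bootstrap the metric distribution of the naive-truncation model
    and estimate the probability of reaching the Text Guide model's
    observed metric value by chance."""
    rng = random.Random(seed)
    n = len(y_true)
    at_least = 0
    for _ in range(n_resamples):
        # Resample test instances with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric_fn([y_true[i] for i in idx],
                          [y_pred_naive[i] for i in idx])
        if value >= observed_metric:
            at_least += 1
    return at_least / n_resamples
```

A small p-value means the naive-truncation model rarely reaches the Text Guide result under resampling, i.e., the improvement is unlikely to be a sampling artifact.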
4. Conclusions
In this study, we explored selected means of optimizing and testing the Text Guide method introduced in earlier research. Six research questions were defined and answered, which allowed us to propose recommendations regarding improved default parameters for Text Guide and confirmed that the final classification performance depends on the methods used by Text Guide. Our research formulated clear and practical contributions for future users of Text Guide. In particular, we found that the quality of the CTC model and of the XAI tool, both of which are responsible for extracting relevant tokens and their neighbors from the original text instance in the correct order, influences Text Guide performance. Our findings also demonstrated that Text Guide can provide performance superior to that of the naive truncation method for several recent transformer models and for the well-known DMOZ and IMDB large movie review data sets.
However, we also note a limitation of our study and of the Text Guide method: model and data set-specific parameter optimization is recommended before Text Guide is applied. This step may be important yet challenging, especially from a computational point of view, and future research should focus on using Text Guide together with fine-tuned transformer models.