1. Introduction
Artificial Intelligence (AI) models, including Neural Networks and Support Vector Machines, are often considered black boxes due to their lack of interpretability, which raises reliability concerns. A key challenge is their tendency to learn spurious correlations—patterns that seem predictive but fail under distribution shifts [1]. These correlations frequently stem from dataset artifacts, such as selection bias or intentionally embedded manipulations known as backdoors [2]. Additionally, confounders—unintended variables that distort the relationship between features and output classes—can further mislead models, leading to systematic errors and reduced generalization [3]. In Natural Language Processing (NLP), models often depend on superficial patterns rather than genuine linguistic structures. This phenomenon, known as shortcuts, enables high accuracy during training but leads to poor generalization [4]. A well-known example is the Clever Hans effect, where models exploit irrelevant cues instead of learning the actual task [5]. Similarly, backdoors refer to deceptive biases—which may be intentionally inserted—that models leverage to manipulate predictions [6].
Wang et al. [4,7] reported an illustrative case where a sentiment classifier incorrectly associates the word “spielberg” with positive reviews. Instead of evaluating sentiment, the model learns that reviews mentioning “Spielberg” are frequently positive in the training set, leading to misclassification when the word is absent. This example highlights how models capture non-causal relationships, resulting in unreliable predictions.
Researchers have developed Explainable Artificial Intelligence (XAI) techniques to improve model interpretability and expose biases [8]. While not explicitly designed to detect spurious correlations, XAI methods help identify artifacts by highlighting highly weighted features lacking causal relevance [9,10,11]. Prior research has shown that spurious patterns are a significant source of model errors, mainly when data labels contain inconsistencies [12,13].
This work presents a method that combines XAI-based feature importance with unsupervised learning to cluster patterns logically and interpretably, enabling a structured analysis of their association with output classes. Our heuristic uses statistical metrics from model–data interactions to catalog misclassified instances, extract frequent words from similar sentences, and assess their impact. It perturbs the data by removing these words from test samples and evaluates changes in sensitivity and specificity. Unsupervised learning then clusters the detected patterns, uncovering their relationships and likelihood of being spurious. To validate our approach, we apply it to datasets of Public Procurement Contracts and Bidding Processes, which present challenges due to their formal structure and domain-specific vocabulary. We also conduct experiments on IMDB, assessing the method’s generalizability and comparing it with existing techniques.
Unlike previous approaches, which focus on individual words, our method detects multi-word patterns, capturing more complex spurious correlations. The novelty of our approach lies in three key aspects: it operates without human intervention, enabling the fully automated detection of spurious patterns; it introduces the distance to cluster centroids as a metric to gradually quantify spuriousness instead of a binary classification; and it identifies compound patterns, allowing for a more nuanced analysis of spurious dependencies. The main contributions of this work are as follows.
An automated, unsupervised approach that eliminates the need for the manual identification of candidate spurious patterns.
The clustering of spurious patterns into interpretable groups, introducing distance to cluster centroids as a metric for measuring spuriousness.
Real-world datasets of public procurement contracts for binary classification and spurious correlation analysis, available at the mentioned GitHub link.
These contributions set our method apart from existing techniques, providing a more comprehensive understanding of spurious correlations in text classification.
Although our focus is on detection, several techniques exist to mitigate spurious correlations, including statistical reweighting, regularization, and counterfactual generation [14,15]. By identifying spurious patterns, our approach lays the foundation for the application of such corrective strategies.
2. Related Work
Recent research explores methods to detect and mitigate spurious correlations in AI models, particularly in classification tasks. This section categorizes the key approaches into four distinct groups.
2.1. Counterfactual Generation
Counterfactual generation modifies input instances to determine whether a model depends on spurious correlations or meaningful patterns. Researchers apply this technique using different methodologies.
Wang and Culotta [4] generate counterfactuals by replacing specific words in the input. Their method uses statistical techniques to identify words that are highly correlated with model predictions and assesses their impact on classification. However, this approach focuses solely on surface-level modifications and does not ensure semantic coherence. Yadav et al. [16] introduce a Tsetlin Machine-based approach to generate counterfactual rules. Instead of modifying individual instances, their method learns logical patterns that differentiate classes, enabling counterfactual generation with structured rules. However, this approach faces challenges in generalizing learned patterns to complex contexts. Liu et al. [17] examine whether deep learning models perform logical reasoning or rely on statistical correlations. They construct a counterfactual dataset to test whether models maintain logical consistency when data structures change. Their results show that models exploit spurious patterns in training data and fail under systematic modifications. Veitch et al. [15] analyze counterfactual generation from a causal inference perspective. They argue that generating counterfactuals without considering causal structures may introduce new biases. They propose conditional independence-based regularization to approximate counterfactual invariance, demonstrating that different causal relationships require distinct mitigation techniques.
Counterfactual generation provides a method for assessing and improving model robustness. However, the reviewed studies highlight the need to ensure that counterfactuals maintain semantic validity and that modifications influence predictions meaningfully. Selecting a counterfactual generation method requires considering the causal structure of the data to avoid introducing artificial patterns that may induce biases.
2.2. Data Perturbation
Data perturbation alters textual inputs to assess model robustness. This technique introduces controlled modifications, such as word replacements and removals, syntax changes, and adversarial noise, to determine if a model relies on meaningful patterns or superficial statistical artifacts. Unlike counterfactual generation, which aims to create semantically valid alternative samples, data perturbation focuses on testing model sensitivity by systematically distorting input structures.
Wang and Culotta [7] employ perturbation techniques to replace words strongly correlated with the target label, evaluating the model’s reliance on specific lexical features. By altering these words, they assess whether models depend on linguistic cues or learn shallow correlations. Their findings reveal that simple word substitutions significantly affect model predictions, highlighting the presence of spurious dependencies. Wu et al. [14] propose an automated framework to generate and filter training samples, aiming to mitigate biases by removing instances that reinforce spurious correlations. Their method perturbs text by modifying task-independent features and evaluates the impact using z-statistics. The filtering process eliminates samples that exhibit strong correlations with non-task-related features, resulting in datasets that promote more generalizable models. However, this approach depends on prior knowledge of biased attributes, which may not be available in all domains. Zhang et al. [18] systematically analyze the effects of different perturbation strategies on neural NLP models. They introduce a learnability metric to measure how well a model adapts to specific perturbations. Their experiments show that models sensitive to perturbations—such as word scrambling, character alterations, and punctuation manipulations—tend to exhibit lower robustness. This suggests that models overfitting to dataset artifacts struggle to generalize to unseen variations. While their approach provides valuable insights, the computational cost of measuring learnability may limit its practical adoption.
Although counterfactual generation can complement perturbation-based approaches by introducing semantically coherent modifications, data perturbation primarily aims to stress-test models by introducing structured distortions. The reviewed studies demonstrate that perturbation techniques effectively uncover weaknesses in NLP models, though challenges remain in selecting meaningful perturbations and ensuring computational efficiency.
2.3. Explainable AI Techniques for Detecting Spurious Correlations
Explainable Artificial Intelligence (XAI) encompasses methods designed to make machine learning models more transparent by providing insights into their decision-making processes. While the primary goal of XAI is to enhance interpretability, several studies demonstrate that these techniques can also help detect spurious correlations by highlighting influential patterns in model predictions.
Two widely used explanation techniques are Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP). LIME approximates a black-box model by locally fitting an interpretable model, such as linear regression, to perturbed input versions. It assigns feature importance scores based on how modifications to input variables affect predictions [19]. Based on cooperative game theory, SHAP computes Shapley values to quantify the marginal contribution of each feature to a model’s prediction, ensuring a consistent and theoretically grounded attribution method [20].
Cardozo et al. [21] introduce Explainer Divergence Scores (EDSs) to measure how different explanation methods reveal dependencies on spurious patterns. Their findings suggest that while LIME and SHAP provide meaningful insights, they often highlight spurious features without explicitly distinguishing them from genuine predictive patterns. Soares et al. [22] propose combining XAI with human evaluation to filter misleading patterns in textual datasets. Their results indicate that LIME effectively pinpoints spurious dependencies by assigning high importance to features frequently associated with incorrect predictions. SHAP, in contrast, offers a more stable attribution across multiple samples, making it useful for detecting persistent spurious correlations.
However, challenges remain in distinguishing causal from non-causal dependencies. Chou et al. [11] argue that explanations might reinforce biases if not appropriately validated. They note that feature attribution methods frequently emphasize co-occurring patterns, which may not always reflect authentic causal relationships.
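As an illustration of how such feature attributions are obtained in practice, the sketch below applies LIME to a generic scikit-learn text pipeline; the toy corpus, the pipeline, and the number of features are assumptions for illustration, not the setup of the studies reviewed here.

```python
# Minimal sketch of LIME for text classification (assumed toy setup, not from the cited studies).
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data; any binary-labeled corpus would do.
texts = ["great movie", "terrible plot", "loved it", "boring and long"]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
# LIME perturbs the raw string and re-queries predict_proba to fit a local surrogate.
explanation = explainer.explain_instance(
    "loved the boring plot", pipeline.predict_proba, num_features=5
)
print(explanation.as_list())  # [(token, weight), ...] from the local surrogate model
```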
2.4. Other Techniques: Statistical Reweighting and Regularization Methods
In addition to the techniques presented, new strategies address spurious correlations in text classification. Two approaches are statistical reweighting and regularization methods, which aim to reduce the model’s dependence on misleading patterns by adjusting the training data distributions and modifying the learning process, respectively.
Statistical reweighting modifies the weight of training instances to counteract dataset biases. The technique adjusts sample importance to prevent spurious correlations from dominating model learning. Serrano et al. [23] formulate an optimization-based reweighting method that redistributes instance weights to mitigate lexical biases. Their study applies the method to natural language inference and duplicate-question detection tasks, where individual word occurrences should not influence predictions. Statistical reweighting simultaneously reduces thousands of spurious correlations and aligns label distributions more uniformly across multiple lexical features. Despite reducing biases at the data level, trained models retain significant biases. While unigram biases decrease, associations between more complex patterns (e.g., bigrams) intensify, showing that reweighting alone does not eliminate spurious dependencies. The study demonstrates that mitigating biases in NLP requires modifications beyond data reweighting, including adjustments to model architectures and learning strategies to address deeper correlations.
Regularization techniques adjust the training process to prevent models from depending on spurious patterns. Chew et al. [5] propose a family of regularization methods called NFL (doN’t Forget your Language), which limits changes in parameter updates to reduce overfitting to spurious correlations. Their approach applies neighborhood analysis to show how spurious correlations cluster unrelated words in the embedding space, causing models to misattribute importance to non-causal tokens. NFL addresses this issue through different constraints. NFL-F (Frozen) preserves original word representations by freezing pre-trained model parameters and fine-tuning only the classification head. NFL-CO (Constrained Outputs) maintains pre-trained semantic relationships by introducing a loss term that minimizes cosine distance between token representations before and after fine-tuning. NFL-CP (Constrained Parameters) prevents drastic shifts in learned representations by penalizing excessive parameter updates. NFL-PT (Prompt-Tuning) enables adaptation without modifying learned embeddings by using continuous prompts instead of fine-tuning the entire model. Their results show that NFL significantly improves model robustness against spurious correlations without requiring external datasets. However, they also note that no single regularization strategy is universally effective, and each variation balances generalization and robustness differently.
3. Theoretical Framework
3.1. Natural Language Processing (NLP)
Natural Language Processing (NLP) focuses on the interaction between computers and human language, particularly in text classification tasks. NLP models apply algorithms that assign texts to categories, aiming to maximize the probability of a correct prediction, defined as $\hat{y} = \arg\max_{y} f_{\theta}(x, y)$, where $x$ is the input sentence, $y$ is the predicted output, and $f_{\theta}$ represents the scoring function based on the parameters $\theta$ [24].
This study employs three models for binary classification. The first model is Logistic Regression (LRG), a linear classifier widely used in text classification due to its scalability and interpretability, which allow direct feature importance analysis [25]. However, its linear nature restricts its ability to capture complex word dependencies. LRG was selected for its intrinsic explainability, which aligns with our model-agnostic approach as a baseline for comparing the performance of more complex models.
The second model is a Support Vector Machine (SVM) optimized with Stochastic Gradient Descent (SGD), a linear classifier that maximizes the margin between decision boundaries by optimizing a hinge loss function [26]. Unlike traditional kernel-based SVMs, the SGD-trained SVM prioritizes efficiency over non-linearity, making it a practical choice for large-scale text datasets. SVM was selected due to its efficiency in handling high-dimensional text data. Prior studies [22] demonstrate that this model performs well in detecting spurious correlations in datasets similar to those used in this research.
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based architecture that processes textual data using self-attention mechanisms [27]. It captures bidirectional dependencies, providing a deeper contextual understanding of words and sentences. Pretrained on large corpora using masked language modeling, BERT undergoes fine-tuning to adapt to specific classification tasks. This study uses BERT in two roles: as a Word Embedding extractor for the linear models and as a binary classifier. By generating dense, contextually rich representations, BERT enhances the ability of linear models to identify relevant patterns. As a classifier, it allows an analysis of its sensitivity to spurious patterns compared to linear models. Additionally, its compatibility with the LIME post hoc explainability technique facilitates model interpretation, which is essential for this research.
Term Frequency–Inverse Document Frequency (TFIDF) represents text numerically by weighting term occurrences based on their relevance across a document set [28]. TF measures how often a word appears in a document, while IDF adjusts this frequency according to how commonly the word appears in the entire corpus. This study selects TFIDF because it enables linear models to extract relevant patterns, and prior research [22] showed its effectiveness in detecting spurious correlations in a dataset similar to the ones used in this study.
In this study, we generate Word Embeddings (WEs) using the BERTimbau model via the Sentence-BERT (SBERT) framework, which optimizes BERT for sentence-level embeddings by transforming each sentence into a dense vector representation. The SVM and LRG classifiers use these embeddings as input features. Since these models do not fine-tune embeddings, we keep them fixed during training. These embeddings encode semantic relationships, allowing linear models to make predictions based on richer text representations while maintaining interpretability [29].
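The following sketch shows how fixed sentence embeddings of this kind can feed a linear classifier; the SBERT checkpoint identifier and the toy sentences are assumptions, not the exact configuration used in this study.

```python
# Sketch of extracting fixed sentence embeddings for the linear classifiers.
# The model identifier is an assumed BERTimbau checkpoint loaded through SBERT.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import SGDClassifier

sbert = SentenceTransformer("neuralmind/bert-base-portuguese-cased")  # assumed checkpoint

sentences = ["aquisicao de materiais hospitalares", "contratacao de empresa de limpeza"]
labels = [1, 0]

# Embeddings stay fixed; only the linear classifier is trained on them.
embeddings = sbert.encode(sentences)          # one dense vector per sentence
clf = SGDClassifier(loss="hinge").fit(embeddings, labels)
print(clf.predict(sbert.encode(["aquisicao de luvas descartaveis"])))
```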
3.2. Explainable Artificial Intelligence (XAI)
Explainable Artificial Intelligence (XAI) techniques enhance the interpretability of AI models, making their predictions more transparent [8]. Although not their primary purpose, these techniques help identify spurious patterns and causal relationships [10,30]. XAI approaches can be local, such as Local Interpretable Model-Agnostic Explanations (LIME), which explains individual predictions by generating perturbed versions of an input instance and fitting a simple interpretable model to approximate the decision boundary of the complex model in that local region. Conversely, global methods, such as SHapley Additive exPlanations (SHAP), quantify feature importance across the entire model by computing Shapley values, which estimate each feature’s marginal contribution based on cooperative game theory [20].
3.3. Unsupervised Learning
Unsupervised learning is a machine learning approach that analyzes unlabeled data to identify inherent patterns and structures without predefined categories. Unlike supervised learning, which relies on labeled datasets, unsupervised learning uncovers relationships and clusters within the data based on similarity metrics [31]. This study applies K-means, the Elbow Method, and Principal Component Analysis (PCA) to explore the statistical properties of datasets and create logical and interpretable clusters. PCA reduces dimensionality by transforming correlated variables into uncorrelated principal components, preserving most variance, simplifying complex data analysis, and aiding in pattern identification [32]. The Elbow Method determines the optimal k value by evaluating the Within-Cluster Sum of Squares (WCSS) reduction to identify the balance between accuracy and complexity [33]. K-means partitions data into k clusters, minimizing the sum of squared distances between points and their centroids.
This study applies K-means to segment spurious patterns into clusters, providing a structured way to analyze their distribution. We choose K-means based on its ability to assign patterns to clusters using centroids that summarize the main characteristics of each group [34]. Furthermore, this study hypothesizes that patterns further from the centroid may exhibit properties associated with spurious correlations, as they deviate more from the core structure of the data. Similarly, the study explores whether outliers in the clustering process correspond to patterns that significantly influence model errors.
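A minimal sketch of how these three techniques fit together is shown below, assuming a synthetic metric matrix and a candidate range of k; the WCSS values returned by K-means drive the elbow decision.

```python
# Sketch: PCA for dimensionality reduction, the elbow method (WCSS) to pick k, then K-means.
# The data matrix and the candidate range of k are synthetic assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # stand-in for the pattern-metric matrix

X_reduced = PCA(n_components=3).fit_transform(X)

wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
    wcss.append(km.inertia_)                 # within-cluster sum of squares

# The "elbow" is the k after which the WCSS reduction flattens out.
print(list(zip(range(1, 10), np.round(wcss, 1))))
```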
4. Method
The proposed method organizes the process into a pipeline comprising seven stages—Preprocessing, Train/Test Model, Extract Important Words, Analyze Errors, Extract Investigation Patterns, Extract Potential Spurious Patterns, and Cluster Patterns—as illustrated in Figure 1. Overlapping lines in the stages indicate cross-validation ($k = 5$) to evaluate the pipeline in different contexts. Each stage applies techniques based on the specific model, ensuring the approach remains model-agnostic. The concepts and notations follow the conventions defined by [4,35].
4.1. Datasets
We used datasets specific to the context of the TCE-PI, named (i) Contracts Dataset (C) and (ii) Bidding Dataset (L).
The Contracts Dataset [36] supports the development of an automatic classification model for public administration contracts, focusing on COVID-19-related expenditures. A team of 12 TCE-PI specialists manually labeled the dataset by analyzing contracts published in the Official Gazettes of Piauí between March and September 2020. Each specialist labeled a specific set of contract descriptions, classifying them into healthcare-specific acquisitions (label 1) and other procurements (label 0). To improve consistency, another auditor reviewed a sample of the annotations. When auditors disagreed, the chief auditor made the final decision, ensuring consensus in contentious cases. This structured process—combining individual annotations, cross-reviews, and expert arbitration—ensured a reliable and well-curated dataset.
For this study, the team expanded it with additional contract descriptions from Sistema Contratos—Web (https://sistemas.tce.pi.gov.br/contratosweb/, accessed on 19 March 2025), a TCE-PI platform where jurisdictional entities register contracts as the law requires. The specialists applied the same labeling process to these new entries, adding 1727 sentences.
Table 1 summarizes the dataset quantities before and after the expansion.
The Bidding Dataset is being developed to evaluate an architectural approach that identifies signs of fraud in public procurement, as described in [37]. It consists of procurement notices published between 2012 and 2023, recorded in the Licitações-Web (https://sistemas.tce.pi.gov.br/licitacoesweb/, accessed on 19 March 2025) system of TCE-PI. The system allows jurisdictional entities to register their contracts as required by current legislation. The data extraction process used regular expressions to isolate the section describing the procured object. To verify the accuracy of the extraction, we trained a BERT model specifically for this task. A panel of three public procurement specialists manually labeled the data into four main categories, as described in [38].
Table 2 shows the quantities of labeled data for each class at the time of dataset release for our research:
For this study, we expanded the original dataset using Active Learning [39] with specific adaptations for binary classification. We isolated sentences describing the procurement object. We adapted the classes by grouping “Engineering Works Contracting Services” and “General Services Contracting” as class 0 and “Acquisition of Permanent Goods” and “Acquisition of Consumable Goods” as class 1. We used the labeled dataset to train a BERT model, applying it to 200 additional sentences, which specialists reviewed and corrected. This cycle of classification and review continued until the quantities shown in Table 3 were reached.
We define the dataset as follows. Let $W$ be the set of words and symbols in a language, and let $S \subseteq W^{+}$ be the set of valid sentences, where $W^{+}$ represents the set of all finite non-empty sequences of elements from $W$. The dataset is partitioned into disjoint training and testing sets, $D_{train}$ and $D_{test}$, ensuring $D_{train} \cap D_{test} = \emptyset$. The classification task is defined over the label set $Y = \{0, 1\}$, where each sentence is assigned a unique label. The training and testing datasets are then given by $D_{train} = \{(s, y) \mid s \in S,\, y \in Y\}$ and $D_{test} = \{(s', y) \mid s' \in S,\, y \in Y\}$, where each sentence $s$ or $s'$ is uniquely associated with a single label $y$.
4.2. Preprocessing
For the LRG and SVM models, we applied text cleaning, tokenization, stop word removal, and TFIDF vectorization. For the BERT model, we performed subword tokenization using WordPiece, added unique tokens ([CLS] and [SEP]), applied padding to standardize text length, and used segment embeddings to differentiate text segments.
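A minimal sketch of this preprocessing for the linear models is given below; the cleaning regex and the NLTK Portuguese stop word list are assumptions standing in for the exact rules used in this work.

```python
# Sketch of the preprocessing used for the linear models; the cleaning rules and
# Portuguese stop word list are assumptions, not the exact pipeline of the paper.
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
stop_words = nltk.corpus.stopwords.words("portuguese")

def clean(text: str) -> str:
    text = text.lower()
    return re.sub(r"[^a-zà-ú\s]", " ", text)   # keep letters (including accents) only

corpus = ["Aquisição de materiais hospitalares.", "Contratação de empresa de limpeza!"]
vectorizer = TfidfVectorizer(preprocessor=clean, stop_words=stop_words)
X = vectorizer.fit_transform(corpus)            # sparse TFIDF matrix consumed by LRG/SVM
print(X.shape, vectorizer.get_feature_names_out()[:5])
```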
4.3. Train/Test the Model
We structured the training and testing process by implementing a k-fold cross-validation with k = 5, generating five independent folds. These folds functioned as fixed partitions of the dataset, ensuring that each instance participated in training and testing in different contexts. This approach was applied at all stages of the method, as per the overlapping lines shown in Figure 1.
For the LRG and SVM models, we optimized hyperparameters using Bayesian optimization [40], which was performed exclusively on the training set of each fold. Within each fold, we conducted an additional internal cross-validation using only the training data to guide the search for optimal hyperparameters. After selecting the best configuration, we trained the model on the entire training portion of the fold and evaluated it on its corresponding test set, ensuring a robust and unbiased assessment. For the BERT model, we applied fine-tuning, regularization, dropout, checkpoints, early stopping, pruning, and padding to enhance performance.
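The sketch below illustrates this nested scheme for the linear models, using BayesSearchCV from scikit-optimize as a stand-in for the Bayesian optimization step; the estimator, search space, and iteration budget are assumptions.

```python
# Sketch of the nested evaluation: an outer 5-fold split for assessment and an inner
# cross-validated Bayesian search performed on the training portion only.
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from skopt import BayesSearchCV

def nested_evaluation(X, y):
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        search = BayesSearchCV(
            LogisticRegression(max_iter=1000),
            {"C": (1e-3, 1e3, "log-uniform")},   # assumed search space
            n_iter=20, cv=3, random_state=0,
        )
        search.fit(X[train_idx], y[train_idx])   # hyperparameters chosen on training data only
        scores.append(search.score(X[test_idx], y[test_idx]))
    return scores
```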
We define the model trained on the training data as $M = f \circ g$, where $g$ converts a sentence $s$ into a vector representation $x$ and $f$ maps $x$ to a predicted class.
For the LRG and SVM models, g applies two distinct text representations: TFIDF and Word Embeddings. When using TFIDF, g transforms a sentence into a sparse vector based on term frequency-inverse document frequency weighting. When using Word Embeddings, g tokenizes the text and processes it through a pre-trained BERT model, generating dense vector representations that encode contextual information. These embeddings remain fixed during training and serve as input for the linear classifiers. The classification function remains a linear mapping, learning decision boundaries in both TFIDF and Word Embedding spaces. The training process optimizes f while keeping g fixed, ensuring that the classifiers leverage lexical and semantic information without modifying the underlying embeddings.
For the BERT model, g applies the BERT tokenizer and processes the tokenized text through transformer layers to generate contextual embeddings. Unlike linear models, where embeddings are pre-computed and static, BERT fine-tunes its parameters during training. The classification function in BERT consists of a dense network applied to the [CLS] token embedding, using sigmoid or softmax activation for classification. In this case, both f and g are optimized during training, allowing BERT to refine its representations based on task-specific data.
Another task in this step catalogs model errors by defining $E$ as the set of sentence-class tuples from the test set where the model made incorrect predictions, as shown in the following notation:

$E = \{ (s_e, y) \in D_{test} \mid M(s_e) \neq y \}$

Here, $s_e$ represents the error sentence, $y$ is its actual class, and $M(s_e)$ is the model’s prediction. This set includes only the tuples where $y$ differs from the prediction.
4.4. Analyze Model Errors
The error analysis identifies patterns that contribute to incorrect model predictions. We examine each error sentence ($s_e$) and similar sentences from the training set ($s_t$). We divide similar sentences into two categories: those with the same class label as $s_e$ (similar sentences from the same class) and those with the opposite class label (similar sentences from the opposite class). These sentences influence the model’s incorrect predictions [12,13]. Figure 2 illustrates this relationship, where $s_e$ is the error sentence, $s_t$ represents a similar sentence in the training set, and $sim(s_e, s_t)$ indicates the similarity value between $s_e$ and $s_t$. Similar sentences from the same class as $s_e$ are shown in orange, while sentences from the opposite class are in blue.
For the SVM and LRG models, we used cosine similarity with TFIDF and Word Embeddings. For the BERT model, we applied BERTScore [41], which incorporates the context of words through contextualized embeddings. The final result produces a structure $A = (A_0, A_1)$ with two dimensions corresponding to classes 0 and 1, as shown in the following notation:

$A_y = \{ (s_e, S^{=}_{s_e}, S^{\neq}_{s_e}) \mid (s_e, y) \in E \}$

Here, $s_e$ represents the error sentence from class $y$; $S^{=}_{s_e}$ denotes the set of sentences similar to $s_e$ within the same class:

$S^{=}_{s_e} = \{ s_t \mid (s_t, y) \in D_{train},\; sim(s_e, s_t) \geq \delta \}$

while $S^{\neq}_{s_e}$ represents the set of sentences similar to $s_e$ from the opposite class:

$S^{\neq}_{s_e} = \{ s_t \mid (s_t, 1 - y) \in D_{train},\; sim(s_e, s_t) \geq \delta \}$

The function $sim$ measures the similarity between $s_e$ and $s_t$, and $\delta$ represents the minimum similarity threshold.
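A simple sketch of how the same-class and opposite-class sets can be assembled with cosine similarity is shown below; the vector inputs and the threshold value are placeholders, and BERTScore would replace the similarity function for the BERT model.

```python
# Sketch of collecting similar training sentences for one error sentence, using cosine
# similarity over TFIDF (or embedding) vectors; the threshold value stands in for delta.
from sklearn.metrics.pairwise import cosine_similarity

def similar_sets(error_vec, train_vecs, train_labels, error_label, delta=0.5):
    """error_vec: shape (1, n); train_vecs: shape (m, n); train_labels: length-m sequence."""
    sims = cosine_similarity(error_vec, train_vecs).ravel()
    same, opposite = [], []
    for idx, (sim, label) in enumerate(zip(sims, train_labels)):
        if sim >= delta:
            (same if label == error_label else opposite).append((idx, float(sim)))
    return same, opposite   # the same-class and opposite-class sets for this error sentence
```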
4.5. Extracting Important Words
This phase aims to identify the most relevant words per class and their corresponding weights. We define specific requirements for the explainers, as shown in the following notation:

$\mathcal{E}(M, s) = (I_0(s), I_1(s)), \quad I_y(s) = \{ (t, w) \}$

Here, $\mathcal{E}$ represents the explainer that, given a model $M$ and a sentence $s$, generates sets of ordered pairs $(t, w)$. Each pair consists of a token $t$ and a weight $w$, where $w$ reflects the importance of $t$ for class $y$, providing insights into how $M$ makes predictions. For intrinsically explainable models, $\mathcal{E}$ refers to the model’s internal explanation mechanism; for example, in Logistic Regression models, the coefficients (weights) indicate the strength and direction of the association between a feature and the prediction [42]. For a sentence $s$, the process generates two sets: $I_0(s)$, which contains the most relevant tokens and their weights for class 0, and $I_1(s)$, which is the same for class 1.
To achieve this phase’s goal, we calculate the global importance of each token by aggregating its weights across sentences. Here, $W_y(t)$ represents the global weight of the token $t$ for class $y$. This process produces two sets, $G_0$ and $G_1$, which contain the most relevant tokens and their weights for classes 0 and 1, respectively.
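The sketch below shows one way to aggregate per-sentence explanation pairs into global token weights; summing the weights is an assumed aggregation choice, not necessarily the exact rule used in this work.

```python
# Sketch of turning per-sentence explanation pairs (token, weight) into global weights
# per class; the summation is an assumed aggregation choice.
from collections import defaultdict

def global_importance(per_sentence_explanations):
    """per_sentence_explanations: list of dicts {class_label: [(token, weight), ...]}."""
    global_weights = {0: defaultdict(float), 1: defaultdict(float)}
    for explanation in per_sentence_explanations:
        for label, pairs in explanation.items():
            for token, weight in pairs:
                global_weights[label][token] += weight
    # Keep the most relevant tokens per class, ordered by global weight.
    return {label: sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
            for label, weights in global_weights.items()}
```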
4.6. Extracting Investigation Patterns
In this study, we define investigation patterns as word combinations that frequently appear in error sentences and are considered important for the opposite class. If a pattern often occurs in sentences similar to the error sentence but belonging to the opposite class, we interpret it as the model possibly relying on it as a shortcut. Conversely, if a pattern appears in sentences similar to the error sentence within the same class, we interpret it as the model possibly using this pattern as a confounding factor.
We define the investigation patterns $P$ as word combinations as follows: (i) they appear in error sentences of a specific class and in similar sentences from the opposite class; (ii) they consist of the $m$ most important words for the class; and (iii) they are present in both error sentences and similar sentences from the same class and the opposite class. The formal definition of investigation patterns is given by

$P_y = \{ c \mid c \subseteq T^{y}_{m},\; c \subseteq s_e,\; \exists\, s \in S^{=}_{s_e}: c \subseteq s,\; \exists\, s' \in S^{\neq}_{s_e}: c \subseteq s' \}$

where $c$ represents a word combination $(t_1, \ldots, t_k)$, $c \subseteq T^{y}_{m}$ indicates that each word in the pattern is among the $m$ most important words for class $y$, $c \subseteq s_e$ means that the error sentence $s_e$ includes the pattern, and the last two conditions indicate that $c$ must appear in similar sentences from both the same class ($S^{=}_{s_e}$) and the opposite class ($S^{\neq}_{s_e}$).
This definition identifies patterns that may contribute to confusion between classes and considers them as potential artifacts or confounders. To measure the spuriousness degree of each pattern, in later steps, we calculate the frequency metrics of these patterns in different contexts: (i) error sentences in the same class and in the opposite class; (ii) similar sentences in the same class and in the opposite class. We also calculate the ratio metrics between these frequencies, as shown in Table 4.
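A sketch of this pattern-extraction logic is given below; the function name and data structures (token lists for the error sentence and its similar sentences) are illustrative assumptions.

```python
# Sketch of extracting investigation patterns: combinations of up to three top-m words
# that appear in the error sentence and in similar sentences of both classes.
from itertools import combinations

def investigation_patterns(error_tokens, same_class_sents, opposite_class_sents,
                           top_m_words, max_size=3):
    """same_class_sents / opposite_class_sents: iterables of token collections."""
    candidate_words = [t for t in set(error_tokens) if t in top_m_words]
    patterns = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(candidate_words), size):
            in_same = any(all(w in s for w in combo) for s in same_class_sents)
            in_opposite = any(all(w in s for w in combo) for s in opposite_class_sents)
            if in_same and in_opposite:
                patterns.append(combo)
    return patterns   # candidates later scored by frequency and perturbation metrics
```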
4.7. Extracting Potential Spurious Patterns
This step identifies potential spurious patterns by analyzing how removing the patterns selected in the previous step affects specificity and sensitivity metrics. The process perturbs the test set by removing each pattern $p$ and then evaluates the impact of this removal on the predictions of $M$.
Initially, we calculate an initial set of metrics $m$, including specificity, sensitivity, true positives, and true negatives, based on the evaluation of $M$ on $D_{test}$. Then, for each class $y$ and for each pattern $p \in P_y$, we create a perturbed version $D^{p}_{test}$ by removing $p$ from $D_{test}$. The model is then evaluated on $D^{p}_{test}$, producing a new set of metrics $m^{p}$. We represent the difference between the original metrics $m$ and the perturbed metrics as $\Delta = m^{p} - m$.
If $\Delta$ indicates an improvement in specificity or sensitivity, we consider the pattern $p$ potentially associated with a spurious correlation. In this case, we evaluate the impact of the pattern using the following metrics: the number of perturbed sentences, the perturbation rate, the number of correct predictions, and the accuracy rate, as shown in Table 4. The result of this step is a set of patterns $p$ accompanied by the calculated metrics. We integrate these metrics with those from the previous step to determine the spuriousness degree of each pattern, as explained below.
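The perturbation test can be sketched as follows; the vectorize callable and the whitespace-based word removal are simplifying assumptions for illustration.

```python
# Sketch of the perturbation test: remove a pattern's words from the test sentences,
# re-evaluate the model, and compare sensitivity/specificity with the original values.
from sklearn.metrics import confusion_matrix

def sens_spec(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)        # sensitivity, specificity

def perturbation_effect(model, vectorize, test_texts, y_true, pattern):
    """vectorize: hypothetical callable mapping raw texts to model-ready features."""
    sens0, spec0 = sens_spec(y_true, model.predict(vectorize(test_texts)))
    perturbed = [" ".join(w for w in t.split() if w not in pattern) for t in test_texts]
    sens1, spec1 = sens_spec(y_true, model.predict(vectorize(perturbed)))
    # Positive deltas flag the pattern as potentially spurious.
    return sens1 - sens0, spec1 - spec0
```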
4.8. Clustering Patterns
In this step, we apply unsupervised learning to group patterns into logical and interpretable clusters, identifying latent structures and facilitating the visualization of data distribution. We consider the distance of patterns to centroids as the main factor for cluster interpretation [34]. We hypothesize that patterns farther from centroids (potential outliers) are more likely to be spurious.
We define an algorithm composed of steps to perform clustering and organize the visualization of clusters. Each step of this algorithm can be applied with different unsupervised learning techniques, providing flexibility and independence from specific methods. The input metrics for the algorithm are derived from the concatenation of the metrics for common patterns selected in the steps described in Section 4.5, Section 4.6 and Section 4.7. Table 4 summarizes the metrics used in cluster formation, where $p$ represents the pattern, $c$ represents the class for which it is considered important, and the metrics include importance weights, frequencies, ratios between frequencies in different contexts (error sentences and similar sentences), and perturbation quantities and rates. The steps of the algorithm are normalization, dimensionality reduction, determining the number of clusters, clustering, and visualization.
In this study, we applied Standard Normalization to standardize variables with different scales, ensuring comparability while also being sensitive to outliers [43]. For dimensionality reduction, we used PCA due to its ability to preserve most of the variance in the data. We used the elbow method to determine the number of clusters, clearly balancing granularity and cohesion. We selected K-means because it aligns with hypotheses related to the distance of patterns to centroids [34] and has been applied to identify patterns in datasets [44]. For visualization, we used 3D plots to highlight the overall shape of clusters and outliers and bar charts to emphasize the distance of patterns from centroids.
This algorithm analyzes hidden structures without requiring labels and highlights correlations among the input metrics. As discussed in later sections, the approach enables interactive visualization and uncovers unexpected associations.
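A compact sketch of this clustering stage, combining standardization, PCA, K-means, and the distance-to-centroid ranking, is shown below; the function and variable names are illustrative.

```python
# Sketch of the clustering stage: standardize the pattern metrics, project with PCA,
# cluster with K-means, and rank patterns by their distance to the assigned centroid.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def rank_by_centroid_distance(metric_matrix, pattern_names, k=2):
    X = StandardScaler().fit_transform(metric_matrix)
    X = PCA(n_components=3).fit_transform(X)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distances = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    order = np.argsort(-distances)       # farthest first: higher spuriousness potential
    return [(pattern_names[i], int(km.labels_[i]), float(distances[i])) for i in order]
```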
5. Model Configuration
This section presents the setup of the models described previously in Section 3. We employ Logistic Regression (LRG), Support Vector Machine with Stochastic Gradient Descent (SVM-SGD), and BERTimbau (BERT), as described below.
5.1. Text Representation
The models rely on two types of text representations: Term Frequency–Inverse Document Frequency (TF-IDF) and Word Embeddings (WE). To generate WE, the BERTimbau model, integrated with the Sentence-BERT (SBERT) framework, encodes each sentence into a dense vector optimized for sentence-level tasks. The embeddings remain fixed during training in the linear classifiers, serving as static input features. Since LRG and SVM do not adjust embeddings dynamically, they operate directly on these precomputed vectors. In contrast, BERT processes tokenized text and updates its embeddings through fine-tuning.
5.2. Linear Classifiers: LRG and SVM
The evaluation of LRG and SVM considers both TF-IDF and WE representations. The Bayesian Search algorithm optimizes hyperparameters, selecting the best value of $C$ for LRG and the corresponding regularization hyperparameter for SVM. A five-fold cross-validation strategy ensures robustness in model selection. Table 5 summarizes the configurations applied to these models.
The training process for LRG includes L2 regularization with a maximum of 1000 iterations. The SVM model applies a stochastic gradient descent (SGD) optimizer, ensuring efficient parameter updates.
5.3. Transformer-Based Classifier: BERT
The classification model fine-tunes BERTimbau for binary classification. Each sentence is split into smaller units (subword tokens), marked with special indicators ([CLS] and [SEP]), adjusted for length, and converted into numbers that the model can process. The model processes the tokenized input through transformer layers, extracting contextualized embeddings that capture semantic relationships between words. A fully connected classification layer receives these embeddings and applies a softmax activation function to compute the probability distribution over the two target classes. The model assigns the final class label based on the highest probability.
The training process optimizes model weights using the Adam optimizer with a learning rate of $2 \times 10^{-5}$. The training runs for three epochs with a batch size of 4. The training includes dropout regularization (0.1), early stopping, and model checkpointing to improve generalization and prevent overfitting. The model selection criterion prioritizes the highest F1-score during validation.
Table 6 presents the BERT configuration details.
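A sketch of this fine-tuning setup with the Hugging Face Trainer is shown below; the BERTimbau checkpoint identifier and the tokenized dataset objects are assumptions, while the hyperparameters mirror those reported in this section.

```python
# Sketch of the fine-tuning configuration described above using the Hugging Face Trainer;
# the checkpoint identifier and the tokenized datasets are assumptions.
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

CHECKPOINT = "neuralmind/bert-base-portuguese-cased"  # assumed BERTimbau identifier

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds)}

def fine_tune(train_dataset, eval_dataset):
    """train_dataset / eval_dataset are assumed to be tokenized torch datasets."""
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
    args = TrainingArguments(
        output_dir="bert-contracts",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,       # keep the checkpoint with the best F1
        metric_for_best_model="f1",
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset,
                      compute_metrics=compute_metrics,
                      callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
    trainer.train()
    return trainer
```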
We executed the experiments in Python 3.9 using the Scikit-Learn 1.4.1, Torch 2.1.1, and Transformers 4.35.2 libraries, along with the pre-trained BERTimbau model [27].
6. Experiments and Results
This section presents the results of our analyses to evaluate the proposed method. We conducted experiments using 20 combinations of models, text representation methods (TF-IDF and WE), datasets (Contracts and Bidding), and classes (0 and 1). Each combination followed the configurations detailed in Section 5 to ensure consistency in the evaluation, but we detail only one cluster-interpretation scenario. This choice serves two purposes: (i) presenting all combinations would unnecessarily lengthen the document and introduce redundancy, reducing clarity; (ii) the selected scenario represents the general trends observed across all results and effectively illustrates the study’s hypotheses and objectives.
The preprocessing and Train/Test Model steps revealed direct relationships between the dataset characteristics, text representation, and model performance.
Table 7 summarizes the results obtained for the Contracts and Bidding datasets.
In the Contracts dataset, which exhibits lower lexical diversity (5662 distinct words) and shorter sentences (an average of 30.5 words), the performance of the models varied depending on the text representation. When using TFIDF, the linear models (LRG, SVM) achieved high accuracies of 94.58% and 94.20%, with sensitivities of 94.24% and 94.34% and specificities of 94.93% and 94.90%, respectively. However, with Word Embeddings, both models experienced a drop in performance: LRG reached 84.23% accuracy, 83.19% sensitivity, and 85.27% specificity, while SVM obtained 86.03% accuracy, 81.48% sensitivity, and 90.58% specificity. The BERT model, which inherently uses embeddings, maintained a strong performance, achieving 94.97% accuracy, 95.39% sensitivity, and 94.93% specificity. We observed a similar trend in the Bidding dataset, which has higher lexical diversity (7960 distinct words) and longer sentences (an average of 43.95 words). With TFIDF, the linear models performed well, reaching 97.92% (LRG) and 97.52% (SVM) accuracy, with sensitivities of 97.88% and 97.52% and specificities of 97.96% and 97.54%, respectively. However, with Word Embeddings, their performance declined: LRG achieved 93.16% accuracy, 93.71% sensitivity, and 92.63% specificity, while SVM obtained 93.80% accuracy, 94.36% sensitivity, and 93.26% specificity. The BERT model, leveraging its native embeddings, continued to perform well, with 97.58% accuracy, 97.61% sensitivity, and 97.55% specificity.
BERT effectively handles data with high lexical diversity and complex textual structures, maintaining strong performance across both datasets. LRG and SVM perform well with TFIDF in datasets with constrained vocabulary and shorter sentences but decline with Word Embeddings, struggling to fully exploit their richer semantic information.
The model error analysis highlighted specific patterns that contributed to misclassifications.
Table 8 presents the most representative sentence for each model and text representation in the Contracts dataset that influenced errors in the class Other procurements (0). The method identifies the representative sentence as the most frequently occurring similar sentence in the opposite class. The SVM and LRG models with Word Embeddings tended to confuse sentences related to procurement. In particular, the SVM model frequently misclassified sentences such as the example marked * in Table 8, suggesting that the embedding-based representation captured semantic relationships between different procurement categories but failed to correctly distinguish between them.
The BERT model struggled with sentences related to service contracting. The most representative error sentence (marked † in Table 8) indicates that the model may confuse similar contracts involving different types of products and services. On the other hand, the SVM and LRG models with TFIDF exhibited errors associated with frequent words. For instance, the LRG model had difficulty with sentences such as the one marked ¶, suggesting that highly recurrent terms like “contrato” (contract), “aquisição” (acquisition), and “municipal” (municipal) led to an overestimation of their importance.
These findings reinforce the need for further steps in the pipeline to mitigate these issues, particularly by identifying and addressing spurious correlations that mislead the models.
In the “Extract Important Words” step, we applied LIME for SVM (using TFIDF and Word Embeddings) and for BERT, given its strong performance in similar datasets [22]. For LRG, we used intrinsic coefficients to assess word importance. The analysis revealed differences in word importance assignment across models with distinct text representations. Figure 3 illustrates how importance and frequency correlate for the top 20 words in SVM with TF-IDF, SVM with Word Embeddings, and BERT, focusing on the “Other procurements” (0) class of the Contracts dataset (C).
SVM with TFIDF (Figure 3a) prioritized frequent words like “contratacao” (contracting) and “servicos” (services), highlighting its reliance on word frequency over semantic meaning. With Word Embeddings (Figure 3b), the emphasis shifted slightly, incorporating less frequent but contextually relevant terms like “social” (social) and “limpeza” (cleaning), which remain closely related to public procurement contexts; frequency bias nonetheless remained dominant, and the LRG model exhibited a similar tendency. BERT, using contextual embeddings (Figure 3c), emphasized less frequent but semantically meaningful words like “totem” (totem) and “dispensador” (dispenser), which were commonly purchased during the COVID-19 pandemic but were not exclusive to healthcare.
These results align with structural differences: linear models prioritize frequency, while BERT identifies semantic relationships. Word Embeddings in linear models capture some context but remain frequency-driven.
We do not present the results from the “Extract Investigation Patterns” and “Extract Potential Spurious Patterns” steps individually, as these results are closely integrated with the “Cluster Patterns” step, and presenting them separately would introduce redundancy and a lack of integration. We present the “Clustering Patterns” results applied to the SVM model with TFIDF on the Contracts dataset for the class “Other procurements” and compare them with SVM using Word Embeddings and BERT. These results reflect the general trend observed across combinations. In these steps, we used $m = 50$, corresponding to the top 50 most important words extracted in the previous step, a minimum similarity threshold $\delta$ set to ensure the selection of similar sentences in both classes, and a maximum pattern size of three words.
Figure 4 shows the heatmap illustrating the contribution of each metric to the three principal components based on PCA weights. Warmer colors (red shades) indicate positive contributions, with higher intensity representing a stronger influence on the component. Cooler colors (blue shades) represent negative contributions, with darker shades indicating a stronger negative influence. Neutral or near-zero values are shown in light colors, signifying a minimal impact.
6.1. Interpreting PCA and Clusters in the “Other Acquisitions” Class
For PC1, the variables with the highest contributions are the frequency in similar sentences from the same class (0.37), the frequency in similar sentences from the opposite class (0.38), the frequency in error sentences from the same class (0.34), the frequency in error sentences from the opposite class (0.38), the perturbation rate (0.37), and the perturbation count (0.37). These metrics indicate that PC1 is strongly associated with the frequency of patterns across different contexts (training and error) and the perturbations applied to the dataset. The proximity of the loading values suggests that PC1 captures an integrated behavior between frequency metrics and the effects of perturbations, revealing the model’s sensitivity to highly frequent patterns in training and testing and highlighting how perturbations directly influence frequency-related metrics.
In PC2, the most influential metrics are the ratio of frequencies in similar sentences from the opposite class and the same class (−0.46), the ratio of error frequencies from the opposite class and the same class (0.51), the accuracy improvement rate after perturbation (0.45), and the number of corrected predictions (0.44). PC2 captures the relative impact of correlations between classes and the performance gains achieved by removing the analyzed patterns. The presence of the ratio metrics suggests that PC2 reflects the model’s reliance on patterns influencing both classes, while the correction metrics indicate how these relationships can be leveraged to correct incorrect predictions. The balance between proportion metrics and correction metrics highlights the interaction between relative frequency and adjusted performance.
In PC3, the relevant metrics are the global weight of the pattern (0.38) together with two error-frequency metrics (−0.44 and 0.37). This principal component focuses on the influence of specific patterns with high frequencies in error sentences and their global importance. The simultaneous presence of the global weight and the error frequencies indicates that PC3 captures how individually important patterns contribute to classification errors, reflecting a more localized behavior in the error space.
In summary, (i) PC1 reflects the influence of pattern frequencies and perturbations on predictions; (ii) PC2 captures the relationships between relative frequencies and performance gains from pattern removal; and (iii) PC3 emphasizes the impact of specific patterns on classification errors.
Table 9 presents the contributions of metrics to the principal components (PC1, PC2, and PC3) for SVM-C0.
The cluster analysis tested the hypothesis that patterns farther from centroids have a higher potential for spuriousness due to their variability. To validate this, we analyzed the patterns within each cluster based on their projections (distance to the centroid) in the principal components (PC1, PC2, and PC3) and their respective metrics.
Figure 5 presents a three-dimensional plot of the clustering result obtained using K-means with k = 2 (determined from the elbow method for the three components).
6.1.1. Cluster 0 Analysis
Closest Pattern: The pattern “empresa” (company) has a distance of 0.192. This pattern shows balanced projections (relative to the centroid) across the three principal components, aligning with the average characteristics of the cluster. In PC1 (3.02), associated with frequency metrics, the pattern reflects moderate values for the frequencies in similar sentences from the same class and from the opposite class, indicating low frequency in similar sentences. In PC2 (0.95), which captures ratios between frequencies, the pattern shows consistent values for the ratios of frequencies in similar sentences and in error sentences, demonstrating stability in the relationship between frequencies of similar and error sentences. Finally, in PC3 (0.72), related to global weights and frequencies in error sentences, the global weight of the pattern exerts a moderate influence, indicating alignment with the average characteristics of the cluster.
Most Distant Patterns: The most distant patterns from the centroid displayed extreme projections. The pattern “servicos” (services), with a distance of 4.062, showed a high projection in PC1 (3.02) due to high frequencies in similar sentences and error sentences from the opposite class. In PC2 (3.76), this pattern depended on ratio metrics, mainly the ratio of frequencies in error sentences, indicating significant influence in error contexts. In PC3, the global weight reinforced the model’s reliance on this pattern. The pattern “contratacao empresa municipio” (contracting company municipality), with a distance of 4.270, exhibited high projections in PC3 (3.16), driven by the global weight of the pattern and the frequency in error sentences from the opposite class. In PC1 and PC2, the values were moderate, suggesting that the pattern’s influence is primarily concentrated in errors.
6.1.2. Cluster 1 Analysis
Closest Pattern: The pattern “materiais” (materials) shows a distance of 0.455 and typical values across the three principal components. In PC1 (−1.60), associated with frequency metrics, the pattern reflects typical values for the frequencies in similar sentences from the same class and from the opposite class. In PC2 (−0.28), related to frequency ratios and corrections, the pattern exhibits projections consistent with the ratios of frequencies in similar sentences and in error sentences, indicating low variability in these contexts. Finally, in PC3 (−0.03), which captures global weights and frequencies in errors, the pattern is influenced by its global weight, reinforcing its alignment with the cluster’s average characteristics.
Most Distant Patterns: The most distant patterns, “prestacao municipio” (municipal provision), with a distance of 3.33, and “servicos municipio” (municipal services), with a distance of 3.96, exhibited extreme projections. The pattern “prestacao municipio” stood out in PC2 (3.33) due to the strong influence of the ratios of frequencies in similar sentences and in error sentences. The pattern “servicos municipio” showed high values in PC2 (3.68), again influenced by these ratios, and in PC3 (1.19), related to the global weight of the pattern, highlighting its relevance in classification errors and indicating greater variability and potential spuriousness.
6.1.3. Visualization of Distances to Centroids
The evaluation of the relationship between components PC1, PC2, and PC3 provides a comprehensive perspective on the representativeness of patterns and their potential for spuriousness. However, the distance of patterns to centroids can serve as a metric to assess this potential. Patterns located farther from the centroids tend to be less representative of the cluster and, therefore, have a higher probability of being spurious correlations, as demonstrated in previous examples. This distance reflects the degree of deviation of a pattern from the typical cluster behavior, indicating its inconsistency within the grouping [34].
In the SVM-TFIDF results (Figure 6), the closest patterns in Cluster 0, “empresa” (company) and “contratacao empresa” (contracting company), exhibit a low global weight and reduced variations in the frequency ratios between classes and between errors, indicating their stable occurrence. Conversely, the most distant patterns, “contratacao empresa municipio” (contracting company municipality) and “servicos” (services), have high global weight values and significant discrepancies between the class and error ratios, suggesting their presence may bias classification. In Cluster 1, the closest patterns, “contratacao empresa materiais” (contracting company materials) and “contratacao materiais” (contracting materials), exhibit lower oscillation in the model influence indices, while the most distant ones, “servicos municipio” (municipal services) and “prestacao municipio” (municipal provision), show significant differences between these indices, indicating a strong model dependency on these patterns.
In the SVM-WE results (Figure 7), close patterns in Cluster 0, such as “contrato prestacao” (contract provision) and “presente servicos municipio” (present municipal services), exhibit moderate global weight values and a homogeneous distribution among classes, with slight variations in the class and error ratios. In contrast, the most distant patterns, “contratacao” (contracting) and “servicos necessidade municipio” (municipal services need), have high global weight values and significant discrepancies in the error ratio between classes, suggesting the model may use them in a biased manner. In Cluster 1, the closest patterns, “servicos” (services) and “municipio” (municipality), have a moderate impact on model decisions, while the most distant ones, “contratacao” (contracting) and “servicos municipio” (municipal services), show high global weight values and significant variations in the error ratio, indicating potential model overreliance on them.
In the BERT results (Figure 8), close patterns in Cluster 0, “contratacao empresa saude” (contracting company health) and “contratacao secretaria” (contracting secretariat), present low values for the global weight and frequency metrics, suggesting they are less determinant for classification. Conversely, the most distant patterns, “contratacao empresa” (contracting company) and “empresa” (company), have high global weight values and large oscillations between the class and error ratios, reinforcing the hypothesis that they are spurious patterns. In Cluster 1, the closest patterns, “contratacao saude” (contracting health) and “contratacao saude municipio” (contracting health municipality), exhibit a balanced distribution among classes. In contrast, the most distant ones, “servicos” (services) and “contratacao” (contracting), show the highest variations in the frequency and ratio metrics, suggesting their use in a non-causal manner by the model.
The analysis of distances to the centroid reveals that closer patterns exhibit a balanced distribution among classes and have a lower impact on model errors. In contrast, more distant patterns show high variation in frequency metrics and a more significant influence on model errors and perturbations. The analysis also shows that the smaller clusters in each model and text representation contain patterns with higher spuriousness potential, characterized by a greater average distance to the centroid, higher global weight, more significant variation between training and error frequencies, and a discrepancy between error and perturbation metrics. The pattern “contratacao” (contracting), found in the smallest clusters, is highly frequent in the dataset. However, its asymmetric distribution between classes leads the model to use it as a shortcut for classification. Additionally, its high global weight and large oscillations in perturbation metrics indicate that it is a determining factor in predictions, even when no clear semantic relationship exists with the actual class of the sentence.
Patterns with greater distances from the centroid demonstrate significant deviations from the typical behavior observed within the cluster. These patterns exhibit higher variability, with extreme projections in metrics associated with frequencies and global weights. Consequently, they present a higher likelihood of being classified as spurious. In contrast, patterns closer to the centroid align more consistently with the average metrics of the cluster, reflecting typical cluster characteristics and reduced variability. The analysis validates the hypothesis, as input metric evaluations indicate that closer patterns display a lower degree of dependency, corresponding to a reduced risk of spuriousness compared to the more distant patterns.
6.2. Potential Spurious Patterns Common to Models
Evaluating the relationship between components PC1, PC2, and PC3 provides a comprehensive view of patterns’ representativeness and spuriousness potential. However, the distance of patterns to centroids can serve as a metric for assessing this potential. This distance reflects the degree of deviation of a pattern from the typical behavior of the cluster, indicating its inconsistency within the grouping [
34]. Following this, we analyze patterns identified as potentially spurious by at least three model–text-representation combinations in the datasets. Unlike a simple frequency-based approach, this analysis considers the actual context in which these patterns appear, assessing their influence on classification errors.
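The cross-combination filter described above can be expressed as a simple counting step. The sketch below assumes each model–text-representation combination yields a set of far-from-centroid patterns; the combination names and sets are placeholders, not reported results.

```python
# Illustrative sketch: keep only patterns flagged as far from their centroid by at
# least three model–text-representation combinations. Sets below are placeholders.
from collections import Counter

flagged_by_combo = {
    "SVM-WE":  {"contratacao", "servicos", "aquisicao"},
    "BERT":    {"contratacao", "servicos", "fornecimento"},
    "model_3": {"contratacao", "municipio", "aquisicao"},
    "model_4": {"servicos", "aquisicao", "covid"},
}

counts = Counter(p for flagged in flagged_by_combo.values() for p in flagged)
common_spurious = sorted(p for p, c in counts.items() if c >= 3)
print(common_spurious)  # e.g., ['aquisicao', 'contratacao', 'servicos']
```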
Potential Spurious Patterns in the Contracts and Bidding Datasets
Table 10 presents the most common patterns located far from their centroids across models in the Contracts dataset. These patterns appear in varied contract descriptions, making them unreliable indicators of class distinction. In class 0, “municipio” (municipality) is frequently mentioned in administrative contexts unrelated to the class, leading to misclassification. Similarly, “servicos” (services), “contratacao” (contracting), and “contratacao empresa” (contracting company) occur in multiple contract types, confusing the models. In class 1, “aquisicao” (acquisition) is commonly found in procurement descriptions across different areas, making it an unstable predictive feature. “Fornecimento” (supply) and “covid” appear disproportionately in healthcare-related contracts but do not always define the contract type.
In the Bidding dataset, as presented in
Table 11, patterns such as “servicos” (services) and “contratacao” (contracting) are identified as potentially spurious in class 0. These terms appear broadly in different descriptions of bidding objects, leading to misclassifications. For instance, “contratacao empresa especializada” (contracting specialized company) is used in both services and goods procurement, making it unreliable for class differentiation. In class 1, “aquisicao” (acquisition) and “fornecimento referencia” (supply reference) frequently occur in multiple contexts, demonstrating their instability as classification features.
The results confirm that frequent and generic patterns have a higher potential to be spurious, especially when they appear in different contexts and contribute to model errors. Just like the term “Spielberg”, the patterns “servicos” (services), “aquisicao” (acquisition), and “fornecimento” (supply) occur both in similar sentences within the same class and in the opposite class, as well as in error sentences, influencing incorrect predictions. These terms repeatedly appear in different scenarios, and the model interprets them as strong predictors, even without a causal relationship with the class. Moreover, since they are farther from the centroids, as analyzed in previous sections, they exhibit greater variability, reinforcing this tendency and increasing the risk of spurious association.
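As a rough illustration of the frequency argument above, the following sketch (under assumed column names and toy data) computes how often a candidate pattern occurs in each class and among misclassified sentences; strongly asymmetric rates combined with a high presence in error sentences point to a possible shortcut.

```python
# Small sketch, under assumed column names, of the frequency check discussed above.
import pandas as pd

df = pd.DataFrame({
    "text": [
        "contratacao de servicos de limpeza",
        "aquisicao de servicos e material hospitalar",
        "contratacao de empresa para obras",
        "aquisicao de equipamentos de informatica",
    ],
    "label":    [0, 1, 0, 1],                 # gold class
    "is_error": [True, True, False, False],   # model misclassified this sentence
})

def pattern_profile(data: pd.DataFrame, pattern: str) -> pd.Series:
    """Occurrence rate of `pattern` per class and within error sentences."""
    has = data["text"].str.contains(pattern, regex=False)
    return pd.Series({
        "rate_class_0": has[data["label"] == 0].mean(),
        "rate_class_1": has[data["label"] == 1].mean(),
        "rate_errors":  has[data["is_error"]].mean(),
    })

print(pattern_profile(df, "servicos"))
```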
7. Experiment on the IMDB Dataset and Comparison with Reference Methods
This section evaluates the proposed method on the IMDB dataset, a widely used benchmark for sentiment classification in textual data. The dataset contains 50,000 movie reviews from the Internet Movie Database (IMDB), with 25,000 labeled as positive and 25,000 as negative [
45]. Researchers originally developed this dataset to train and test sentiment analysis models. This study applies the proposed approach to identify spurious patterns and measure its robustness by comparing its performance with reference methods.
Table 12 presents 25 patterns identified as potentially spurious for each class. The selection includes the five patterns farthest from the cluster centroids that were detected by at least four models; the Distance column reports the average distance of each pattern from its centroid. According to the results of the “Extracting Potential Spurious Patterns” step, these patterns influence the models to classify test sentences into the associated class, regardless of context. Their presence in training sentences can distort the classification, creating associations that do not represent genuine semantic features of the text.
Analyzing these patterns helps reveal biases in NLP models and assess if certain words act as decision shortcuts, ignoring sentence context. For illustration, we examine “movie” and “best”, identified as spurious by at least four models.
The term “movie” frequently appears in reviews as a neutral element, referring to the subject of analysis (the film). However, its high occurrence in Class 0 sentences suggests that models assign an undue negative influence to its presence. For example, the sentence “I do not care if some people voted this movie to be bad. If you want the truth, this is a very good movie. It has everything a movie should have. You really should get this one.” expresses a positive opinion about the film, contradicting the assigned negative classification. This behavior suggests that the model correlates the mere presence of the word “movie” with the negative class without considering the overall text semantics.
Additionally, when comparing “movie” with a similar pattern, “movie plot”, many sentences received a Class 0 label solely because they contained this pattern, even when the context positively evaluated the plot. This observation suggests that “movie” and “plot” together should not determine classification. For example, the text, “Note these comments are for people who have seen the movie. Vanilla Sky is a brilliant, complex, and thrilling movie that existentially explores exactly what the tagline says: love, hate, dreams, life, work, play, and friends. Maybe the movie plot can come into focus for confused moviegoers if one looks at it from a different angle, considering the following… This movie becomes a fascinating exploration of a human on a journey to find himself and what that means in today’s pop culture society.” received a Class 0 (negative) label just because it contained “movie plot”, despite including praise such as “brilliant, complex, and thrilling movie” and “fascinating exploration.” This result confirms that the combination “movie plot” may act as a shortcut for negative classification when it should not play a decisive role in the model’s decision.
The term “best” usually appears in positive reviews, making it expected in Class 1 sentences. However, its isolated presence can lead to incorrect classifications when used without context. For example, the sentence “The first part of Grease with John Travolta and Olivia Newton-John is one of the best movies for teens. This one is a very bad copy. The change is only in the sex. In the first one, the good one was Sandy. Here it is Michael. I prefer to watch the first Grease.” contains the term “best”, but expresses a negative critique by unfavorably comparing the analyzed movie to another. Assigning this sentence to Class 1 suggests that the model captures only the presence of the term “best” and ignores the complete argument. The comparison with the pattern “best good” reveals a similar behavior. Many sentences containing this word combination received incorrect classifications as Class 1 solely due to its presence, even when the text expressed criticism; the text “I debated quite a bit over what rating to give this one because it is my least favorite Herschell Gordon Lewis film so far other than The Gruesome Twosome, but it has the best acting I have seen in a Lewis film. However, we all know that is not saying much. Once the movie was done, I was happy because it felt like I had been sitting through a 4-hour movie, though it was only 82 min long. I am trying to see all of HGL’s films, which is probably the only reason to see this one. The gore is good as usual—the one thing that Herschell seemed to get right. The acting is just as bad as usual with one exception. That exception is Frank Kress. Now, would I say that he is a good actor? No way. But he is good compared to everyone else. The story is boring and flat and goes nowhere, and by the end, I did not care what happened, just so long as it ended” contains the words “best” and “good” but expresses a predominantly negative evaluation of the film.
The author highlights that although the film presents the “best acting” within the director’s body of work, this does not imply that the acting is good. Additionally, the critique emphasizes that the film is “boring and flat”, reinforcing that the positive classification does not reflect the actual intent of the review. This case demonstrates that the mere presence of terms associated with positivity can lead the model to spurious correlations, resulting in inaccurate classifications.
8. Comparison with Reference Methods
We compared our method with three reference approaches: Wang and Culotta [
7], Wang and Culotta [
4], and Yadav et al. [
16], by analyzing the results of experiments conducted on the IMDB dataset.
Table 13 summarizes the main methodological differences. Wang and Culotta [
7] automatically generate counterfactuals by replacing words with antonyms to assess model robustness. Although this approach identifies spurious patterns, it is limited to isolated words and requires human intervention to annotate spurious and causal features and evaluate the counterfactuals. Wang and Culotta [
4] rely on causal inference, analyzing words with spurious influence through sentence matching. However, the need for an initial manual labeling of words as genuine or spurious constrains its applicability. Finally, Yadav et al. [
16] use Tsetlin Machines to learn interpretable patterns through logical rules. However, the reliance on manually crafted counterfactuals and the restriction to unigrams limit its generalization.
In addition to the common words identified by our method, as shown in
Table 14, only our approach can detect compound patterns, such as “movie plot” and “bad movie plot”. These patterns provide a significant advantage in analyzing spurious correlations, revealing how models can learn non-causal shortcuts. For example, the inclusion of the term “movie” in the compound pattern “movie plot” suggests that the model may associate negative reviews with the joint presence of these words rather than the isolated meaning of “plot”. The detection of “bad movie plot” reinforces this hypothesis, indicating that the combination of terms may serve as a superficial cue for classifying a film as bad without considering the actual content of the review.
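One plausible way to surface such compound candidates, within the three-word limit reported later in the limitations, is to mine frequent n-grams from misclassified sentences. The sketch below uses scikit-learn's CountVectorizer with ngram_range=(1, 3) on placeholder reviews; it illustrates the idea and is not the exact extraction routine of the method.

```python
# Sketch of extracting candidate compound patterns (up to trigrams, matching n = 3)
# from a set of misclassified reviews. The example sentences are placeholders.
from sklearn.feature_extraction.text import CountVectorizer

error_sentences = [
    "the movie plot was confusing but the acting saved it",
    "a bad movie plot ruins an otherwise decent cast",
    "this movie has the best soundtrack of the year",
]

vec = CountVectorizer(ngram_range=(1, 3), stop_words="english", min_df=2)
X = vec.fit_transform(error_sentences)

# Rank n-grams by how many error sentences they appear in.
doc_freq = (X > 0).sum(axis=0).A1
candidates = sorted(zip(vec.get_feature_names_out(), doc_freq),
                    key=lambda t: -t[1])
print(candidates[:10])  # e.g., [('movie', 3), ('movie plot', 2), ...]
```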
The reliance on human intervention varies among the compared methods. The counterfactual-based method [
7] requires manual validation to ensure that words replaced with antonyms genuinely represent spurious patterns. The causal inference approach [
4] depends on human annotators manually labeling influential words, introducing subjectivity into the process. In the Tsetlin Machine-based method [
16], manually creating counterfactuals is essential to refine the logical rules learned by the model. Our method eliminates this requirement by automating the identification and quantification of spurious patterns. XAI-based explainability extracts important words directly from the models, while clustering and perturbation analysis quantify pattern influence. This approach makes spurious pattern detection more scalable and adaptable to different domains. However, as in any scientific study, analyzing the results may involve domain experts interpreting detected patterns and validating relevant findings. This step is not an operational dependency of the method but rather a common practice to ensure the robustness and applicability of conclusions in specific contexts.
9. Conclusions
This study proposes an agnostic method that combines XAI and unsupervised learning to detect, interpret, and quantify spurious patterns in binary NLP classification. Unlike existing methods, our approach introduces an automated, data-driven framework that identifies spurious correlations without the prior selection of candidate patterns. This flexibility allows the method to generalize across different datasets and classification models. The following items present the key innovations and contributions of this work.
Methodological Innovations:
Automatic detection of composite patterns: Identifies complex spurious patterns without predefined candidates or human intervention.
Centroid-distance metric: Measures model dependence on spurious features and improves cluster interpretability.
Empirical Contributions:
Model-agnostic detection: Applies to different classifiers without requiring model-specific adaptations.
Flexible pipeline: Allows for multiple feature importance and clustering techniques.
Datasets and results: Public Bidding and Contracts datasets and results are available for future comparisons.
Discussion: The results demonstrate that the proposed method effectively identifies spurious patterns across different classification models and datasets, reinforcing the hypothesis that patterns farther from cluster centroids exhibit a higher potential for spuriousness. Compared to traditional approaches that rely on predefined assumptions about feature importance, our method provides a fully automated and model-agnostic framework, reducing human intervention in the identification process. However, the dependence on clustering stability raises concerns about sensitivity to dataset-specific properties, particularly in scenarios where spurious patterns are more evenly distributed. While K-means effectively segments spurious and non-spurious patterns, its reliance on Euclidean distances may not fully capture the complexity of contextual relationships in textual data, suggesting that alternative clustering methods should be explored. Additionally, our findings align with previous studies indicating that linear models tend to overemphasize frequent patterns, while BERT captures more nuanced dependencies, yet still inherits biases from training data. These findings highlight the influence of explainability techniques on model bias detection. Since different XAI methods may assign varying importance to features, future research should investigate whether combining multiple interpretability techniques enhances the robustness of spurious pattern identification.
Additionally, we evaluate the methodology on the IMDB dataset, comparing its performance with existing approaches to assess generalization across different text classification tasks. The experimental setup introduces certain limitations that do not arise from the methodology itself, given its model-agnostic nature:
Sensitivity to K-means configurations: Identifying spurious patterns relies on K-means clustering, but determining the optimal number of clusters using the elbow method can be unstable due to data variations (a brief sketch of this selection step follows the list below).
Reliance on XAI models: Using LIME and intrinsic coefficients may introduce biases in the assignment of word importance.
Limitations on pattern size: The experiment analyzes patterns of up to three words (n = 3) and does not cover longer dependencies, potentially underestimating some spurious correlations.
Visualization constraints: Pattern validation relies on the manual analysis of tables and graphs, limiting expert assessment.
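For the K-means limitation noted in the list above, the following sketch shows the elbow-style inspection in question: inertia is computed for a range of cluster counts over a placeholder metric matrix, and the bend of the curve, which can shift with small changes in the data, guides the choice of k.

```python
# Minimal sketch of elbow-style selection of k. X_patterns stands in for the
# per-pattern metric matrix produced earlier; here it is random placeholder data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_patterns = rng.normal(size=(60, 5))  # placeholder for the per-pattern metrics

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_patterns)
    inertias.append(km.inertia_)

# Small changes in the data can shift where the curve bends, which is the
# instability discussed above; plotting k vs. inertia makes the elbow visible.
for k, inertia in zip(range(1, 9), inertias):
    print(k, round(inertia, 2))
```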
However, the methodology was designed to accommodate such adaptations, ensuring flexibility for future improvements. Building on these contributions, future research will explore the following.
- 1.
Refinement of the clustering step: Investigating additional unsupervised learning methods, including DBSCAN, Agglomerative Clustering, Hierarchical Clustering, and direct outlier detection techniques, to improve the detection of spurious patterns.
- 2.
Enhancement of explainability techniques: Comparing different XAI frameworks to assess their impact on the selection of spurious patterns and reduce potential biases in explainability methods.
- 3.
Scaling to longer pattern dependencies: Extending the analysis beyond three-word sequences to capture more complex spurious correlations in NLP models.
- 4.
Development of interactive visualization tools: Creating interfaces that allow domain experts to dynamically explore detected patterns and validate their impact on model decisions.
Additionally, this approach has potential applications beyond NLP. The ability to quantify model dependency on features without requiring labeled spurious patterns suggests its applicability in structured data analysis, feature selection, and explainability-driven model debugging. Future work will explore these possibilities and assess the method’s effectiveness in broader machine learning applications.