1. Introduction
Artificial Intelligence (AI) models, including Neural Networks and Support Vector Machines, are often considered black boxes due to their lack of interpretability, which raises reliability concerns. A key challenge is their tendency to learn spurious correlations—patterns that seem predictive but fail under distribution shifts [1]. These correlations frequently stem from dataset artifacts, such as selection bias or intentionally embedded manipulations known as backdoors [2]. Additionally, confounders—unintended variables that distort the relationship between features and output classes—can further mislead models, leading to systematic errors and reduced generalization [3]. In Natural Language Processing (NLP), models often depend on superficial patterns rather than genuine linguistic structures. This phenomenon, known as shortcuts, enables high accuracy during training but leads to poor generalization [4]. A well-known example is the Clever Hans effect, where models exploit irrelevant cues instead of learning the actual task [5]. Similarly, backdoors refer to deceptive biases—which may be intentionally inserted—that models leverage to manipulate predictions [6].
Wang et al. [4,7] reported an illustrative case where a sentiment classifier incorrectly associates the word “spielberg” with positive reviews. Instead of evaluating sentiment, the model learns that reviews mentioning “Spielberg” are frequently positive in the training set, leading to misclassification when the word is absent. This example highlights how models capture non-causal relationships, resulting in unreliable predictions.
Researchers have developed Explainable Artificial Intelligence (XAI) techniques to improve model interpretability and expose biases [8]. While not explicitly designed to detect spurious correlations, XAI methods help identify artifacts by highlighting highly weighted features lacking causal relevance [9,10,11]. Prior research has shown that spurious patterns are a significant source of model errors, mainly when data labels contain inconsistencies [12,13].
This work presents a method that combines XAI-based feature importance with unsupervised learning to cluster patterns logically and interpretably, enabling a structured analysis of their association with output classes. Our heuristic uses statistical metrics from model–data interactions to catalog misclassified instances, extract frequent words from similar sentences, and assess their impact. It perturbs the data by removing these words from test samples and evaluates changes in sensitivity and specificity. Unsupervised learning then clusters the detected patterns, uncovering their relationships and likelihood of being spurious. To validate our approach, we apply it to datasets of Public Procurement Contracts and Bidding Processes, which present challenges due to their formal structure and domain-specific vocabulary. We also conduct experiments on IMDB, assessing the method’s generalizability and comparing it with existing techniques.
Unlike previous approaches, which focus on individual words, our method detects multi-word patterns, capturing more complex spurious correlations. The novelty of our approach lies in three key aspects: it operates without human intervention, enabling the fully automated detection of spurious patterns; it introduces the distance to cluster centroids as a metric to gradually quantify spuriousness instead of a binary classification; and it identifies compound patterns, allowing for a more nuanced analysis of spurious dependencies. The main contributions of this work are as follows.
An automated, unsupervised approach that eliminates the need for the manual identification of candidate spurious patterns.
The clustering of spurious patterns into interpretable groups, introducing distance to cluster centroids as a metric for measuring spuriousness.
Real-world datasets of public procurement contracts for binary classification and spurious correlation analysis, available at the mentioned GitHub link.
These contributions set our method apart from existing techniques, providing a more comprehensive understanding of spurious correlations in text classification.
Although our focus is on detection, several techniques exist to mitigate spurious correlations, including statistical reweighting, regularization, and counterfactual generation [14,15]. By identifying spurious patterns, our approach lays the foundation for the application of such corrective strategies.
2. Related Work
Recent research explores methods to detect and mitigate spurious correlations in AI models, particularly in classification tasks. This section categorizes the key approaches into four distinct groups.
2.1. Counterfactual Generation
Counterfactual generation modifies input instances to determine whether a model depends on spurious correlations or meaningful patterns. Researchers apply this technique using different methodologies.
Wang and Culotta [4] generate counterfactuals by replacing specific words in the input. Their method uses statistical techniques to identify words that are highly correlated with model predictions and assesses their impact on classification. However, this approach focuses solely on surface-level modifications and does not ensure semantic coherence. Yadav et al. [16] introduce a Tsetlin Machine-based approach to generate counterfactual rules. Instead of modifying individual instances, their method learns logical patterns that differentiate classes, enabling counterfactual generation with structured rules. However, this approach faces challenges in generalizing learned patterns to complex contexts. Liu et al. [17] examine whether deep learning models perform logical reasoning or rely on statistical correlations. They construct a counterfactual dataset to test whether models maintain logical consistency when data structures change. Their results show that models exploit spurious patterns in training data and fail under systematic modifications. Veitch et al. [15] analyze counterfactual generation from a causal inference perspective. They argue that generating counterfactuals without considering causal structures may introduce new biases. They propose conditional independence-based regularization to approximate counterfactual invariance, demonstrating that different causal relationships require distinct mitigation techniques.
Counterfactual generation provides a method for assessing and improving model robustness. However, the reviewed studies highlight the need to ensure that counterfactuals maintain semantic validity and that modifications influence predictions meaningfully. Selecting a counterfactual generation method requires considering the causal structure of the data to avoid introducing artificial patterns that may induce biases.
2.2. Data Perturbation
Data perturbation alters textual inputs to assess model robustness. This technique introduces controlled modifications, such as word replacements and removals, syntax changes, and adversarial noise, to determine if a model relies on meaningful patterns or superficial statistical artifacts. Unlike counterfactual generation, which aims to create semantically valid alternative samples, data perturbation focuses on testing model sensitivity by systematically distorting input structures.
Wang and Culotta [7] employ perturbation techniques to replace words strongly correlated with the target label, evaluating the model’s reliance on specific lexical features. By altering these words, they assess whether models depend on linguistic cues or learn shallow correlations. Their findings reveal that simple word substitutions significantly affect model predictions, highlighting the presence of spurious dependencies. Wu et al. [14] propose an automated framework to generate and filter training samples, aiming to mitigate biases by removing instances that reinforce spurious correlations. Their method perturbs text by modifying task-independent features and evaluates the impact using z-statistics. The filtering process eliminates samples that exhibit strong correlations with non-task-related features, resulting in datasets that promote more generalizable models. However, this approach depends on prior knowledge of biased attributes, which may not be available in all domains. Zhang et al. [18] systematically analyze the effects of different perturbation strategies on neural NLP models. They introduce a learnability metric to measure how well a model adapts to specific perturbations. Their experiments show that models sensitive to perturbations—such as word scrambling, character alterations, and punctuation manipulations—tend to exhibit lower robustness. This suggests that models overfitting to dataset artifacts struggle to generalize to unseen variations. While their approach provides valuable insights, the computational cost of measuring learnability may limit its practical adoption.
Although counterfactual generation can complement perturbation-based approaches by introducing semantically coherent modifications, data perturbation primarily aims to stress-test models by introducing structured distortions. The reviewed studies demonstrate that perturbation techniques effectively uncover weaknesses in NLP models, though challenges remain in selecting meaningful perturbations and ensuring computational efficiency.
2.3. Explainable AI Techniques for Detecting Spurious Correlations
Explainable Artificial Intelligence (XAI) encompasses methods designed to make machine learning models more transparent by providing insights into their decision-making processes. While the primary goal of XAI is to enhance interpretability, several studies demonstrate that these techniques can also help detect spurious correlations by highlighting influential patterns in model predictions.
Two widely used explanation techniques are Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP). LIME approximates a black-box model by locally fitting an interpretable model, such as linear regression, to perturbed input versions. It assigns feature importance scores based on how modifications to input variables affect predictions [19]. Based on cooperative game theory, SHAP computes Shapley values to quantify the marginal contribution of each feature to a model’s prediction, ensuring a consistent and theoretically grounded attribution method [20].
Cardozo et al. [21] introduce Explainer Divergence Scores (EDSs) to measure how different explanation methods reveal dependencies on spurious patterns. Their findings suggest that while LIME and SHAP provide meaningful insights, they often highlight spurious features without explicitly distinguishing them from genuine predictive patterns. Soares et al. [22] propose combining XAI with human evaluation to filter misleading patterns in textual datasets. Their results indicate that LIME effectively pinpoints spurious dependencies by assigning high importance to features frequently associated with incorrect predictions. SHAP, in contrast, offers a more stable attribution across multiple samples, making it useful for detecting persistent spurious correlations.
However, challenges remain in distinguishing causal from non-causal dependencies. Chou et al. [11] argue that explanations might reinforce biases if not appropriately validated. They note that feature attribution methods frequently emphasize co-occurring patterns, which may not always reflect authentic causal relationships.
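As an illustration of how such feature attributions are obtained in practice, the sketch below applies LIME to a generic scikit-learn text pipeline; the toy corpus, the pipeline, and the number of features are assumptions for illustration, not the setup of the studies reviewed here.

```python
# Minimal sketch of LIME for text classification (assumed toy setup, not from the cited studies).
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data; any binary-labeled corpus would do.
texts = ["great movie", "terrible plot", "loved it", "boring and long"]
labels = [1, 0, 1, 0]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
# LIME perturbs the raw string and re-queries predict_proba to fit a local surrogate.
explanation = explainer.explain_instance(
    "loved the boring plot", pipeline.predict_proba, num_features=5
)
print(explanation.as_list())  # [(token, weight), ...] from the local surrogate model
```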
2.4. Other Techniques: Statistical Reweighting and Regularization Methods
In addition to the techniques presented, new strategies address spurious correlations in text classification. Two approaches are statistical reweighting and regularization methods, which aim to reduce the model’s dependence on misleading patterns by adjusting the training data distributions and modifying the learning process, respectively.
Statistical reweighting modifies the weight of training instances to counteract dataset biases. The technique adjusts sample importance to prevent spurious correlations from dominating model learning. Serrano et al. [23] formulate an optimization-based reweighting method that redistributes instance weights to mitigate lexical biases. Their study applies the method to natural language inference and duplicate-question detection tasks, where individual word occurrences should not influence predictions. Statistical reweighting simultaneously reduces thousands of spurious correlations and aligns label distributions more uniformly across multiple lexical features. Despite reducing biases at the data level, trained models retain significant biases. While unigram biases decrease, associations between more complex patterns (e.g., bigrams) intensify, showing that reweighting alone does not eliminate spurious dependencies. The study demonstrates that mitigating biases in NLP requires modifications beyond data reweighting, including adjustments to model architectures and learning strategies to address deeper correlations.
Regularization techniques adjust the training process to prevent models from depending on spurious patterns. Chew et al. [5] propose a family of regularization methods called NFL (doN’t Forget your Language), which limits changes in parameter updates to reduce overfitting to spurious correlations. Their approach applies neighborhood analysis to show how spurious correlations cluster unrelated words in the embedding space, causing models to misattribute importance to non-causal tokens. NFL addresses this issue through different constraints. NFL-F (Frozen) preserves original word representations by freezing pre-trained model parameters and fine-tuning only the classification head. NFL-CO (Constrained Outputs) maintains pre-trained semantic relationships by introducing a loss term that minimizes cosine distance between token representations before and after fine-tuning. NFL-CP (Constrained Parameters) prevents drastic shifts in learned representations by penalizing excessive parameter updates. NFL-PT (Prompt-Tuning) enables adaptation without modifying learned embeddings by using continuous prompts instead of fine-tuning the entire model. Their results show that NFL significantly improves model robustness against spurious correlations without requiring external datasets. However, they also note that no single regularization strategy is universally effective, and each variation balances generalization and robustness differently.
3. Theoretical Framework
3.1. Natural Language Processing (NLP)
Natural Language Processing (NLP) focuses on the interaction between computers and human language, particularly in text classification tasks. NLP models apply algorithms that assign texts to categories, aiming to maximize the probability of a correct prediction, defined as $\hat{y} = \arg\max_{y} f_{\theta}(x, y)$, where $x$ is the input sentence, $y$ is the predicted output, and $f_{\theta}$ represents the scoring function based on the parameters $\theta$ [24].
This study employs three models for binary classification. The first model is Logistic Regression (LRG), a linear classifier widely used in text classification due to its scalability and interpretability, which allow direct feature importance analysis [25]. However, its linear nature restricts its ability to capture complex word dependencies. LRG was selected for its intrinsic explainability, which aligns with our model-agnostic approach as a baseline for comparing the performance of more complex models.
The second model is a Support Vector Machine (SVM) optimized with Stochastic Gradient Descent (SGD), a linear classifier that maximizes the margin between decision boundaries by optimizing a hinge loss function [26]. Unlike traditional kernel-based SVMs, the SGD-trained SVM prioritizes efficiency over non-linearity, making it a practical choice for large-scale text datasets. SVM was selected due to its efficiency in handling high-dimensional text data. Prior studies [22] demonstrate that this model performs well in detecting spurious correlations in datasets similar to those used in this research.
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based architecture that processes textual data using self-attention mechanisms [27]. It captures bidirectional dependencies, providing a deeper contextual understanding of words and sentences. Pretrained on large corpora using masked language modeling, BERT undergoes fine-tuning to adapt to specific classification tasks. This study uses BERT in two roles: as a Word Embedding extractor for the linear models and as a binary classifier. By generating dense, contextually rich representations, BERT enhances the ability of linear models to identify relevant patterns. As a classifier, it allows an analysis of its sensitivity to spurious patterns compared to linear models. Additionally, its compatibility with the LIME post hoc explainability technique facilitates model interpretation, which is essential for this research.
Term Frequency–Inverse Document Frequency (TFIDF) represents text numerically by weighting term occurrences based on their relevance across a document set [28]. TF measures how often a word appears in a document, while IDF adjusts this frequency according to how commonly the word appears in the entire corpus. This study selects TFIDF because it enables linear models to extract relevant patterns, and prior research [22] showed its effectiveness in detecting spurious correlations in a dataset similar to the ones used in this study.
In this study, we generate Word Embeddings (WEs) using the BERTimbau model via the Sentence-BERT (SBERT) framework, which optimizes BERT for sentence-level embeddings by transforming each sentence into a dense vector representation. The SVM and LRG classifiers use these embeddings as input features. Since these models do not fine-tune embeddings, we keep them fixed during training. These embeddings encode semantic relationships, allowing linear models to make predictions based on richer text representations while maintaining interpretability [29].
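The following sketch shows how fixed sentence embeddings of this kind can feed a linear classifier; the SBERT checkpoint identifier and the toy sentences are assumptions, not the exact configuration used in this study.

```python
# Sketch of extracting fixed sentence embeddings for the linear classifiers.
# The model identifier is an assumed BERTimbau checkpoint loaded through SBERT.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import SGDClassifier

sbert = SentenceTransformer("neuralmind/bert-base-portuguese-cased")  # assumed checkpoint

sentences = ["aquisicao de materiais hospitalares", "contratacao de empresa de limpeza"]
labels = [1, 0]

# Embeddings stay fixed; only the linear classifier is trained on them.
embeddings = sbert.encode(sentences)          # one dense vector per sentence
clf = SGDClassifier(loss="hinge").fit(embeddings, labels)
print(clf.predict(sbert.encode(["aquisicao de luvas descartaveis"])))
```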
3.2. Explainable Artificial Intelligence (XAI)
Explainable Artificial Intelligence (XAI) techniques enhance the interpretability of AI models, making their predictions more transparent [8]. Although not their primary purpose, these techniques help identify spurious patterns and causal relationships [10,30]. XAI approaches can be local, such as Local Interpretable Model-Agnostic Explanations (LIME), which explains individual predictions by generating perturbed versions of an input instance and fitting a simple interpretable model to approximate the decision boundary of the complex model in that local region. Conversely, global methods, such as SHapley Additive exPlanations (SHAP), quantify feature importance across the entire model by computing Shapley values, which estimate each feature’s marginal contribution based on cooperative game theory [20].
3.3. Unsupervised Learning
Unsupervised learning is a machine learning approach that analyzes unlabeled data to identify inherent patterns and structures without predefined categories. Unlike supervised learning, which relies on labeled datasets, unsupervised learning uncovers relationships and clusters within the data based on similarity metrics [31]. This study applies K-means, the Elbow Method, and Principal Component Analysis (PCA) to explore the statistical properties of datasets and create logical and interpretable clusters. PCA reduces dimensionality by transforming correlated variables into uncorrelated principal components, preserving most variance, simplifying complex data analysis, and aiding in pattern identification [32]. The Elbow Method determines the optimal k value by evaluating the Within-Cluster Sum of Squares (WCSS) reduction to identify the balance between accuracy and complexity [33]. K-means partitions data into k clusters, minimizing the sum of squared distances between points and their centroids.
This study applies K-means to segment spurious patterns into clusters, providing a structured way to analyze their distribution. We choose K-means based on its ability to assign patterns to clusters using centroids that summarize the main characteristics of each group [34]. Furthermore, this study hypothesizes that patterns further from the centroid may exhibit properties associated with spurious correlations, as they deviate more from the core structure of the data. Similarly, the study explores whether outliers in the clustering process correspond to patterns that significantly influence model errors.
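A minimal sketch of how these three techniques fit together is shown below, assuming a synthetic metric matrix and a candidate range of k; the WCSS values returned by K-means drive the elbow decision.

```python
# Sketch: PCA for dimensionality reduction, the elbow method (WCSS) to pick k, then K-means.
# The data matrix and the candidate range of k are synthetic assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # stand-in for the pattern-metric matrix

X_reduced = PCA(n_components=3).fit_transform(X)

wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_reduced)
    wcss.append(km.inertia_)                 # within-cluster sum of squares

# The "elbow" is the k after which the WCSS reduction flattens out.
print(list(zip(range(1, 10), np.round(wcss, 1))))
```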
4. Method
The proposed method organizes the process into a pipeline comprising seven stages—Preprocessing, Train/Test Model, Extract Important Words, Analyze Errors, Extract Investigation Patterns, Extract Potential Spurious Patterns, and Cluster Patterns—as illustrated in Figure 1. Overlapping lines in the stages indicate cross-validation ($k = 5$) to evaluate the pipeline in different contexts. Each stage applies techniques based on the specific model, ensuring the approach remains model-agnostic. The concepts and notations follow the conventions defined by [4,35].
4.1. Datasets
We used datasets specific to the context of the TCE-PI, named (i) Contracts Dataset (C) and (ii) Bidding Dataset (L).
The Contracts Dataset [36] supports the development of an automatic classification model for public administration contracts, focusing on COVID-19-related expenditures. A team of 12 TCE-PI specialists manually labeled the dataset by analyzing contracts published in the Official Gazettes of Piauí between March and September 2020. Each specialist labeled a specific set of contract descriptions, classifying them into healthcare-specific acquisitions (label 1) and other procurements (label 0). To improve consistency, another auditor reviewed a sample of the annotations. When auditors disagreed, the chief auditor made the final decision, ensuring consensus in contentious cases. This structured process—combining individual annotations, cross-reviews, and expert arbitration—ensured a reliable and well-curated dataset.
For this study, the team expanded it with additional contract descriptions from Sistema Contratos—Web (https://sistemas.tce.pi.gov.br/contratosweb/, accessed on 19 March 2025), a TCE-PI platform where jurisdictional entities register contracts as the law requires. The specialists applied the same labeling process to these new entries, adding 1727 sentences.
Table 1 summarizes the dataset quantities before and after the expansion.
The Bidding Dataset is being developed to evaluate an architectural approach that identifies signs of fraud in public procurement, as described in [37]. It consists of procurement notices published between 2012 and 2023, recorded in the Licitações-Web (https://sistemas.tce.pi.gov.br/licitacoesweb/, accessed on 19 March 2025) system of TCE-PI. The system allows jurisdictional entities to register their contracts as required by current legislation. The data extraction process used regular expressions to isolate the section describing the procured object. To verify the accuracy of the extraction, we trained a BERT model specifically for this task. A panel of three public procurement specialists manually labeled the data into four main categories, as described in [38].
Table 2 shows the quantities of labeled data for each class at the time of dataset release for our research:
For this study, we expanded the original dataset using Active Learning [39] with specific adaptations for binary classification. We isolated sentences describing the procurement object. We adapted the classes by grouping “Engineering Works Contracting Services” and “General Services Contracting” as class 0 and “Acquisition of Permanent Goods” and “Acquisition of Consumable Goods” as class 1. We used the labeled dataset to train a BERT model, applying it to 200 additional sentences, which specialists reviewed and corrected. This cycle of classification and review continued until the quantities shown in Table 3 were reached.
We define the dataset as follows. Let $W$ be the set of words and symbols in a language, and let $S \subseteq W^{+}$ be the set of valid sentences, where $W^{+}$ represents the set of all finite non-empty sequences of elements from $W$. The dataset is partitioned into disjoint training and testing sets, $D_{train}$ and $D_{test}$, ensuring $D_{train} \cap D_{test} = \emptyset$. The classification task is defined over the label set $Y = \{0, 1\}$, where each sentence is assigned a unique label. The training and testing datasets are then given by $D_{train} = \{(s, y) \mid s \in S,\, y \in Y\}$ and $D_{test} = \{(s', y) \mid s' \in S,\, y \in Y\}$, where each sentence $s$ or $s'$ is uniquely associated with a single label $y$.
4.2. Preprocessing
For the LRG and SVM models, we applied text cleaning, tokenization, stop word removal, and TFIDF vectorization. For the BERT model, we performed subword tokenization using WordPiece, added unique tokens ([CLS] and [SEP]), applied padding to standardize text length, and used segment embeddings to differentiate text segments.
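A minimal sketch of this preprocessing for the linear models is given below; the cleaning regex and the NLTK Portuguese stop word list are assumptions standing in for the exact rules used in this work.

```python
# Sketch of the preprocessing used for the linear models; the cleaning rules and
# Portuguese stop word list are assumptions, not the exact pipeline of the paper.
import re
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)
stop_words = nltk.corpus.stopwords.words("portuguese")

def clean(text: str) -> str:
    text = text.lower()
    return re.sub(r"[^a-zà-ú\s]", " ", text)   # keep letters (including accents) only

corpus = ["Aquisição de materiais hospitalares.", "Contratação de empresa de limpeza!"]
vectorizer = TfidfVectorizer(preprocessor=clean, stop_words=stop_words)
X = vectorizer.fit_transform(corpus)            # sparse TFIDF matrix consumed by LRG/SVM
print(X.shape, vectorizer.get_feature_names_out()[:5])
```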
4.3. Train/Test the Model
We structured the training and testing process by implementing a k-fold cross-validation with k = 5, generating five independent folds. These folds functioned as fixed partitions of the dataset, ensuring that each instance participated in training and testing in different contexts. This approach was applied at all stages of the method, as per the overlapping lines shown in Figure 1.
For the LRG and SVM models, we optimized hyperparameters using Bayesian optimization [40], which was performed exclusively on the training set of each fold. Within each fold, we conducted an additional internal cross-validation using only the training data to guide the search for optimal hyperparameters. After selecting the best configuration, we trained the model on the entire training portion of the fold and evaluated it on its corresponding test set, ensuring a robust and unbiased assessment. For the BERT model, we applied fine-tuning, regularization, dropout, checkpoints, early stopping, pruning, and padding to enhance performance.
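The sketch below illustrates this nested scheme for the linear models, using BayesSearchCV from scikit-optimize as a stand-in for the Bayesian optimization step; the estimator, search space, and iteration budget are assumptions.

```python
# Sketch of the nested evaluation: an outer 5-fold split for assessment and an inner
# cross-validated Bayesian search performed on the training portion only.
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from skopt import BayesSearchCV

def nested_evaluation(X, y):
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        search = BayesSearchCV(
            LogisticRegression(max_iter=1000),
            {"C": (1e-3, 1e3, "log-uniform")},   # assumed search space
            n_iter=20, cv=3, random_state=0,
        )
        search.fit(X[train_idx], y[train_idx])   # hyperparameters chosen on training data only
        scores.append(search.score(X[test_idx], y[test_idx]))
    return scores
```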
We define the model trained on the training data as $M = f \circ g$, where $g$ converts a sentence $s$ into a vector representation $x$ and $f$ maps $x$ to a predicted class.
For the LRG and SVM models, g applies two distinct text representations: TFIDF and Word Embeddings. When using TFIDF, g transforms a sentence into a sparse vector based on term frequency-inverse document frequency weighting. When using Word Embeddings, g tokenizes the text and processes it through a pre-trained BERT model, generating dense vector representations that encode contextual information. These embeddings remain fixed during training and serve as input for the linear classifiers. The classification function remains a linear mapping, learning decision boundaries in both TFIDF and Word Embedding spaces. The training process optimizes f while keeping g fixed, ensuring that the classifiers leverage lexical and semantic information without modifying the underlying embeddings.
For the BERT model, g applies the BERT tokenizer and processes the tokenized text through transformer layers to generate contextual embeddings. Unlike linear models, where embeddings are pre-computed and static, BERT fine-tunes its parameters during training. The classification function in BERT consists of a dense network applied to the [CLS] token embedding, using sigmoid or softmax activation for classification. In this case, both f and g are optimized during training, allowing BERT to refine its representations based on task-specific data.
Another task in this step catalogs model errors by defining $E$ as the set of sentence-class tuples from the test set where the model made incorrect predictions, as shown in the following notation:

$E = \{ (s_e, y) \in D_{test} \mid M(s_e) \neq y \}$

Here, $s_e$ represents the error sentence, $y$ is its actual class, and $M(s_e)$ is the model’s prediction. This set includes only the tuples where $y$ differs from the prediction.
4.4. Analyze Model Errors
The error analysis identifies patterns that contribute to incorrect model predictions. We examine each error sentence ($s_e$) and similar sentences from the training set ($s_t$). We divide similar sentences into two categories: those with the same class label as $s_e$ (similar sentences from the same class) and those with the opposite class label (similar sentences from the opposite class). These sentences influence the model’s incorrect predictions [12,13]. Figure 2 illustrates this relationship, where $s_e$ is the error sentence, $s_t$ represents a similar sentence in the training set, and $sim(s_e, s_t)$ indicates the similarity value between $s_e$ and $s_t$. Similar sentences from the same class as $s_e$ are shown in orange, while sentences from the opposite class are in blue.
For the SVM and LRG models, we used cosine similarity with TFIDF and Word Embeddings. For the BERT model, we applied BERTScore [41], which incorporates the context of words through contextualized embeddings. The final result produces a structure $A = (A_0, A_1)$ with two dimensions corresponding to classes 0 and 1, as shown in the following notation:

$A_y = \{ (s_e, S^{=}_{s_e}, S^{\neq}_{s_e}) \mid (s_e, y) \in E \}$

Here, $s_e$ represents the error sentence from class $y$; $S^{=}_{s_e}$ denotes the set of sentences similar to $s_e$ within the same class:

$S^{=}_{s_e} = \{ s_t \mid (s_t, y) \in D_{train},\; sim(s_e, s_t) \geq \delta \}$

while $S^{\neq}_{s_e}$ represents the set of sentences similar to $s_e$ from the opposite class:

$S^{\neq}_{s_e} = \{ s_t \mid (s_t, 1 - y) \in D_{train},\; sim(s_e, s_t) \geq \delta \}$

The function $sim$ measures the similarity between $s_e$ and $s_t$, and $\delta$ represents the minimum similarity threshold.
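A simple sketch of how the same-class and opposite-class sets can be assembled with cosine similarity is shown below; the vector inputs and the threshold value are placeholders, and BERTScore would replace the similarity function for the BERT model.

```python
# Sketch of collecting similar training sentences for one error sentence, using cosine
# similarity over TFIDF (or embedding) vectors; the threshold value stands in for delta.
from sklearn.metrics.pairwise import cosine_similarity

def similar_sets(error_vec, train_vecs, train_labels, error_label, delta=0.5):
    """error_vec: shape (1, n); train_vecs: shape (m, n); train_labels: length-m sequence."""
    sims = cosine_similarity(error_vec, train_vecs).ravel()
    same, opposite = [], []
    for idx, (sim, label) in enumerate(zip(sims, train_labels)):
        if sim >= delta:
            (same if label == error_label else opposite).append((idx, float(sim)))
    return same, opposite   # the same-class and opposite-class sets for this error sentence
```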
4.5. Extracting Important Words
This phase aims to identify the most relevant words per class and their corresponding weights. We define specific requirements for the explainers, as shown in the following notation:

$\mathcal{E}(M, s) = (I_0(s), I_1(s)), \quad I_y(s) = \{ (t, w) \}$

Here, $\mathcal{E}$ represents the explainer that, given a model $M$ and a sentence $s$, generates sets of ordered pairs $(t, w)$. Each pair consists of a token $t$ and a weight $w$, where $w$ reflects the importance of $t$ for class $y$, providing insights into how $M$ makes predictions. For intrinsically explainable models, $\mathcal{E}$ refers to the model’s internal explanation mechanism; for example, in Logistic Regression models, the coefficients (weights) indicate the strength and direction of the association between a feature and the prediction [42]. For a sentence $s$, the process generates two sets: $I_0(s)$, which contains the most relevant tokens and their weights for class 0, and $I_1(s)$, which is the same for class 1.
To achieve this phase’s goal, we calculate the global importance of each token by aggregating its weights across sentences. Here, $W_y(t)$ represents the global weight of the token $t$ for class $y$. This process produces two sets, $G_0$ and $G_1$, which contain the most relevant tokens and their weights for classes 0 and 1, respectively.
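The sketch below shows one way to aggregate per-sentence explanation pairs into global token weights; summing the weights is an assumed aggregation choice, not necessarily the exact rule used in this work.

```python
# Sketch of turning per-sentence explanation pairs (token, weight) into global weights
# per class; the summation is an assumed aggregation choice.
from collections import defaultdict

def global_importance(per_sentence_explanations):
    """per_sentence_explanations: list of dicts {class_label: [(token, weight), ...]}."""
    global_weights = {0: defaultdict(float), 1: defaultdict(float)}
    for explanation in per_sentence_explanations:
        for label, pairs in explanation.items():
            for token, weight in pairs:
                global_weights[label][token] += weight
    # Keep the most relevant tokens per class, ordered by global weight.
    return {label: sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
            for label, weights in global_weights.items()}
```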
4.6. Extracting Investigation Patterns
In this study, we define investigation patterns as word combinations that frequently appear in error sentences and are considered important for the opposite class. If a pattern often occurs in sentences similar to the error sentence but belonging to the opposite class, we interpret it as the model possibly relying on it as a shortcut. Conversely, if a pattern appears in sentences similar to the error sentence within the same class, we interpret it as the model possibly using this pattern as a confounding factor.
We define the investigation patterns $P$ as word combinations as follows: (i) they appear in error sentences of a specific class and in similar sentences from the opposite class; (ii) they consist of the $m$ most important words for the class; and (iii) they are present in both error sentences and similar sentences from the same class and the opposite class. The formal definition of investigation patterns is given by

$P_y = \{ c \mid c \subseteq T^{y}_{m},\; c \subseteq s_e,\; \exists\, s \in S^{=}_{s_e}: c \subseteq s,\; \exists\, s' \in S^{\neq}_{s_e}: c \subseteq s' \}$

where $c$ represents a word combination $(t_1, \ldots, t_k)$, $c \subseteq T^{y}_{m}$ indicates that each word in the pattern is among the $m$ most important words for class $y$, $c \subseteq s_e$ means that the error sentence $s_e$ includes the pattern, and the last two conditions indicate that $c$ must appear in similar sentences from both the same class ($S^{=}_{s_e}$) and the opposite class ($S^{\neq}_{s_e}$).
This definition identifies patterns that may contribute to confusion between classes and considers them as potential artifacts or confounders. To measure the spuriousness degree of each pattern, in later steps, we calculate the frequency metrics of these patterns in different contexts: (i) error sentences in the same class and in the opposite class; (ii) similar sentences in the same class and in the opposite class. We also calculate the ratio metrics between these frequencies, as shown in Table 4.
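A sketch of this pattern-extraction logic is given below; the function name and data structures (token lists for the error sentence and its similar sentences) are illustrative assumptions.

```python
# Sketch of extracting investigation patterns: combinations of up to three top-m words
# that appear in the error sentence and in similar sentences of both classes.
from itertools import combinations

def investigation_patterns(error_tokens, same_class_sents, opposite_class_sents,
                           top_m_words, max_size=3):
    """same_class_sents / opposite_class_sents: iterables of token collections."""
    candidate_words = [t for t in set(error_tokens) if t in top_m_words]
    patterns = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(candidate_words), size):
            in_same = any(all(w in s for w in combo) for s in same_class_sents)
            in_opposite = any(all(w in s for w in combo) for s in opposite_class_sents)
            if in_same and in_opposite:
                patterns.append(combo)
    return patterns   # candidates later scored by frequency and perturbation metrics
```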
4.7. Extracting Potential Spurious Patterns
This step identifies potential spurious patterns by analyzing how removing the patterns selected in the previous step affects specificity and sensitivity metrics. The process perturbs the test set by removing each pattern $p$ and then evaluates the impact of this removal on the predictions of $M$.
Initially, we calculate an initial set of metrics $m$, including specificity, sensitivity, true positives, and true negatives, based on the evaluation of $M$ on $D_{test}$. Then, for each class $y$ and for each pattern $p \in P_y$, we create a perturbed version $D^{p}_{test}$ by removing $p$ from $D_{test}$. The model is then evaluated on $D^{p}_{test}$, producing a new set of metrics $m^{p}$. We represent the difference between the original metrics $m$ and the perturbed metrics as $\Delta = m^{p} - m$.
If $\Delta$ indicates an improvement in specificity or sensitivity, we consider the pattern $p$ potentially associated with a spurious correlation. In this case, we evaluate the impact of the pattern using the following metrics: the number of perturbed sentences, the perturbation rate, the number of correct predictions, and the accuracy rate, as shown in Table 4. The result of this step is a set of patterns $p$ accompanied by the calculated metrics. We integrate these metrics with those from the previous step to determine the spuriousness degree of each pattern, as explained below.
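The perturbation test can be sketched as follows; the vectorize callable and the whitespace-based word removal are simplifying assumptions for illustration.

```python
# Sketch of the perturbation test: remove a pattern's words from the test sentences,
# re-evaluate the model, and compare sensitivity/specificity with the original values.
from sklearn.metrics import confusion_matrix

def sens_spec(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tp / (tp + fn), tn / (tn + fp)        # sensitivity, specificity

def perturbation_effect(model, vectorize, test_texts, y_true, pattern):
    """vectorize: hypothetical callable mapping raw texts to model-ready features."""
    sens0, spec0 = sens_spec(y_true, model.predict(vectorize(test_texts)))
    perturbed = [" ".join(w for w in t.split() if w not in pattern) for t in test_texts]
    sens1, spec1 = sens_spec(y_true, model.predict(vectorize(perturbed)))
    # Positive deltas flag the pattern as potentially spurious.
    return sens1 - sens0, spec1 - spec0
```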
4.8. Clustering Patterns
In this step, we apply unsupervised learning to group patterns into logical and interpretable clusters, identifying latent structures and facilitating the visualization of data distribution. We consider the distance of patterns to centroids as the main factor for cluster interpretation [34]. We hypothesize that patterns farther from centroids (potential outliers) are more likely to be spurious.
We define an algorithm composed of steps to perform clustering and organize the visualization of clusters. Each step of this algorithm can be applied with different unsupervised learning techniques, providing flexibility and independence from specific methods. The input metrics for the algorithm are derived from the concatenation of the metrics for common patterns selected in the steps described in Section 4.5, Section 4.6 and Section 4.7. Table 4 summarizes the metrics used in cluster formation, where $p$ represents the pattern, $c$ represents the class for which it is considered important, and the metrics include importance weights, frequencies, ratios between frequencies in different contexts (error sentences and similar sentences), and perturbation quantities and rates. The steps of the algorithm are normalization, dimensionality reduction, determining the number of clusters, clustering, and visualization.
In this study, we applied Standard Normalization to standardize variables with different scales, ensuring comparability while also being sensitive to outliers [43]. For dimensionality reduction, we used PCA due to its ability to preserve most of the variance in the data. We used the elbow method to determine the number of clusters, clearly balancing granularity and cohesion. We selected K-means because it aligns with hypotheses related to the distance of patterns to centroids [34] and has been applied to identify patterns in datasets [44]. For visualization, we used 3D plots to highlight the overall shape of clusters and outliers and bar charts to emphasize the distance of patterns from centroids.
This algorithm analyzes hidden structures without requiring labels and highlights correlations among the input metrics. As discussed in later sections, the approach enables interactive visualization and uncovers unexpected associations.
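A compact sketch of this clustering stage, combining standardization, PCA, K-means, and the distance-to-centroid ranking, is shown below; the function and variable names are illustrative.

```python
# Sketch of the clustering stage: standardize the pattern metrics, project with PCA,
# cluster with K-means, and rank patterns by their distance to the assigned centroid.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def rank_by_centroid_distance(metric_matrix, pattern_names, k=2):
    X = StandardScaler().fit_transform(metric_matrix)
    X = PCA(n_components=3).fit_transform(X)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distances = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    order = np.argsort(-distances)       # farthest first: higher spuriousness potential
    return [(pattern_names[i], int(km.labels_[i]), float(distances[i])) for i in order]
```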
5. Model Configuration
This section presents the setup of the models described previously in Section 3. We employ Logistic Regression (LRG), Support Vector Machine with Stochastic Gradient Descent (SVM-SGD), and BERTimbau (BERT), as described below.
5.1. Text Representation
The models rely on two types of text representations: Term Frequency–Inverse Document Frequency (TF-IDF) and Word Embeddings (WE). To generate WE, the BERTimbau model, integrated with the Sentence-BERT (SBERT) framework, encodes each sentence into a dense vector optimized for sentence-level tasks. The embeddings remain fixed during training in the linear classifiers, serving as static input features. Since LRG and SVM do not adjust embeddings dynamically, they operate directly on these precomputed vectors. In contrast, BERT processes tokenized text and updates its embeddings through fine-tuning.
5.2. Linear Classifiers: LRG and SVM
The evaluation of LRG and SVM considers both TF-IDF and WE representations. The Bayesian Search algorithm optimizes hyperparameters, selecting the best value of $C$ for LRG and the corresponding regularization hyperparameter for SVM. A five-fold cross-validation strategy ensures robustness in model selection. Table 5 summarizes the configurations applied to these models.
The training process for LRG includes L2 regularization with a maximum of 1000 iterations. The SVM model applies a stochastic gradient descent (SGD) optimizer, ensuring efficient parameter updates.
5.3. Transformer-Based Classifier: BERT
The classification model fine-tunes BERTimbau for binary classification. Each sentence is split into smaller units (subword tokens), marked with special indicators ([CLS] and [SEP]), adjusted for length, and converted into numbers that the model can process. The model processes the tokenized input through transformer layers, extracting contextualized embeddings that capture semantic relationships between words. A fully connected classification layer receives these embeddings and applies a softmax activation function to compute the probability distribution over the two target classes. The model assigns the final class label based on the highest probability.
The training process optimizes model weights using the Adam optimizer with a learning rate of $2 \times 10^{-5}$. The training runs for three epochs with a batch size of 4. The training includes dropout regularization (0.1), early stopping, and model checkpointing to improve generalization and prevent overfitting. The model selection criterion prioritizes the highest F1-score during validation.
Table 6 presents the BERT configuration details.
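A sketch of this fine-tuning setup with the Hugging Face Trainer is shown below; the BERTimbau checkpoint identifier and the tokenized dataset objects are assumptions, while the hyperparameters mirror those reported in this section.

```python
# Sketch of the fine-tuning configuration described above using the Hugging Face Trainer;
# the checkpoint identifier and the tokenized datasets are assumptions.
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

CHECKPOINT = "neuralmind/bert-base-portuguese-cased"  # assumed BERTimbau identifier

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds)}

def fine_tune(train_dataset, eval_dataset):
    """train_dataset / eval_dataset are assumed to be tokenized torch datasets."""
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
    args = TrainingArguments(
        output_dir="bert-contracts",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,       # keep the checkpoint with the best F1
        metric_for_best_model="f1",
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, eval_dataset=eval_dataset,
                      compute_metrics=compute_metrics,
                      callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
    trainer.train()
    return trainer
```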
We executed the experiments in Python 3.9 using the Scikit-Learn 1.4.1, Torch 2.1.1, and Transformers 4.35.2 libraries, along with the pre-trained BERTimbau model [27].
6. Experiments and Results
This section presents the results of our analyses to evaluate the proposed method. We conducted experiments using 20 combinations of models, text representation methods (TF-IDF and WE), datasets (Contracts and Bidding), and classes (0 and 1). Each combination followed the configurations detailed in Section 5 to ensure consistency in the evaluation, but we detail only one cluster-interpretation scenario. This choice serves two purposes: (i) presenting all combinations would unnecessarily lengthen the document and introduce redundancy, reducing clarity; (ii) the selected scenario represents the general trends observed across all results and effectively illustrates the study’s hypotheses and objectives.
The preprocessing and Train/Test Model steps revealed direct relationships between the dataset characteristics, text representation, and model performance.
Table 7 summarizes the results obtained for the Contracts and Bidding datasets.
In the Contracts dataset, which exhibits lower lexical diversity (5662 distinct words) and shorter sentences (an average of 30.5 words), the performance of the models varied depending on the text representation. When using TFIDF, the linear models (LRG, SVM) achieved high accuracies of 94.58% and 94.20%, with sensitivities of 94.24% and 94.34% and specificities of 94.93% and 94.90%, respectively. However, with Word Embeddings, both models experienced a drop in performance: LRG reached 84.23% accuracy, 83.19% sensitivity, and 85.27% specificity, while SVM obtained 86.03% accuracy, 81.48% sensitivity, and 90.58% specificity. The BERT model, which inherently uses embeddings, maintained a strong performance, achieving 94.97% accuracy, 95.39% sensitivity, and 94.93% specificity. We observed a similar trend in the Bidding dataset, which has higher lexical diversity (7960 distinct words) and longer sentences (an average of 43.95 words). With TFIDF, the linear models performed well, reaching 97.92% (LRG) and 97.52% (SVM) accuracy, with sensitivities of 97.88% and 97.52% and specificities of 97.96% and 97.54%, respectively. However, with Word Embeddings, their performance declined: LRG achieved 93.16% accuracy, 93.71% sensitivity, and 92.63% specificity, while SVM obtained 93.80% accuracy, 94.36% sensitivity, and 93.26% specificity. The BERT model, leveraging its native embeddings, continued to perform well, with 97.58% accuracy, 97.61% sensitivity, and 97.55% specificity.
BERT effectively handles data with high lexical diversity and complex textual structures, maintaining strong performance across both datasets. LRG and SVM perform well with TFIDF in datasets with constrained vocabulary and shorter sentences but decline with Word Embeddings, struggling to fully exploit their richer semantic information.
The model error analysis highlighted specific patterns that contributed to misclassifications.
Table 8 presents the most representative sentence for each model and text representation in the Contracts dataset that influenced errors in the class Other procurements (0). The method identifies the representative sentence as the most frequently occurring similar sentence in the opposite class. The SVM and LRG models with Word Embeddings tended to confuse sentences related to procurement. In particular, the SVM model frequently misclassified sentences such as the example marked * in Table 8, suggesting that the embedding-based representation captured semantic relationships between different procurement categories but failed to correctly distinguish between them.
The BERT model struggled with sentences related to service contracting. The most representative error sentence (marked † in Table 8) indicates that the model may confuse similar contracts involving different types of products and services. On the other hand, the SVM and LRG models with TFIDF exhibited errors associated with frequent words. For instance, the LRG model had difficulty with sentences such as the one marked ¶, suggesting that highly recurrent terms like “contrato” (contract), “aquisição” (acquisition), and “municipal” (municipal) led to an overestimation of their importance.
These findings reinforce the need for further steps in the pipeline to mitigate these issues, particularly by identifying and addressing spurious correlations that mislead the models.
In the “Extract Important Words” step, we applied LIME for SVM (using TFIDF and Word Embeddings) and for BERT, given its strong performance in similar datasets [22]. For LRG, we used intrinsic coefficients to assess word importance. The analysis revealed differences in word importance assignment across models with distinct text representations. Figure 3 illustrates how importance and frequency correlate for the top 20 words in SVM with TF-IDF, SVM with Word Embeddings, and BERT, focusing on the “Other procurements” (0) class of the Contracts dataset (C).
SVM with TFIDF (Figure 3a) prioritized frequent words like “contratacao” (contracting) and “servicos” (services), highlighting its reliance on word frequency over semantic meaning. With Word Embeddings (Figure 3b), the emphasis shifted slightly, incorporating less frequent but contextually relevant terms like “social” (social) and “limpeza” (cleaning), which remain closely related to public procurement contexts; frequency bias nonetheless remained dominant, and the LRG model exhibited a similar tendency. BERT, using contextual embeddings (Figure 3c), emphasized less frequent but semantically meaningful words like “totem” (totem) and “dispensador” (dispenser), which were commonly purchased during the COVID-19 pandemic but were not exclusive to healthcare.
These results align with structural differences: linear models prioritize frequency, while BERT identifies semantic relationships. Word Embeddings in linear models capture some context but remain frequency-driven.
We do not present the results from the “Extract Investigation Patterns” and “Extract Potential Spurious Patterns” steps individually, as these results are closely integrated with the “Cluster Patterns” step, and presenting them separately would introduce redundancy and a lack of integration. We present the “Clustering Patterns” results applied to the SVM model with TFIDF on the Contracts dataset for the class “Other procurements” and compare them with SVM using Word Embeddings and BERT. These results reflect the general trend observed across combinations. In these steps, we used $m = 50$, corresponding to the top 50 most important words extracted in the previous step, a minimum similarity threshold $\delta$ set to ensure the selection of similar sentences in both classes, and a maximum pattern size of three words.
Figure 4 shows the heatmap illustrating the contribution of each metric to the three principal components based on PCA weights. Warmer colors (red shades) indicate positive contributions, with higher intensity representing a stronger influence on the component. Cooler colors (blue shades) represent negative contributions, with darker shades indicating a stronger negative influence. Neutral or near-zero values are shown in light colors, signifying a minimal impact.
6.1. Interpreting PCA and Clusters in the “Other Acquisitions” Class
For PC1, the variables with the highest contributions are the frequency in similar sentences from the same class (0.37), the frequency in similar sentences from the opposite class (0.38), the frequency in error sentences from the same class (0.34), the frequency in error sentences from the opposite class (0.38), the perturbation rate (0.37), and the perturbation count (0.37). These metrics indicate that PC1 is strongly associated with the frequency of patterns across different contexts (training and error) and the perturbations applied to the dataset. The proximity of the loading values suggests that PC1 captures an integrated behavior between frequency metrics and the effects of perturbations, revealing the model’s sensitivity to highly frequent patterns in training and testing and highlighting how perturbations directly influence frequency-related metrics.
In PC2, the most influential metrics are the ratio of frequencies in similar sentences from the opposite class and the same class (−0.46), the ratio of error frequencies from the opposite class and the same class (0.51), the accuracy improvement rate after perturbation (0.45), and the number of corrected predictions (0.44). PC2 captures the relative impact of correlations between classes and the performance gains achieved by removing the analyzed patterns. The presence of the ratio metrics suggests that PC2 reflects the model’s reliance on patterns influencing both classes, while the correction metrics indicate how these relationships can be leveraged to correct incorrect predictions. The balance between proportion metrics and correction metrics highlights the interaction between relative frequency and adjusted performance.
In PC3, the relevant metrics are the global weight of the pattern (0.38) together with two error-frequency metrics (−0.44 and 0.37). This principal component focuses on the influence of specific patterns with high frequencies in error sentences and their global importance. The simultaneous presence of the global weight and the error frequencies indicates that PC3 captures how individually important patterns contribute to classification errors, reflecting a more localized behavior in the error space.
In summary, (i) PC1 reflects the influence of pattern frequencies and perturbations on predictions; (ii) PC2 captures the relationships between relative frequencies and performance gains from pattern removal; and (iii) PC3 emphasizes the impact of specific patterns on classification errors.
Table 9 presents the contributions of metrics to the principal components (PC1, PC2, and PC3) for SVM-C0.
The cluster analysis tested the hypothesis that patterns farther from centroids have a higher potential for spuriousness due to their variability. To validate this, we analyzed the patterns within each cluster based on their projections (distance to the centroid) in the principal components (PC1, PC2, and PC3) and their respective metrics.
Figure 5 presents a three-dimensional plot of the clustering result obtained using K-means with k = 2 (determined from the elbow method for the three components).
6.1.1. Cluster 0 Analysis
Closest Pattern: The pattern “empresa” (company) has a distance of 0.192. This pattern shows balanced projections (relative to the centroid) across the three principal components, aligning with the average characteristics of the cluster. In PC1 (3.02), associated with frequency metrics, the pattern reflects moderate values for the frequencies in similar sentences from the same class and from the opposite class, indicating low frequency in similar sentences. In PC2 (0.95), which captures ratios between frequencies, the pattern shows consistent values for the ratios of frequencies in similar sentences and in error sentences, demonstrating stability in the relationship between frequencies of similar and error sentences. Finally, in PC3 (0.72), related to global weights and frequencies in error sentences, the global weight of the pattern exerts a moderate influence, indicating alignment with the average characteristics of the cluster.
Most Distant Patterns: The most distant patterns from the centroid displayed extreme projections. The pattern “servicos” (services), with a distance of 4.062, showed a high projection in PC1 (3.02) due to high frequencies in similar sentences and error sentences from the opposite class. In PC2 (3.76), this pattern depended on ratio metrics, mainly the ratio of frequencies in error sentences, indicating significant influence in error contexts. In PC3, the global weight reinforced the model’s reliance on this pattern. The pattern “contratacao empresa municipio” (contracting company municipality), with a distance of 4.270, exhibited high projections in PC3 (3.16), driven by the global weight of the pattern and the frequency in error sentences from the opposite class. In PC1 and PC2, the values were moderate, suggesting that the pattern’s influence is primarily concentrated in errors.
6.1.2. Cluster 1 Analysis
Closest Pattern: The pattern “materiais” (materials) shows a distance of 0.455 and typical values across the three principal components. In PC1 (−1.60), associated with frequency metrics, the pattern reflects typical values for the frequencies in similar sentences from the same class and from the opposite class. In PC2 (−0.28), related to frequency ratios and corrections, the pattern exhibits projections consistent with the ratios of frequencies in similar sentences and in error sentences, indicating low variability in these contexts. Finally, in PC3 (−0.03), which captures global weights and frequencies in errors, the pattern is influenced by its global weight, reinforcing its alignment with the cluster’s average characteristics.
Most Distant Patterns: The most distant patterns, “prestacao municipio” (municipal provision), with a distance of 3.33, and “servicos municipio” (municipal services), with a distance of 3.96, exhibited extreme projections. The pattern “prestacao municipio” stood out in PC2 (3.33) due to the strong influence of the ratios of frequencies in similar sentences and in error sentences. The pattern “servicos municipio” showed high values in PC2 (3.68), again influenced by these ratios, and in PC3 (1.19), related to the global weight of the pattern, highlighting its relevance in classification errors and indicating greater variability and potential spuriousness.
6.1.3. Visualization of Distances to Centroids
The evaluation of the relationship between components PC1, PC2, and PC3 provides a comprehensive perspective on the representativeness of patterns and their potential for spuriousness. However, the distance of patterns to centroids can serve as a metric to assess this potential. Patterns located farther from the centroids tend to be less representative of the cluster and, therefore, have a higher probability of being spurious correlations, as demonstrated in previous examples. This distance reflects the degree of deviation of a pattern from the typical cluster behavior, indicating its inconsistency within the grouping [34].
In the SVM-TFIDF results (Figure 6), the closest patterns in Cluster 0, “empresa” (company) and “contratacao empresa” (contracting company), exhibit a low global weight and reduced variations in the frequency ratios between classes and between errors, indicating their stable occurrence. Conversely, the most distant patterns, “contratacao empresa municipio” (contracting company municipality) and “servicos” (services), have high global weight values and significant discrepancies between the class and error ratios, suggesting their presence may bias classification. In Cluster 1, the closest patterns, “contratacao empresa materiais” (contracting company materials) and “contratacao materiais” (contracting materials), exhibit lower oscillation in the model influence indices, while the most distant ones, “servicos municipio” (municipal services) and “prestacao municipio” (municipal provision), show significant differences between these indices, indicating a strong model dependency on these patterns.
In the SVM-WE results (Figure 7), close patterns in Cluster 0, such as “contrato prestacao” (contract provision) and “presente servicos municipio” (present municipal services), exhibit moderate global weight values and a homogeneous distribution among classes, with slight variations in the class and error ratios. In contrast, the most distant patterns, “contratacao” (contracting) and “servicos necessidade municipio” (municipal services need), have high global weight values and significant discrepancies in the error ratio between classes, suggesting the model may use them in a biased manner. In Cluster 1, the closest patterns, “servicos” (services) and “municipio” (municipality), have a moderate impact on model decisions, while the most distant ones, “contratacao” (contracting) and “servicos municipio” (municipal services), show high global weight values and significant variations in the error ratio, indicating potential model overreliance on them.
In the BERT results (Figure 8), close patterns in Cluster 0, “contratacao empresa saude” (contracting company health) and “contratacao secretaria” (contracting secretariat), present low values for the global weight and frequency metrics, suggesting they are less determinant for classification. Conversely, the most distant patterns, “contratacao empresa” (contracting company) and “empresa” (company), have high global weight values and large oscillations between the class and error ratios, reinforcing the hypothesis that they are spurious patterns. In Cluster 1, the closest patterns, “contratacao saude” (contracting health) and “contratacao saude municipio” (contracting health municipality), exhibit a balanced distribution among classes. In contrast, the most distant ones, “servicos” (services) and “contratacao” (contracting), show the highest variations in the frequency and ratio metrics, suggesting their use in a non-causal manner by the model.
The analysis of distances to the centroid reveals that closer patterns exhibit a balanced distribution among classes and have a lower impact on model errors. In contrast, more distant patterns show high variation in frequency metrics and a more significant influence on model errors and perturbations. The analysis also shows that the smaller clusters in each model and text representation contain patterns with higher spuriousness potential, characterized by a greater average distance to the centroid, higher global weight, more significant variation between training and error frequencies, and a discrepancy between error and perturbation metrics. The pattern “contratacao” (contracting), found in the smallest clusters, is highly frequent in the dataset. However, its asymmetric distribution between classes leads the model to use it as a shortcut for classification. Additionally, its high global weight and large oscillations in perturbation metrics indicate that it is a determining factor in predictions, even when no clear semantic relationship exists with the actual class of the sentence.
Patterns with greater distances from the centroid demonstrate significant deviations from the typical behavior observed within the cluster. These patterns exhibit higher variability, with extreme projections in metrics associated with frequencies and global weights. Consequently, they present a higher likelihood of being classified as spurious. In contrast, patterns closer to the centroid align more consistently with the average metrics of the cluster, reflecting typical cluster characteristics and reduced variability. The analysis validates the hypothesis, as input metric evaluations indicate that closer patterns display a lower degree of dependency, corresponding to a reduced risk of spuriousness compared to the more distant patterns.
6.2. Potential Spurious Patterns Common to Models
Evaluating the relationship between components PC1, PC2, and PC3 provides a comprehensive view of patterns’ representativeness and spuriousness potential. However, the distance of patterns to centroids can serve as a metric for assessing this potential. This distance reflects the degree of deviation of a pattern from the typical behavior of the cluster, indicating its inconsistency within the grouping [
34]. Following this, we analyze patterns identified as potentially spurious by at least three model–text-representation combinations in the datasets. Unlike a simple frequency-based approach, this analysis considers the actual context in which these patterns appear, assessing their influence on classification errors.
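The cross-combination filter described above can be expressed as a simple counting step. The sketch below assumes each model–text-representation combination yields a set of far-from-centroid patterns; the combination names and sets are placeholders, not reported results.

```python
# Illustrative sketch: keep only patterns flagged as far from their centroid by at
# least three model–text-representation combinations. Sets below are placeholders.
from collections import Counter

flagged_by_combo = {
    "SVM-WE":  {"contratacao", "servicos", "aquisicao"},
    "BERT":    {"contratacao", "servicos", "fornecimento"},
    "model_3": {"contratacao", "municipio", "aquisicao"},
    "model_4": {"servicos", "aquisicao", "covid"},
}

counts = Counter(p for flagged in flagged_by_combo.values() for p in flagged)
common_spurious = sorted(p for p, c in counts.items() if c >= 3)
print(common_spurious)  # e.g., ['aquisicao', 'contratacao', 'servicos']
```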
Potential Spurious Patterns in the Contracts and Bidding Datasets
Table 10 presents the most common patterns located far from their centroids across models in the Contracts dataset. These patterns appear in varied contract descriptions, making them unreliable indicators of class distinction. In class 0, “municipio” (municipality) is frequently mentioned in administrative contexts unrelated to the class, leading to misclassification. Similarly, “servicos” (services), “contratacao” (contracting), and “contratacao empresa” (contracting company) occur in multiple contract types, confusing the models. In class 1, “aquisicao” (acquisition) is commonly found in procurement descriptions across different areas, making it an unstable predictive feature. “Fornecimento” (supply) and “covid” appear disproportionately in healthcare-related contracts but do not always define the contract type.
In the Bidding dataset, as presented in
Table 11, patterns such as “servicos” (services) and “contratacao” (contracting) are identified as potentially spurious in class 0. These terms appear broadly in different descriptions of bidding objects, leading to misclassifications. For instance, “contratacao empresa especializada” (contracting specialized company) is used in both services and goods procurement, making it unreliable for class differentiation. In class 1, “aquisicao” (acquisition) and “fornecimento referencia” (supply reference) frequently occur in multiple contexts, demonstrating their instability as classification features.
The results confirm that frequent and generic patterns have a higher potential to be spurious, especially when they appear in different contexts and contribute to model errors. Just like the term “Spielberg”, the patterns “servicos” (services), “aquisicao” (acquisition), and “fornecimento” (supply) occur both in similar sentences within the same class and in the opposite class, as well as in error sentences, influencing incorrect predictions. These terms repeatedly appear in different scenarios, and the model interprets them as strong predictors, even without a causal relationship with the class. Moreover, since they are farther from the centroids, as analyzed in previous sections, they exhibit greater variability, reinforcing this tendency and increasing the risk of spurious association.
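As a rough illustration of the frequency argument above, the following sketch (under assumed column names and toy data) computes how often a candidate pattern occurs in each class and among misclassified sentences; strongly asymmetric rates combined with a high presence in error sentences point to a possible shortcut.

```python
# Small sketch, under assumed column names, of the frequency check discussed above.
import pandas as pd

df = pd.DataFrame({
    "text": [
        "contratacao de servicos de limpeza",
        "aquisicao de servicos e material hospitalar",
        "contratacao de empresa para obras",
        "aquisicao de equipamentos de informatica",
    ],
    "label":    [0, 1, 0, 1],                 # gold class
    "is_error": [True, True, False, False],   # model misclassified this sentence
})

def pattern_profile(data: pd.DataFrame, pattern: str) -> pd.Series:
    """Occurrence rate of `pattern` per class and within error sentences."""
    has = data["text"].str.contains(pattern, regex=False)
    return pd.Series({
        "rate_class_0": has[data["label"] == 0].mean(),
        "rate_class_1": has[data["label"] == 1].mean(),
        "rate_errors":  has[data["is_error"]].mean(),
    })

print(pattern_profile(df, "servicos"))
```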
7. Experiment on the IMDB Dataset and Comparison with Reference Methods
This section evaluates the proposed method on the IMDB dataset, a widely used benchmark for sentiment classification in textual data. The dataset contains 50,000 movie reviews from the Internet Movie Database (IMDB), with 25,000 labeled as positive and 25,000 as negative [
45]. Researchers originally developed this dataset to train and test sentiment analysis models. This study applies the proposed approach to identify spurious patterns and measure its robustness by comparing its performance with reference methods.
Table 12 presents 25 patterns identified as potentially spurious for each class. The selection includes the five patterns farthest from the cluster centroids that were detected by at least four models; the Distance column reports the average distance of each pattern from its centroid. According to the results of the “Extracting Potential Spurious Patterns” step, these patterns influence the models to classify test sentences into the associated class, regardless of context. Their presence in training sentences can distort the classification, creating associations that do not represent genuine semantic features of the text.
Analyzing these patterns helps reveal biases in NLP models and assess if certain words act as decision shortcuts, ignoring sentence context. For illustration, we examine “movie” and “best”, identified as spurious by at least four models.
The term “movie” frequently appears in reviews as a neutral element, referring to the subject of analysis (the film). However, its high occurrence in Class 0 sentences suggests that models assign an undue negative influence to its presence. For example, the sentence “I do not care if some people voted this movie to be bad. If you want the truth, this is a very good movie. It has everything a movie should have. You really should get this one.” expresses a positive opinion about the film, contradicting the assigned negative classification. This behavior suggests that the model correlates the mere presence of the word “movie” with the negative class without considering the overall text semantics.
Additionally, when comparing “movie” with a similar pattern, “movie plot”, many sentences received a Class 0 label solely because they contained this pattern, even when the context positively evaluated the plot. This observation suggests that “movie” and “plot” together should not determine classification. For example, the text, “Note these comments are for people who have seen the movie. Vanilla Sky is a brilliant, complex, and thrilling movie that existentially explores exactly what the tagline says: love, hate, dreams, life, work, play, and friends. Maybe the movie plot can come into focus for confused moviegoers if one looks at it from a different angle, considering the following… This movie becomes a fascinating exploration of a human on a journey to find himself and what that means in today’s pop culture society.” received a Class 0 (negative) label just because it contained “movie plot”, despite including praise such as “brilliant, complex, and thrilling movie” and “fascinating exploration.” This result confirms that the combination “movie plot” may act as a shortcut for negative classification when it should not play a decisive role in the model’s decision.
The term “best” usually appears in positive reviews, making it expected in Class 1 sentences. However, its isolated presence can lead to incorrect classifications when used without context. For example, the sentence “The first part of Grease with John Travolta and Olivia Newton-John is one of the best movies for teens. This one is a very bad copy. The change is only in the sex. In the first one, the good one was Sandy. Here it is Michael. I prefer to watch the first Grease.” contains the term “best”, but expresses a negative critique by unfavorably comparing the analyzed movie to another. Assigning this sentence to Class 1 suggests that the model captures only the presence of the term “best” and ignores the complete argument. The comparison with the pattern “best good” reveals a similar behavior. Many sentences containing this word combination received incorrect classifications as Class 1 solely due to its presence, even when the text expressed criticism; the text “I debated quite a bit over what rating to give this one because it is my least favorite Herschell Gordon Lewis film so far other than The Gruesome Twosome, but it has the best acting I have seen in a Lewis film. However, we all know that is not saying much. Once the movie was done, I was happy because it felt like I had been sitting through a 4-hour movie, though it was only 82 min long. I am trying to see all of HGL’s films, which is probably the only reason to see this one. The gore is good as usual—the one thing that Herschell seemed to get right. The acting is just as bad as usual with one exception. That exception is Frank Kress. Now, would I say that he is a good actor? No way. But he is good compared to everyone else. The story is boring and flat and goes nowhere, and by the end, I did not care what happened, just so long as it ended” contains the words “best” and “good” but expresses a predominantly negative evaluation of the film.
The author highlights that although the film presents the “best acting” within the director’s body of work, this does not imply that the acting is good. Additionally, the critique emphasizes that the film is “boring and flat”, reinforcing that the positive classification does not reflect the actual intent of the review. This case demonstrates that the mere presence of terms associated with positivity can lead the model to spurious correlations, resulting in inaccurate classifications.
8. Comparison with Reference Methods
We compared our method with three reference approaches: Wang and Culotta [
7], Wang and Culotta [
4], and Yadav et al. [
16], by analyzing the results of experiments conducted on the IMDB dataset.
Table 13 summarizes the main methodological differences. Wang and Culotta [
7] automatically generate counterfactuals by replacing words with antonyms to assess model robustness. Although this approach identifies spurious patterns, it is limited to isolated words and requires human intervention to annotate spurious and causal features and evaluate the counterfactuals. Wang and Culotta [
4] rely on causal inference, analyzing words with spurious influence through sentence matching. However, the need for an initial manual labeling of words as genuine or spurious constrains its applicability. Finally, Yadav et al. [
16] use Tsetlin Machines to learn interpretable patterns through logical rules. However, the reliance on manually crafted counterfactuals and the restriction to unigrams limit its generalization.
In addition to the common words identified by our method, as shown in
Table 14, only our approach can detect compound patterns, such as “movie plot” and “bad movie plot”. These patterns provide a significant advantage in analyzing spurious correlations, revealing how models can learn non-causal shortcuts. For example, the inclusion of the term “movie” in the compound pattern “movie plot” suggests that the model may associate negative reviews with the joint presence of these words rather than the isolated meaning of “plot”. The detection of “bad movie plot” reinforces this hypothesis, indicating that the combination of terms may serve as a superficial cue for classifying a film as bad without considering the actual content of the review.
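One plausible way to surface such compound candidates, within the three-word limit reported later in the limitations, is to mine frequent n-grams from misclassified sentences. The sketch below uses scikit-learn's CountVectorizer with ngram_range=(1, 3) on placeholder reviews; it illustrates the idea and is not the exact extraction routine of the method.

```python
# Sketch of extracting candidate compound patterns (up to trigrams, matching n = 3)
# from a set of misclassified reviews. The example sentences are placeholders.
from sklearn.feature_extraction.text import CountVectorizer

error_sentences = [
    "the movie plot was confusing but the acting saved it",
    "a bad movie plot ruins an otherwise decent cast",
    "this movie has the best soundtrack of the year",
]

vec = CountVectorizer(ngram_range=(1, 3), stop_words="english", min_df=2)
X = vec.fit_transform(error_sentences)

# Rank n-grams by how many error sentences they appear in.
doc_freq = (X > 0).sum(axis=0).A1
candidates = sorted(zip(vec.get_feature_names_out(), doc_freq),
                    key=lambda t: -t[1])
print(candidates[:10])  # e.g., [('movie', 3), ('movie plot', 2), ...]
```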
The reliance on human intervention varies among the compared methods. The counterfactual-based method [
7] requires manual validation to ensure that words replaced with antonyms genuinely represent spurious patterns. The causal inference approach [
4] depends on human annotators manually labeling influential words, introducing subjectivity into the process. In the Tsetlin Machine-based method [
16], manually creating counterfactuals is essential to refine the logical rules learned by the model. Our method eliminates this requirement by automating the identification and quantification of spurious patterns. XAI-based explainability extracts important words directly from the models, while clustering and perturbation analysis quantify pattern influence. This approach makes spurious pattern detection more scalable and adaptable to different domains. However, as in any scientific study, analyzing the results may involve domain experts interpreting detected patterns and validating relevant findings. This step is not an operational dependency of the method but rather a common practice to ensure the robustness and applicability of conclusions in specific contexts.
9. Conclusions
This study proposes an agnostic method that combines XAI and unsupervised learning to detect, interpret, and quantify spurious patterns in binary NLP classification. Unlike existing methods, our approach introduces an automated, data-driven framework that identifies spurious correlations without the prior selection of candidate patterns. This flexibility allows the method to generalize across different datasets and classification models. The following items present the key innovations and contributions of this work.
Methodological Innovations:
Automatic detection of composite patterns: Identifies complex spurious patterns without predefined candidates or human intervention.
Centroid-distance metric: Measures model dependence on spurious features and improves cluster interpretability.
Empirical Contributions:
Model-agnostic detection: Applies to different classifiers without requiring model-specific adaptations.
Flexible pipeline: Allows for multiple feature importance and clustering techniques.
Datasets and results: Public Bidding and Contracts datasets and results are available for future comparisons.
Discussion: The results demonstrate that the proposed method effectively identifies spurious patterns across different classification models and datasets, reinforcing the hypothesis that patterns farther from cluster centroids exhibit a higher potential for spuriousness. Compared to traditional approaches that rely on predefined assumptions about feature importance, our method provides a fully automated and model-agnostic framework, reducing human intervention in the identification process. However, the dependence on clustering stability raises concerns about sensitivity to dataset-specific properties, particularly in scenarios where spurious patterns are more evenly distributed. While K-means effectively segments spurious and non-spurious patterns, its reliance on Euclidean distances may not fully capture the complexity of contextual relationships in textual data, suggesting that alternative clustering methods should be explored. Additionally, our findings align with previous studies indicating that linear models tend to overemphasize frequent patterns, while BERT captures more nuanced dependencies, yet still inherits biases from training data. These findings highlight the influence of explainability techniques on model bias detection. Since different XAI methods may assign varying importance to features, future research should investigate whether combining multiple interpretability techniques enhances the robustness of spurious pattern identification.
Additionally, we evaluate the methodology on the IMDB dataset, comparing its performance with existing approaches to assess generalization across different text classification tasks. The experimental setup introduces certain limitations that do not arise from the methodology itself, given its model-agnostic nature:
Sensitivity to K-means configurations: Identifying spurious patterns relies on K-means clustering, but determining the optimal number of clusters using the elbow method can be unstable due to data variations (a brief sketch of this selection step follows the list below).
Reliance on XAI models: Using LIME and intrinsic coefficients may introduce biases in the assignment of word importance.
Limitations on pattern size: The experiment analyzes patterns of up to three words (n = 3) and does not cover longer dependencies, potentially underestimating some spurious correlations.
Visualization constraints: Pattern validation relies on the manual analysis of tables and graphs, limiting expert assessment.
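For the K-means limitation noted in the list above, the following sketch shows the elbow-style inspection in question: inertia is computed for a range of cluster counts over a placeholder metric matrix, and the bend of the curve, which can shift with small changes in the data, guides the choice of k.

```python
# Minimal sketch of elbow-style selection of k. X_patterns stands in for the
# per-pattern metric matrix produced earlier; here it is random placeholder data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_patterns = rng.normal(size=(60, 5))  # placeholder for the per-pattern metrics

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_patterns)
    inertias.append(km.inertia_)

# Small changes in the data can shift where the curve bends, which is the
# instability discussed above; plotting k vs. inertia makes the elbow visible.
for k, inertia in zip(range(1, 9), inertias):
    print(k, round(inertia, 2))
```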
However, the methodology was designed to accommodate such adaptations, ensuring flexibility for future improvements. Building on these contributions, future research will explore the following.
- 1.
Refinement of the clustering step: Investigating additional unsupervised learning methods, including DBSCAN, Agglomerative Clustering, Hierarchical Clustering, and direct outlier detection techniques, to improve the detection of spurious patterns.
- 2.
Enhancement of explainability techniques: Comparing different XAI frameworks to assess their impact on the selection of spurious patterns and reduce potential biases in explainability methods.
- 3.
Scaling to longer pattern dependencies: Extending the analysis beyond three-word sequences to capture more complex spurious correlations in NLP models.
- 4.
Development of interactive visualization tools: Creating interfaces that allow domain experts to dynamically explore detected patterns and validate their impact on model decisions.
Additionally, this approach has potential applications beyond NLP. The ability to quantify model dependency on features without requiring labeled spurious patterns suggests its applicability in structured data analysis, feature selection, and explainability-driven model debugging. Future work will explore these possibilities and assess the method’s effectiveness in broader machine learning applications.