Article

Mitigating Algorithmic Bias Through Probability Calibration: A Case Study on Lead Generation Data

by Miroslav Nikolić 1, Danilo Nikolić 2,*, Miroslav Stefanović 2, Sara Koprivica 2 and Darko Stefanović 2
1 Open Institute of Technology, University of Malta, XBX 1425 Ta’ Xbiex, Malta
2 Faculty of Technical Sciences, University of Novi Sad, 21000 Novi Sad, Serbia
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(13), 2183; https://doi.org/10.3390/math13132183
Submission received: 11 May 2025 / Revised: 10 June 2025 / Accepted: 11 June 2025 / Published: 3 July 2025
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

Probability calibration is commonly utilized to enhance the reliability and interpretability of probabilistic classifiers, yet its potential for reducing algorithmic bias remains under-explored. In this study, the role of probability calibration techniques in mitigating bias associated with sensitive attributes, specifically country of origin, within binary classification models is investigated. Using a real-world lead-generation dataset of 2853 observations and 8 variables, characterized by substantial class imbalance, with the positive class representing 1.4% of observations, several binary classification models were evaluated and the best-performing model was selected as the baseline for further analysis. The evaluated models included Binary Logistic Regression with polynomial degrees of 1, 2, 3, and 4, Random Forest, and XGBoost classification algorithms. Three widely used calibration methods, Platt scaling, isotonic regression, and temperature scaling, were then used to assess their impact on both probabilistic accuracy and fairness metrics of the best-performing model. The findings suggest that post hoc calibration can effectively reduce the influence of sensitive features on predictions by improving fairness without compromising overall classification performance. This study demonstrates the practical value of incorporating calibration as a straightforward and effective fairness intervention within machine learning workflows.

1. Introduction

The increasing reliance on machine learning (ML) models for decision-making across diverse sectors such as finance, healthcare, marketing, and public policy has underscored the importance of fairness and ethical considerations in algorithmic deployments [1,2,3,4,5,6]. Despite remarkable advancements, these models are susceptible to embedding and perpetuating societal biases that originate from skewed training datasets. Sensitive attributes like country of origin, ethnicity, physical characteristics, or gender often play a disproportionately influential role in model outcomes, exacerbating existing inequalities and eroding trust in machine learning systems. This underscores a critical need for developing responsible artificial intelligence frameworks to ensure fair, transparent, and accountable decision-making [7].
Probability calibration, an important post-processing step in binary classification, addresses discrepancies between the predicted probabilities and empirical frequencies of outcomes [8]. Calibration enhances interpretability and decision reliability by aligning the probabilistic outputs of models with real-world observations. This is especially critical in applications where precise risk assessment and probabilistic accuracy are imperative [9]. In addition, recent research has established calibration as an important fairness criterion, particularly at the demographic group level [10,11]. Group calibration requires that predicted probabilities are accurate within each demographic subgroup, ensuring that a prediction of, for example, 70% probability has the same meaning regardless of an individual’s sensitive attributes [12,13].
While the relationship between calibration and fairness has been explored in terms of group-level metrics and error rate parity [10,11], the specific impact of calibration on the predictive marginal contributions of individual features—particularly sensitive attributes—remains largely unexplored in the literature. This gap is significant because understanding how calibration affects feature-level influences could provide new insights into bias mitigation strategies. Building on theoretical foundations that suggest that calibration can help align predictions across groups [12] and reduce certain forms of predictive bias [13], this study hypothesizes that calibrated models inherently reduce the disproportionate influence of socially sensitive features by recalibrating their predictive significance. This novel perspective extends beyond traditional group-level fairness metrics to examine feature-level effects, thus contributing to a more nuanced understanding of how calibration techniques can diminish algorithmic bias.
In this study, the effectiveness of probability calibration in mitigating feature-level biases on a practical dataset from a marketing agency was empirically investigated. The agency employs predictive models to select potential clients based on various attributes, including sensitive societal variables such as country of origin. To the authors’ knowledge, this represents one of the first empirical investigations into how calibration techniques affect the marginal contributions of sensitive features in predictive models. Through analysis, it is demonstrated that calibration techniques not only improve predictive accuracy but also significantly reduce biases, promoting equitable outcomes without compromising performance.
Beyond the Introduction and Conclusion, this paper is organized as follows: Section 2 presents a literature review on algorithmic bias, fairness metrics, and calibration techniques in binary classification. Section 3 describes the dataset, preprocessing procedures, and the experimental pipeline, including model selection and fairness evaluation metrics. Section 4 details the results of model training, performance evaluation, and the effects of post hoc calibration on both prediction accuracy and fairness indicators. Finally, the conclusion summarizes the key findings, emphasizing the importance of probability calibration for reducing bias in machine learning models used in decision-making processes.

2. Literature Review

This section provides an overview of existing research in the field of algorithmic bias, fairness metrics, and probability calibration techniques in binary classification models.

2.1. Societal Bias in Algorithmic Decision-Making

Algorithmic bias has emerged as a central ethical concern within machine learning-driven decision processes. In [14], it is highlighted how algorithms trained on historical data often reflect and reinforce the societal biases embedded in that data. Also, the term “weapons of math destruction” emphasizes the harmful societal impact of opaque and biased algorithms [15]. Bias in algorithmic systems can perpetuate discriminatory practices, leading to unequal opportunities and adverse outcomes for marginalized communities [16]. These concerns necessitate rigorous bias detection and mitigation strategies to ensure algorithms serve diverse populations equitably.
Scientific research has provided numerous empirical examples of algorithmic bias in practical applications, particularly within computer vision systems. For instance, recent studies have identified significant biases in facial recognition technologies used in critical domains such as law enforcement and healthcare. It was also demonstrated that popular facial recognition systems exhibited substantial racial and gender biases, with error rates significantly higher for darker-skinned and female individuals compared to lighter-skinned and male counterparts [17]. Similarly, in [18], a comprehensive evaluation of widely used commercial facial analysis systems was conducted, revealing notable discrepancies in accuracy across demographic groups. These discrepancies underscore the potential for adverse impacts in sensitive use cases such as security screening and medical diagnostics. These biases not only undermine fairness and equity but also compromise the reliability and public acceptance of artificial intelligence systems.

2.2. Fairness Metrics

To measure and quantify fairness, several underlying metrics have been developed:
  • Demographic parity demands equal probability of positive outcomes across groups, irrespective of actual outcomes [19]. Formally, demographic parity is defined as:
    $P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b), \quad \forall a, b \in A$,
    where $\hat{Y}$ represents predicted outcomes and $A$ represents the sensitive attribute with possible values $a, b \in A$;
  • Equalized odds represent another prevalent fairness criterion that was introduced in [16]. It requires equal true-positive and false-positive rates across groups, formally expressed as:
    $P(\hat{Y} = 1 \mid Y = y, A = a) = P(\hat{Y} = 1 \mid Y = y, A = b), \quad \forall y \in \{0, 1\}$;
  • Predictive parity focuses on calibration fairness and mandates equal predictive accuracy across groups, thus ensuring that predictions reflect true likelihoods consistently among demographics:
$P(Y = 1 \mid \hat{Y} = 1, A = a) = P(Y = 1 \mid \hat{Y} = 1, A = b)$;
Balancing these criteria often introduces trade-offs, meaning that simultaneously achieving multiple fairness objectives can be fundamentally challenging [20]. This was illustrated through the context of recidivism prediction, showing how simultaneously ensuring predictive parity and equal error rates (false positives and false negatives) across demographic groups is mathematically impossible when base rates differ [19]. Authors in [20] similarly assert that fairness constraints inherently involve trade-offs; achieving equal false-positive rates, equal false-negative rates, and calibration across groups simultaneously is not achievable without significantly compromising accuracy or introducing randomness into the decision-making process. These theoretical results stress the complexity and nuanced nature of fairness in machine learning, indicating that context-specific trade-offs must be carefully considered to guide ethical machine learning system design.
These metrics form the foundation for evaluating algorithmic fairness, guiding efforts to mitigate disparities in model outcomes. Building on this foundation, the next section explores probability calibration techniques as a post-processing strategy for improving model reliability and fairness.
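For illustration, the three group-fairness criteria above can be computed directly from hard predictions and a sensitive-attribute column. The following is a minimal Python sketch, not taken from the original study; y_true, y_pred, and group are assumed NumPy arrays of true labels, predicted labels (0/1), and sensitive-attribute values, respectively.

```python
import numpy as np

def group_rates(y_true, y_pred, group, value):
    """Selection rate, TPR, FPR, and precision for one sensitive-attribute value."""
    m = group == value
    yt, yp = y_true[m], y_pred[m]
    selection_rate = yp.mean()                                       # P(Y_hat = 1 | A = value)
    tpr = yp[yt == 1].mean() if (yt == 1).any() else float("nan")    # P(Y_hat = 1 | Y = 1, A = value)
    fpr = yp[yt == 0].mean() if (yt == 0).any() else float("nan")    # P(Y_hat = 1 | Y = 0, A = value)
    ppv = yt[yp == 1].mean() if (yp == 1).any() else float("nan")    # P(Y = 1 | Y_hat = 1, A = value)
    return selection_rate, tpr, fpr, ppv

# Demographic parity compares selection rates, equalized odds compares TPR and FPR,
# and predictive parity compares precision (PPV) across the sensitive-attribute values.
for a in np.unique(group):
    print(a, group_rates(y_true, y_pred, group, a))
```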

2.3. Calibration Techniques in Binary Classification

In this section, widely used probability calibration techniques in binary classification are introduced, emphasizing their strengths, limitations, and evaluation metrics.

2.3.1. Calibration Algorithms

Probability calibration is a vital post-processing step in binary classification that corrects discrepancies between a model’s predicted probabilities and the true empirical frequencies of outcomes. Many machine learning models, including Logistic Regression, support vector machines, and ensemble methods like Random Forest or gradient boosting methods, produce miscalibrated probability estimates that do not accurately reflect real-world likelihoods [10,21]. This miscalibration can compromise decision-making, particularly in high-stakes applications such as marketing, where precise risk assessments influence client selection, or in healthcare and finance, where reliability is paramount [18]. Calibration techniques address this issue by adjusting predicted probabilities to align with observed outcomes, enhancing model interpretability and trustworthiness.
Calibration techniques can be classified into parametric and non-parametric categories. Parametric methods impose a specific functional form on the calibration process, while non-parametric methods offer flexibility by avoiding such assumptions. Below, three widely adopted techniques were examined—Platt scaling, isotonic regression, and temperature scaling—focusing on their strengths and limitations.
  • Platt scaling, proposed by [22], is a parametric method that transforms a model’s raw output scores (logits) into calibrated probabilities using a Logistic Regression. It models the relationship between logits and true probabilities as a sigmoid function:
    $p(x) = \dfrac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$,
    where x represents the model’s logit output and β i represent the variables’ parameters. Platt scaling is computationally efficient and particularly suited for models like support vector machines, which do not natively produce probabilities [22]. Its simplicity makes it widely applicable, but its reliance on a logistic assumption limits its effectiveness when the true relationship between logits and probabilities is non-linear or complex. In the dataset used in this research, where feature interactions may deviate from logistic patterns, Platt scaling serves as a baseline method but may require supplementation with more flexible approaches [23].
  • Isotonic regression, introduced by [24], is a non-parametric calibration method that fits a piecewise constant and monotonically increasing function to transform predicted scores into calibrated probabilities. It minimizes the squared error between the calibrated probabilities and true outcomes, subject to a monotonicity constraint:
    $\min \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 \quad \text{subject to} \quad \hat{y}_1 \le \hat{y}_2 \le \cdots \le \hat{y}_m$,
    where y i are the true binary labels, and y ^ i are the calibrated probabilities for m instances. This method excels in capturing non-linear relationships, making it highly adaptable to diverse datasets [24]. In practice, isotonic regression has demonstrated superior calibration performance compared to parametric methods, especially in settings with sufficient data to avoid the “curse of dimensionality”, i.e., to avoid overfitting due to the higher dimensionality of a dataset [23]. In this research, isotonic regression’s flexibility was advantageous, as it was well adjusted to complex patterns involving sensitive attributes like country of origin, potentially reducing their biased influence.
  • Temperature scaling, popularized by [21] in the context of deep learning, is a parametric method that adjusts logits using a single scalar parameter, T, before applying the softmax function. Hence,
    $p(x) = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x)/T}}$,
    where T > 1 softens the probability distribution, reducing the model’s overconfidence in probability estimations, while T < 1 sharpens it. The parameter T is often optimized on a validation set to minimize a loss function, such as negative log-likelihood. Temperature scaling is computationally lightweight and preserves the rank order of predictions, making it particularly effective for neural networks. One of the advantages of this calibration approach is its effectiveness in calibrating multiclass probabilities. A minimal implementation sketch of all three calibration methods is provided below.
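The sketch below applies the three techniques as post hoc wrappers around an already-fitted binary classifier. It is illustrative only: base_model, X_cal, and y_cal (a NumPy array of 0/1 labels for a held-out calibration set) are assumed placeholders, not objects from the original study.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Uncalibrated scores of a fitted classifier on a held-out calibration set
raw_scores = base_model.predict_proba(X_cal)[:, 1]
eps = 1e-12
clipped = np.clip(raw_scores, eps, 1 - eps)
logits = np.log(clipped / (1 - clipped))          # scores in log-odds space

# Platt scaling: logistic regression fitted on the raw scores (parametric)
platt = LogisticRegression()
platt.fit(raw_scores.reshape(-1, 1), y_cal)
platt_probs = platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

# Isotonic regression: monotone, piecewise-constant mapping (non-parametric)
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_scores, y_cal)
iso_probs = iso.predict(raw_scores)

# Temperature scaling: a single scalar T chosen to minimize negative log-likelihood
def nll(T):
    p = np.clip(1.0 / (1.0 + np.exp(-logits / T)), eps, 1 - eps)
    return -np.mean(y_cal * np.log(p) + (1 - y_cal) * np.log(1 - p))

T_opt = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
temp_probs = 1.0 / (1.0 + np.exp(-logits / T_opt))
```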

2.3.2. Evaluation Metrics for Probability Calibration

To assess calibration quality, standard metrics include the Brier score (BS) and expected calibration error (ECE).
The BS quantifies the mean squared difference between predicted probabilities and actual outcomes:
$BS = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$;
It ranges from 0 to 1, with 0 being the best and 1 the worst score in terms of calibration. A limitation of the BS arises with highly imbalanced datasets: a low BS indicates good calibration overall, but not necessarily for the minority class. One possible solution is to compute a class-level BS:
$BS_{\text{positive class}} = \frac{1}{m_{\text{positive class}}} \sum_{i=1}^{m_{\text{positive class}}} (\hat{y}_i - y_i)^2$,
for the positive class, and:
$BS_{\text{negative class}} = \frac{1}{m_{\text{negative class}}} \sum_{i=1}^{m_{\text{negative class}}} (\hat{y}_i - y_i)^2$,
for the negative class [23].
The ECE measures the average discrepancy between predicted confidence and empirical accuracy across probability bins:
$ECE = \sum_{j=1}^{B} \frac{|B_j|}{m} \left| acc(B_j) - conf(B_j) \right|$,
where $B_j$ denotes the j-th bin, $|B_j|$ is the number of instances in that bin, while $acc(B_j)$ and $conf(B_j)$ are the accuracy and average predicted confidence in that bin, respectively [23].
These metrics are critical for evaluating how well calibration techniques align probabilities with true outcomes in a dataset, providing a foundation for subsequent bias reduction analysis.
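Both metrics are straightforward to compute from predicted probabilities and labels. The following minimal sketch, assuming NumPy arrays probs and y_true and an illustrative choice of 10 equal-width bins (the bin count is not specified by the authors), implements the two equations above.

```python
import numpy as np

def expected_calibration_error(y_true, probs, n_bins=10):
    """ECE with equal-width probability bins, as in the equation above."""
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)  # bin index per prediction
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        conf = probs[mask].mean()              # average predicted confidence in bin B_j
        acc = y_true[mask].mean()              # empirical frequency of positives in bin B_j
        ece += mask.mean() * abs(acc - conf)   # weighted by |B_j| / m
    return ece

def stratified_brier(y_true, probs):
    """Brier score computed separately for the positive and the negative class."""
    pos, neg = y_true == 1, y_true == 0
    return (np.mean((probs[pos] - y_true[pos]) ** 2),
            np.mean((probs[neg] - y_true[neg]) ** 2))
```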

2.4. Impact of Calibration on Feature Bias

The recent literature [25,26,27] suggests that probability calibration improves a model’s predictive reliability and contributes to mitigating bias, primarily by aligning predicted probabilities across demographic groups (e.g., nationality, race, country of origin). In [25], it is demonstrated that, for instance, ensemble-based learning methods achieve fairer predictions on a group level once calibrated, highlighting how accurate probability estimates reduce disparities in aggregate outcomes. Likewise, in [26], authors integrate an explicit fairness objective into decision tree induction through their “Fair Forests” framework, contending that well-calibrated probabilities provide a robust foundation for implementing bias-sensitive splits and leaf-level predictions. Furthermore, according to [27], there is empirical evidence that fairness-aware threshold adjustments on calibrated outputs can substantially diminish the disparate impact across demographic groups, often without compromising overall predictive performance.
The synergy between calibration and fairness arises from the principle that well-calibrated models produce probability estimates more closely reflecting true empirical frequencies. By doing so, these models inherently curtail the excessive weight assigned to certain features, especially those correlated with protected attributes.
However, while these studies underscore the importance of calibration for achieving group-level fairness, they do not explicitly investigate whether calibration reduces the marginal influence or predictive importance of individual sensitive features in model outputs. This gap highlights the need for further research into feature-level effects of calibration, which is the focus of the current study.
Hence, the key contribution of this paper lies in empirically demonstrating how post hoc calibration techniques can reduce the marginal influence of sensitive attributes, thereby contributing a novel quantitative perspective to the discourse on algorithmic fairness.
Through empirical evaluation on the dataset—where sensitive features potentially skew decisions—it is demonstrated that calibration effectively decreases the undue predictive influence of these attributes, thus serving as a powerful lever for bias mitigation. The findings underscore the importance of integrating probability calibration into binary classification algorithmic pipelines as a means to foster more equitable and trustworthy decision-making.

3. Methodology

This section describes the dataset, preprocessing steps, model selection, and evaluation metrics used in the experimental analysis.

3.1. Dataset Characteristics

The dataset under consideration originates from Degordian [28], a marketing and analytics agency, encompassing 2853 observations collected from 17 March 2018 to 27 January 2023. Each observation represents an individual prospect (usually businesses) reaching out to the agency for potential collaboration on various marketing and analytics services.
The dataset comprised 7 independent variables and 1 dependent variable, organized as follows:
  • Independent variables:
    • form (nominal variable) indicates the channel through which the prospect contacted the agency, with possible values including email or contact form.
    • country_of_origin (nominal variable) captures the prospect’s self-reported location, restricted to one of six countries: Serbia, Slovenia, Montenegro, Croatia, Germany, or Italy.
    • email (nominal variable) is the email address of the prospect;
    • days_to_contact (ratio variable) specifies the number of days elapsed between the prospect’s first recorded visit to the agency’s website and the moment they initiated contact (via a form or email). A value of zero indicates that the visit and the contact occurred on the same day.
    • logins (ratio variable) represents how many times the prospect logged into a free account on the agency’s platform before making a formal contact (through email or contact form). A value of zero implies that the user did not log in at all.
    • number_of_visits (ratio variable) specifies the number of visits to the agency’s website before making a formal contact.
    • projects (nominal variable) denotes potential projects or services (e.g., SEO, web analytics, online advertising) for collaboration that the prospect stated in the contact form or email.
  • Dependent variable:
    • collaboration (ordinal variable): This variable has binary labels that reflect historical outcomes linking a prospective contact to a successful partnership. The dataset is very unbalanced, with successful collaboration present in only 1.4% of observations.
The dataset was compiled from several internal systems at the agency: website analytics logs capturing initial visits, user account databases recording login activity, and records of direct form submissions or email exchanges. A cross-referencing process then aligned each prospect’s visit and login data with their eventual outcome, thereby ensuring accuracy and consistency across the integrated dataset.

3.2. Data Pre-Processing

The dataset examined was devoid of missing values or duplicated rows. It comprised nominal variables (form, email, country of origin, and projects) that were encoded as follows:
  • form was binary encoded as 1 when it was an email and 0 otherwise.
  • email categories were classified as business (1) or private (0) based on their domain, where emails from known consumer service providers (Gmail, Outlook, Yahoo, Hotmail, AOL, iCloud, and other free email services) were coded as private (0). All other domains, including corporate, educational, government, and organizational domains, were coded as business (1). This classification was performed by extracting the domain portion of each email address and comparing it against a predefined list of consumer email providers.
  • The country_of_origin variable was encoded using OneHot encoding, which created binary dummy variables for each unique country in the dataset. This transformation converted the categorical country variable into k-1 binary columns (where k represents the number of unique countries), with each column indicating the presence (1) or absence (0) of a specific country in the dataset. One country category was dropped as the reference group to avoid multicollinearity in the regression analysis [29].
  • The projects variable was encoded with cardinality encoding by transforming each row’s multi-valued category set into a single integer that represents the total number of distinct categories in that row (range: 0 to 41). Cardinality encoding was chosen for the projects variable to avoid the curse of dimensionality that OneHot encoding would create (41 columns plus their combinations for multi-label data) while maintaining better interpretability than binary encoding. Binary encoding would transform project combinations into binary numbers (e.g., 101101), making it difficult to interpret the relationship between encoded values and actual project selections. Cardinality encoding preserves a meaningful business interpretation—the count of selected projects directly represents the scope and complexity of the prospect’s needs, where higher values indicate more comprehensive service requirements.
The dataset also included 3 predictors measured on ratio scales:
  • days_to_contact (range: 0 to 225);
  • logins (range: 0 to 24);
  • number_of_visits (range: 3 to 746).
The distribution of the binary encoded nominal variables in the dataset is as follows: for forms, 2279 instances were encoded as 1 and 574 instances were encoded as 0; for emails, 2238 instances were marked as 0 and 615 instances were labeled with 1. As shown in Table 1, all ratio variables are right-skewed.
During the dataset preparation, aside from the encoding of the nominal variables, data normalization was employed prior to testing Binomial Logistic Regression and Polynomial Logistic Regression algorithms. The normalization was conducted as follows:
$\frac{X_j - \bar{X}_j}{IQR}$,
where $X_j$ specifies the input j, $\bar{X}_j$ specifies the median value of the input j, and IQR represents the interquartile range. During the training phase of the tree-based algorithms, variables were used in their original scales.
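A minimal preprocessing sketch following the encodings described above is given below. It is an illustration under assumptions: the column names match Section 3.1, the consumer-domain list is abbreviated, and projects is assumed to be stored as a comma-separated string; this is not the agency’s actual pipeline.

```python
import pandas as pd

CONSUMER_DOMAINS = {"gmail.com", "outlook.com", "yahoo.com",
                    "hotmail.com", "aol.com", "icloud.com"}  # abbreviated, illustrative list

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)

    # form: 1 if the prospect contacted the agency via email, 0 otherwise
    out["form"] = (df["form"] == "email").astype(int)

    # email: business (1) vs. private (0), based on the address domain
    domains = df["email"].str.split("@").str[-1].str.lower()
    out["email"] = (~domains.isin(CONSUMER_DOMAINS)).astype(int)

    # country_of_origin: one-hot encoding with k-1 dummies (reference category dropped)
    out = out.join(pd.get_dummies(df["country_of_origin"],
                                  prefix="country", drop_first=True).astype(int))

    # projects: cardinality encoding, i.e. the count of distinct requested services
    out["projects"] = df["projects"].str.split(",").apply(
        lambda s: len(set(x.strip() for x in s)) if isinstance(s, list) else 0)

    # Ratio variables: robust scaling with median and IQR (for the logistic models only)
    for col in ["days_to_contact", "logins", "number_of_visits"]:
        med = df[col].median()
        iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
        out[col] = (df[col] - med) / (iqr if iqr > 0 else 1.0)

    return out
```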

3.3. Pipeline Evaluation Design

3.3.1. Model Selection and Hyperparameter Tuning

Four classification algorithms have been evaluated, each with tuned hyperparameters, to identify the optimal model. The selection of algorithms was guided by their distinct characteristics to encompass a diverse range of modeling approaches, including linear models, bootstrap aggregation methods, and gradient boosting techniques. This strategy aimed to identify the optimal algorithm by ensuring a comprehensive evaluation across different paradigms (deep learning approaches were not tested due to the limited size of the training data and its tabular nature). Tested and fine-tuned algorithms included the following:
  • Binary Logistic Regression is a linear model that uses the Sigmoid function and produces the probability of the positive class as the output. Values 0, 0.1, 0.2, and 0.7 were tested for the l 2 regularization hyperparameter. The binary cross-entropy was used as the cost function. The l 2 regularization introduces a penalty to the model’s cost function:
    $\ell_2 = \sum_{j=1}^{p} \beta_j^2$,
    where p represents the number of parameters and β represents the coefficient [30].
  • Polynomial Logistic Regression extends the linear boundary to higher-degree polynomials. Two- to four-degree polynomials were tested while keeping the l 2 values the same as in the Binary Logistic Regression.
  • Random Forest is a tree-based bootstrap aggregation ensemble algorithm that makes predictions based on the hard voting of multiple decision trees [30]. The number of trees (50, 100, 300, 500), maximum depth (3, 5, 6, 10), and minimum entropy reduction threshold (0, 0.05, 0.1) were varied in the hyperparameter fine-tuning process.
  • eXtreme Gradient Boosting (XGBoost) is a tree-based gradient boosting ensemble algorithm. It differs from the Random Forest algorithm in two underlying ways: (i) it trains decision trees sequentially, minimizing the cost at each step, and (ii) instead of sampling with replacement, where each observation has a uniform probability of being selected ( 1 / m ), it gives greater weight to the observations misclassified by the previous trees, working on residual improvements at each sequence. (Residual improvements refer to the iterative error-correction process in XGBoost: each new tree is trained to predict the errors (residuals) made by all previous trees combined. For example, if the first tree predicts a value of 7 but the actual value is 10, the residual is 3; the next tree then focuses on predicting this error of 3, progressively reducing the overall prediction error with each sequential tree [31].) In the hyperparameter fine-tuning process, the number of trees (50, 100, 300, 500, 700, and 800) and L2 regularization (0, 0.1, 0.2, and 0.7) were tested.
Numerous studies [30,32,33,34,35,36] found that adjusting class imbalance at either the data or algorithmic level often distorts the predicted probabilities, leading to unreliable probability estimates. The artificially modified class distribution results in miscalibration, which undermines the trustworthiness and interpretability of predictions. Moreover, the observed degradation in calibration can outweigh any gains in classification metrics for the minority class. Consequently, to preserve accurate probability estimates in this research, altering class imbalance at the data or algorithm level was avoided. All hyperparameters and their corresponding values evaluated through grid search with 5-fold stratified cross-validation are summarized in Table 2.
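A hedged sketch of the grid search with 5-fold stratified cross-validation is shown below for the XGBoost candidate; grids for the other models are analogous. X_train and y_train are assumed preprocessed arrays, recent scikit-learn and xgboost versions are assumed, and n_estimators / reg_lambda are the library’s names for the tree-count and L2 hyperparameters.

```python
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

f2_scorer = make_scorer(fbeta_score, beta=2)   # F_{beta=2} used as the selection metric

# Example grid for the XGBoost candidate (values as listed in the text)
param_grid = {
    "n_estimators": [50, 100, 300, 500, 700, 800],
    "reg_lambda": [0, 0.1, 0.2, 0.7],          # L2 regularization strength
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid=param_grid,
    scoring=f2_scorer,
    cv=cv,
    n_jobs=-1,
)
search.fit(X_train, y_train)    # class imbalance deliberately left untouched
print(search.best_params_, search.best_score_)
```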

3.3.2. Evaluation Metrics

To address the distinct phases of the research pipeline, four evaluation metrics were introduced, two of which specifically assess algorithmic bias.
  • In the model selection phase, all models were assessed using the F β   =   2 evaluation metric, emphasizing recall to minimize false negatives in prospective collaborations while considering precision:
    $F_{\beta=2} = \dfrac{(1 + 2^2) \times Precision \times Recall}{(2^2 \times Precision) + Recall}$.
    Recent work has shown that non-decomposable metrics such as F can be embedded directly into a differentiable objective, eliminating the need for surrogate losses or heuristic resampling [37].
  • Probability calibration of the best-performing model from the first phase was evaluated with expected calibration error and stratified BS for positive and negative classes.
  • In the last phase, the performance of the best-performing model from the first phase was compared with that of its calibrated counterpart. This was carried out using four evaluation metrics: (i) $F_{\beta=2}$, (ii) probability calibration metrics (ECE and stratified Brier scores), (iii) the marginal contribution of a sensitive variable to the model’s predictions, and (iv) predictive parity. Metrics (iii) and (iv) were used to evaluate the fairness of the model.

3.3.3. Sensitive Variable Marginal Contribution

To quantify each predictor’s influence on predictions, the SHapley Additive exPlanations (SHAP) algorithm was applied. Introduced in 2017, SHAP is a method for reverse-engineering the behavior of any predictive model. The algorithm is based on Shapley values from game theory, which calculate the marginal contribution of each feature (or “player”) to the final prediction [38,39]. To increase explainability, SHAP values were converted into probabilities. The marginal contribution of a variable on a local level was calculated as the difference between the predicted probability and the interpolated sum of the Shapley values, subtracted by the Shapley value of the feature. Formally:
$\Delta P = \hat{y}_i - f\left( \sum_{j=1}^{n} shap_{ij} - shap_{ij} \right)$,
where $\Delta P$ represents the marginal contribution of the variable to the predicted probability, $\hat{y}_i$ is the predicted probability for instance i, $f(\cdot)$ denotes the mapping from Shapley-value space to probabilities, $\sum_{j=1}^{n} shap_{ij}$ is the total sum of all Shapley values for instance i, while $shap_{ij}$ denotes the Shapley value of feature j for instance i.
Specifically, a decrease in Δ P , after applying calibration, indicates a diminished sensitivity to the sensitive predictor, thereby implying reduced bias. A visual overview of the research pipeline, outlining each phase from data preprocessing to fairness evaluation, is presented in Figure 1.
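For illustration, a minimal sketch of this computation for a tree-based model is given below. It assumes a fitted XGBClassifier (model), a test frame X_test, and a feature_names list; the sigmoid is used as the mapping f from SHAP (log-odds) space to probabilities, which is one reasonable reading of the “interpolation” described above rather than the authors’ exact procedure.

```python
import numpy as np
import shap

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

explainer = shap.TreeExplainer(model)        # model: the fitted XGBClassifier
explanation = explainer(X_test)              # SHAP values in log-odds (margin) space
shap_vals = explanation.values               # shape: (n_instances, n_features)
base_value = explanation.base_values         # expected model output per instance
y_hat = model.predict_proba(X_test)[:, 1]    # predicted probabilities

def marginal_contribution(j):
    """Delta P for feature j: predicted probability minus the probability obtained
    when feature j's SHAP contribution is removed from the total attribution."""
    total = base_value + shap_vals.sum(axis=1)   # reconstructs the model margin
    without_j = total - shap_vals[:, j]          # drop feature j's contribution
    return y_hat - sigmoid(without_j)

delta_p_serbia = marginal_contribution(feature_names.index("country_Serbia"))
```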

4. Results

This section presents the outcomes of model training, evaluation, and the impact of probability calibration on predictive performance and fairness.

4.1. Models Performance

Following the described Methodology, six configurations of binary classification algorithms were evaluated: Logistic Regression (LR), Polynomial Logistic Regression with increasing polynomial degrees (2, 3, and 4), Random Forest (RF), and XGBoost. Each configuration underwent a hyperparameter-tuning phase using a 5-fold stratified cross-validation procedure, with f β   =   2 as the primary performance metric. Figure 2 summarizes the outcomes on training, cross-validation, and test sets, with error bars representing 95% confidence intervals.
  • Binary Logistic Regression (best hyperparameters: l 2 = 0.1 ): It achieved moderate f β   =   2 scores across training (0.56), cross-validation (0.53), and test sets (0.52). Despite its simplicity, it served as a benchmark for comparing more advanced models.
  • Polynomial Logistic Regression with second-degree polynomials (best hyperparameters: l 2 = 0.2 ): Introducing second-degree polynomials improved the f β   =   2 score to 0.58 (training set), 0.57 (cross-validation set), and 0.54 (test set). The additional polynomial terms helped capture non-linear relationships but slightly increased the risk of overfitting.
  • Polynomial Logistic Regression with third-degree polynomials (best hyperparameters: l 2 = 0.2 ): With third-degree polynomial terms, the model reached an average training score of 0.58, a cross-validation score of 0.56, and a test score of 0.54;
  • Polynomial Logistic Regression with fourth-degree polynomials (best hyperparameters: l 2 = 0.7 ): The further increase in polynomial complexity yielded a training f β = 2 score of 0.57, a cross-validation score of 0.55, and a test score of 0.51.
  • Random Forest (best hyperparameters: number of trees = 100, l 2 = 0 , maximum depth = 3, minimum entropy reduction = 0): Random Forest provided higher scores with slightly higher variance, achieving a 0.81 f β   =   2 score on the training set, 0.70   f β   =   2 score on the cross-validation set, and 0.67 f β   =   2 score on the test set. The ensemble nature of the algorithm helped stabilize predictions, although increasing the forest size or depth did not consistently yield higher scores. The higher l2 regularization partially mitigated the overfitting but did not fully close the created gap.
  • eXtreme Gradient Boosting (best hyperparameters: number of trees = 300, l 2 = 0.2 ): XGBoost emerged as the top-performing algorithm, with the highest f β   =   2 scores across all three datasets: 0.87 (training set), 0.82 (cross-validation set), and 0.80 (test set). Its sequential tree-building process appears to be better suited to capture the complex patterns in the unbalanced datasets, while moderate l2 regularization prevented severe overfitting. Furthermore, XGBoost exhibited stable performance with narrow 95% confidence intervals, indicating low variance across cross-validation folds.
Overall, XGBoost demonstrated the best generalization in f β   =   2 score, indicating a stronger capability to detect positive (collaboration) cases while maintaining reasonable precision. The polynomial expansions in Logistic Regression did offer incremental gains but were ultimately outperformed by tree-based ensembles. The chosen XGBoost model, configured with 300 boosting rounds and l 2 regularization of 0.2, served as the baseline for subsequent calibration and bias-reduction experiments.

4.2. Base Model Evaluation

After fixing the XGBoost configuration identified in Section 4.1 (300 estimators, l 2 = 0.2 ), probabilistic reliability and fairness was analyzed prior to any post hoc calibration. Figure 3 visualizes those results, where the y-axis represents ECE, Brier scores for positive and negative classes (all ranging from 0 to 1, with lower values indicating better calibration), and predictive parity values (PPVs) for each country (representing precision scores ranging from 0 to 1, with higher values indicating better performance):
  • Calibration quality: ECE and stratified Brier scores showed that, although the model discriminates well (high f β   =   2 ), its probability outputs remain overconfident. This is a common problem present in gradient boosting ensemble algorithms [10]. Both class-specific Brier scores sit far above the irreducible noise for this prevalence, confirming that raw scores cannot be interpreted as reliable probabilities.
  • Fairness implications: The predictive parity values shown in Figure 3 were computed as P(Y = 1|Ŷ = 1, A = country of origin) for each country of origin, which equals the precision (positive predictive value) for each country of origin subgroup as defined in Equation (3) of Section 2.2. Perfect predictive parity would require these values to be equal across all countries of origin. However, the observed values range from 0.78 (Germany) to 0.28 (Montenegro), revealing substantial disparities. Because predictive parity equals subgroup-specific precision (Section 2.2), the 0.78 → 0.28 spread revealed a systematic underperformance for Serbian and Montenegrin prospects. Two factors compounded this:
    • Calibration deficit: over-confident probabilities inflate false positives disproportionately in low-base-rate countries.
    • Sample-size imbalance: Serbia (372) and Montenegro (326) contribute <25% of the German sample (1123), raising variance in posterior estimates.
Such disparities violate the conditional-use-accuracy fairness principle and risk operational discrimination in the agency’s lead-selection workflow.
Furthermore, mean marginal contributions of predictors were tested in making false predictions (as defined in Section 3.3.3). As shown in Table 3, country_Serbia and country_Montenegro predictors contributed ≈ 4–6 times more to erroneously positive predictions than any other predictors. The same predictors exhibited the strongest negative deltas among false negatives, implying an over-correction when the true class is positive.
Together with the predictive-parity gaps (0.32 and 0.28; Figure 3), these SHAP-based deltas confirmed that country_Serbia and country_Montenegro were influential sources of bias in the base model. The relatively stable contributions of the majority-group inputs (country_Germany, country_Croatia, country_Slovenia, and country_Italy) further highlighted the disproportionate influence of the Serbia and Montenegro predictors on model predictions.

4.3. Post Hoc Model Calibration

Probability calibration was performed on the XGBoost baseline model by applying three widely used post-processing techniques—Platt scaling, isotonic regression and temperature scaling—using the hold-out calibration set (test set, 20% of the original dataset). Calibration quality was quantified with the ECE and the stratified BS for the positive and negative classes, in accordance with the evaluation protocol defined in Section 3. Figure 4, Figure 5 and Figure 6 present the resulting metrics, and their y-axes are expressed in exactly the same readily interpretable units as Figure 3: ECE and class-specific Brier scores (all ranging from 0 to 1, where lower values denote better calibration) alongside country-of-origin-level predictive-parity values (PPVs), ranging from 0 to 1, where higher values indicate better performance.
Isotonic regression provided the strongest calibration improvement, cutting ECE by 50% (0.12 → 0.06) and reducing both class-specific Brier scores by about 44% (0.09 → 0.05). Temperature scaling offered some improvement, trimming ECE by about one-third (0.12 → 0.08) and lowering the Brier scores by ≈11% (0.09 → 0.08). Platt scaling performed reasonably well, lowering ECE by 25% (0.12 → 0.09) and bringing Brier scores down by roughly 11% (0.09 → 0.08). This superior performance of isotonic regression is supported by its monotonic, piecewise constant nature, which better accommodates strongly non-linear score–probability relationships that are typically produced by gradient-boosted ensembles. In contrast, both temperature scaling and Platt scaling constrain the mapping to parametric forms (power transformation and logistic curve, respectively), assumptions that may be suboptimal for highly imbalanced lead-generation datasets.
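Continuing the hedged sketches from Sections 2.3.1 and 2.3.2, the three calibrated probability vectors can be compared on the hold-out calibration set with the same ECE and stratified Brier helpers; the printed numbers are illustrative and will differ from the values reported in this study.

```python
# Reuses raw_scores, platt_probs, iso_probs, temp_probs from the Section 2.3.1 sketch
# and expected_calibration_error / stratified_brier from the Section 2.3.2 sketch.
candidates = {
    "uncalibrated": raw_scores,
    "platt": platt_probs,
    "isotonic": iso_probs,
    "temperature": temp_probs,
}
for name, p in candidates.items():
    ece = expected_calibration_error(y_cal, p)
    bs_pos, bs_neg = stratified_brier(y_cal, p)
    print(f"{name:>13}: ECE={ece:.3f}  BS+={bs_pos:.3f}  BS-={bs_neg:.3f}")
```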
In accordance with the research framework, the isotonic regression-calibrated XGBoost model was propagated to Stage 4 of the pipeline, wherein, in addition to ECE and stratified Brier scores, its effect on (i) the f β = 2 score, (ii) the marginal contributions of the sensitive country-of-origin indicators, and (iii) the predictive-parity gaps was assessed.

4.4. Fairness and Marginal Contribution Insights

After applying isotonic regression, the classifier was re-evaluated on the held-out test set. A five-percentage-point absolute gain in the f β = 2 score (from 0.80 to 0.85) was obtained, indicating that the calibration step preserved, and slightly enhanced, the model’s ability to recover true positives while maintaining high precision.
To examine whether the fairness gains observed in Section 4.3 translated into reduced societal feature sensitivity, the mean marginal contributions (ΔP) of each predictor were recomputed for the calibrated model. Table 4 lists the average marginal contributions of the calibrated model’s predictors in predicted probability for false-positive predictions ($\overline{\Delta P_{FP}}$) and false-negative predictions ($\overline{\Delta P_{FN}}$).
As shown in Table 4, the average marginal contribution of the country_Serbia variable to false-positive predictions fell from 0.14 to 0.05 (a 64% drop in magnitude), while the impact on false-negative predictions decreased from −0.11 to −0.05 (a 55% smaller magnitude). Furthermore, the reductions for country_Montenegro were similar: from 0.12 to 0.04 (−66.7%) for false-negative predictions and from −0.13 to −0.06 (−54%) in the case of false-positive predictions.
Combined with the lower ECE and class-specific Brier scores already reported, these results confirmed that isotonic regression calibration simultaneously improved probabilistic reliability and mitigated country-specific bias, meeting the dual optimization criterion laid out in the research framework.
To address the statistical significance of calibration improvements, bootstrap-based confidence interval estimation was implemented following established practices in machine learning evaluation [40,41]. Stratified bootstrap resampling with 1000 iterations was employed to preserve the class distribution (1.4% positive class) while generating empirical distributions of the evaluation metrics.
For each bootstrap sample:
  • The test set was resampled with replacement while maintaining class proportions;
  • ECE and Brier scores were calculated for both uncalibrated and calibrated models;
  • The difference in metrics between calibrated and uncalibrated versions was computed.
A 95% confidence interval was reported using the percentile method, which provides robust estimates without distributional assumptions [42]. This approach is particularly suitable for calibration metrics, which may not follow normal distributions [43].
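A minimal sketch of this percentile-bootstrap procedure is shown below, assuming arrays y_test, probs_base, and probs_cal and the expected_calibration_error helper from the Section 2.3.2 sketch; the same loop can be repeated for the stratified Brier scores.

```python
import numpy as np

rng = np.random.default_rng(42)
pos_idx = np.where(y_test == 1)[0]
neg_idx = np.where(y_test == 0)[0]

diffs = []
for _ in range(1000):
    # Stratified resampling with replacement preserves the 1.4% positive-class rate
    sample = np.concatenate([rng.choice(pos_idx, size=len(pos_idx), replace=True),
                             rng.choice(neg_idx, size=len(neg_idx), replace=True)])
    ece_base = expected_calibration_error(y_test[sample], probs_base[sample])
    ece_cal = expected_calibration_error(y_test[sample], probs_cal[sample])
    diffs.append(ece_base - ece_cal)          # positive values favor the calibrated model

diffs = np.array(diffs)
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])   # 95% percentile confidence interval
p_value = np.mean(diffs <= 0)                         # share of samples where calibration did not help
print(f"ECE improvement: 95% CI [{ci_low:.4f}, {ci_high:.4f}], p ~ {p_value:.4f}")
```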
As shown in Figure 7, the calibrated model confidence intervals show maximum variations of approximately 1% around the point estimates, confirming the stability of calibration improvements presented in this study. All improvements are statistically significant with p-values < 0.001, calculated as the proportion of bootstrap samples where the calibrated model performed worse than the baseline.

4.5. Robustness Analysis

To assess the stability and generalizability of the proposed calibration-based bias-reduction approach, a comprehensive robustness analysis was conducted. Robustness evaluation is essential in machine learning applications to ensure that model performance and fairness properties persist under realistic deployment conditions where data quality may be compromised [44,45]. This analysis examines three critical dimensions of robustness: noise resilience, feature perturbation tolerance, and missing data handling.

4.5.1. Noise Injection Analysis

Real-world data collection processes are inherently susceptible to measurement errors, sensor noise, and human input variability [46,47]. To evaluate the robustness of the isotonic regression calibration approach under such conditions, Gaussian noise was systematically injected into the numerical features of the test dataset. The noise injection protocol followed established practices in machine learning robustness evaluation [44,48], where noise levels were calibrated relative to the natural variability of each feature.
For each numerical predictor xj, noise was added according to:
$x_j^{noisy} = x_j + \mathcal{N}(0, \epsilon \cdot \sigma_j^2)$,
where $\mathcal{N}(0, \epsilon \cdot \sigma_j^2)$ represents Gaussian noise with zero mean and variance proportional to the feature’s empirical variance $\sigma_j^2$, and $\epsilon \in \{0.1, 0.2, 0.3\}$ represents increasing noise intensity levels corresponding to 10%, 20%, and 30% of the feature’s natural variability [48,49].
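A minimal sketch of this noise-injection step, following the equation above (noise variance $\epsilon \cdot \sigma_j^2$), is given below; NUMERIC_COLS and X_test are assumed names, and the re-evaluation step is indicated only as a comment.

```python
import numpy as np

rng = np.random.default_rng(0)
NUMERIC_COLS = ["days_to_contact", "logins", "number_of_visits"]  # assumed column names

def inject_noise(X, eps):
    """Add zero-mean Gaussian noise with variance eps * sigma_j^2 to each numeric column."""
    X_noisy = X.copy()
    for col in NUMERIC_COLS:
        sigma = X[col].std()
        X_noisy[col] = X[col] + rng.normal(0.0, np.sqrt(eps) * sigma, size=len(X))
    return X_noisy

for eps in (0.1, 0.2, 0.3):
    X_noisy = inject_noise(X_test, eps)
    # Re-evaluate the calibrated model on X_noisy: ECE, stratified Brier scores, F2,
    # per-country precision, and SHAP-based marginal contributions.
```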
Table 5 presents the calibration performance metrics and fairness indicators under progressive noise injection. The isotonic regression calibration demonstrates remarkable stability across all noise levels. Even under the highest noise condition (30%), the ECE increases only marginally from 0.06 to 0.08, while the stratified Brier scores for both classes remain consistently low (≤0.06). The f β   =   2 score shows minimal degradation, decreasing from 0.85 to 0.82 under maximum noise conditions, indicating preserved discriminative ability.
Critically, the predictive parity metrics across countries demonstrate consistent patterns under noise injection. The higher-performing countries (Germany, Croatia, Slovenia and Italy) maintain their precision levels with only gradual decreases, while the disparity with lower-performing countries (Serbia, Montenegro) remains stable. This consistency suggests that the calibration-induced fairness patterns are not artifacts of specific data conditions but represent robust algorithmic behavior [50,51].
The analysis of marginal contributions under noise injection, presented in Table 6, provides deeper insights into feature-level robustness. Notably, the country-of-origin-specific features (Serbia, Montenegro) maintain relatively low marginal contributions to false predictions across all noise levels, confirming that the bias reduction achieved through calibration persists under reduced data quality. For instance, country_Serbia’s contribution to false positives remains in the range of 0.04–0.08 across noise conditions, substantially lower than the uncalibrated baseline of 0.14 reported in Table 3 of the main analysis.
The robustness of non-sensitive features also demonstrates stability, with predictors such as logins, number_of_visits, and days_to_contact maintaining consistent marginal contribution patterns. This stability indicates that the calibration process does not introduce artificial sensitivity to noise in the broader feature space while specifically targeting bias reduction in sensitive attributes [52].

4.5.2. Feature Perturbation Analysis

Feature perturbation analysis evaluates model resilience when individual data points are corrupted or when systematic biases affect specific features [53,54]. This scenario commonly occurs in practice due to data transmission errors, system failures, or deliberate data manipulation attempts. The perturbation protocol randomly selected features for modification following a Bernoulli process with probability $p \in \{0.05, 0.10, 0.15\}$, representing 5%, 10%, and 15% feature corruption rates [55].
For categorical features (form, email, and country-of-origin indicators), perturbation involved random label flipping, while numerical features were replaced with values sampled from their empirical distributions. This approach simulates realistic corruption patterns while preserving the overall data structure [56,57].
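A minimal sketch of this perturbation protocol is given below, under assumptions: the column lists, X_test, and X_train (used as the empirical reference distribution) are illustrative names, and each cell is corrupted independently with probability p.

```python
import numpy as np

rng = np.random.default_rng(0)
BINARY_COLS = ["form", "email"] + [c for c in X_test.columns if c.startswith("country_")]
NUMERIC_COLS = ["days_to_contact", "logins", "number_of_visits", "projects"]

def perturb(X, X_reference, p):
    """Corrupt each cell independently with probability p."""
    X_pert = X.copy()
    for col in BINARY_COLS:
        mask = rng.random(len(X)) < p
        X_pert.loc[mask, col] = 1 - X_pert.loc[mask, col]        # random label flipping
    for col in NUMERIC_COLS:
        mask = rng.random(len(X)) < p
        # Replace corrupted cells with draws from the empirical distribution
        X_pert.loc[mask, col] = rng.choice(X_reference[col].to_numpy(), size=mask.sum())
    return X_pert

for p in (0.05, 0.10, 0.15):
    X_corrupted = perturb(X_test, X_train, p)
    # Re-evaluate calibration quality, F2, and per-country precision on X_corrupted.
```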
The results, summarized in Table 7, demonstrate that the non-parametric nature of isotonic regression provides inherent robustness advantages over parametric calibration methods [22,58]. Unlike Platt scaling, which relies on specific distributional assumptions, isotonic regression’s piecewise-constant monotonic mapping adapts organically to perturbed data patterns [22,59]. The calibration performance metrics exhibit graceful degradation under increasing perturbation levels: ECE increases modestly from 0.06 to 0.07, while the stratified Brier scores remain low (≤0.07) across all perturbation intensities.
Notably, the f β = 2 score demonstrated stability, declining only from 0.85 to 0.81 under maximum perturbation conditions (15%). This preservation of discriminative performance indicates that the calibrated model maintains its ability to identify true-positive cases even when substantial portions of the feature space are corrupted [21].
The predictive parity analysis reveals consistent fairness patterns under feature perturbation. Higher-performing countries (Germany, Croatia, Slovenia, Italy) maintain their precision advantages with only gradual decreases, while the performance gap with lower-performing countries (Serbia, Montenegro) remains stable. Specifically, the PPV disparity between Germany and Serbia remains essentially unchanged, at 0.35 (0.80 vs. 0.45) in clean conditions and 0.35 (0.77 vs. 0.42) under maximum perturbation, indicating that the fairness characteristics observed in optimal conditions persist under realistic data corruption scenarios [16,51].
The marginal contribution analysis under feature perturbation, presented in Table 8, provides critical insights into bias stability. The country-specific features maintain relatively balanced marginal contributions across perturbation levels, with no single country demonstrating a disproportionate growth in influence. For instance, country_Serbia’s contribution to false positives ranges from 0.05 to 0.10 across perturbation conditions, remaining substantially below the uncalibrated baseline of 0.14. Similarly, country_Montenegro exhibits controlled marginal contributions (0.04–0.11), confirming that the bias reduction achieved through calibration persists under feature-level corruption.
The uniform increase in marginal contributions across all predictors under higher perturbation levels (15%) indicates that the degradation is distributed rather than concentrated in sensitive features. This pattern suggests that the calibration process maintains its bias-reducing properties while the overall model uncertainty increases proportionally across the feature space [60]. The preservation of relative feature importance relationships under perturbation strengthens confidence in the stability of the fairness intervention.
These findings demonstrate that isotonic regression calibration maintains both predictive performance and fairness properties under realistic feature corruption scenarios that commonly occur in operational marketing analytics environments [61,62].

5. Discussion

The primary objective of this study was to empirically evaluate the effectiveness of probability calibration in reducing algorithmic bias associated with sensitive attributes in binary classification models. The results strongly support the hypothesis that calibrated models significantly reduce the disproportionate predictive influence of socially sensitive features such as country of origin. Specifically, isotonic regression calibration notably decreased the marginal contribution of country-of-origin-related predictors, such as country_Serbia and country_Montenegro, by over 50% in both false-positive and false-negative predictions, directly aligning with this research hypothesis.
These findings provide empirical validation and extend previous theoretical and practical insights discussed in the literature [25,26,27] affirming that calibration techniques not only improve predictive reliability but also promote fairness. The considerable reduction in ECE and stratified Brier scores further underscores calibration’s dual role in enhancing probabilistic accuracy and mitigating bias simultaneously.
The use of SHAP values proved highly effective for quantifying feature-level bias in this study. The interpretability remained consistent across both calibrated and uncalibrated models, allowing for direct comparison of feature contributions. This consistency is particularly valuable for real-world deployment, as practitioners can monitor how calibration affects the influence of sensitive attributes without changing their interpretability framework. For practical implementation in lead generation systems, the approach in this study offers a straightforward post-processing step that can be applied to existing models without retraining, making it particularly attractive for organizations with established ML pipelines.
However, this study does have certain limitations, such as the highly imbalanced nature of the dataset and its specific marketing context, which might restrict the generalizability of findings to other fields or datasets with different characteristics. In addition, the absence of an external validation set means that the results, while based on proper cross-validation and held-out test sets, have not been validated on completely independent data from different time periods or geographic regions. As a result, the temporal stability and geographic transferability of the observed bias reduction effects cannot be assessed. Future research should explore calibration’s bias-reducing capabilities using diverse calibration methods, larger datasets, and across varied industry domains to assess the broader applicability and robustness of results that are demonstrated in this study.

6. Conclusions

In this study, the role of probability calibration in addressing algorithmic bias within binary classification models was explored. The results clearly indicate that applying calibration techniques can effectively reduce biases associated with sensitive attributes, thus promoting fairer and more reliable predictions.
The analysis showed that calibration methods, such as isotonic regression, notably decreased the influence of sensitive features, enhancing both fairness and the accuracy of predictive outcomes. A key contribution of this work is demonstrating that calibration affects bias at the feature level, not just at the group level. By using SHAP values to quantify individual feature contributions, the results from this study show that calibration specifically reduces the disproportionate influence of sensitive attributes in model predictions. This feature-level perspective provides more granular insights into how bias operates within models and how calibration interventions affect individual predictive factors.
These findings highlight the broader importance of probability calibration as a fundamental step toward ethical and transparent decision-making in machine learning. The significance of this research extends beyond traditional group-level fairness metrics to offer a new lens for understanding and mitigating algorithmic bias at its source: the individual features that drive predictions.
This study opens several promising avenues for future research. First, the feature-level bias analysis approach demonstrated here should be extended to develop more sophisticated methods for detecting and mitigating feature-specific biases. Future work could explore how different features interact to create compound biases and how calibration affects these interactions. Additionally, developing metrics that quantify feature-level fairness alongside traditional group-level metrics would provide a more comprehensive framework for assessing algorithmic bias.
Second, broader fairness testing should be conducted across multiple demographic variables simultaneously, examining intersectional biases that may arise when considering combinations of sensitive attributes, such as country of origin, gender, age, and socioeconomic indicators. Understanding how calibration affects these compound biases at both the feature and group levels would provide deeper insights into fairness-aware machine learning.
Third, the application of the calibration-based bias mitigation approach should be extended to other industries and data types. While the study was focused on lead generation in marketing, similar bias challenges are encountered in healthcare (patient risk scoring), finance (credit scoring), human resources (resume screening), and criminal justice (recidivism prediction). Each domain presents unique fairness requirements and constraints that could benefit from tailored calibration strategies. Additionally, extending this work to other data modalities beyond tabular data, such as text and image classification, would test the generalizability of calibration as a bias reduction technique.
Fourth, there is significant potential for developing specialized post-processing fairness calibration methods that explicitly optimize for both calibration quality and fairness metrics simultaneously. Current calibration techniques like isotonic regression and Platt scaling were not designed with fairness objectives in mind. Novel methods could incorporate fairness constraints directly into the calibration optimization process, potentially achieving better trade-offs between accuracy, calibration, and various fairness criteria. These methods could be designed to target specific features identified as sources of bias, enabling more precise bias mitigation strategies.
In conclusion, integrating calibration processes into machine learning workflows is crucial for mitigating bias and ensuring the responsible use of predictive models across real-world applications. The feature-level analysis approach demonstrated in this study provides a more nuanced understanding of how calibration affects model fairness, moving beyond aggregate group statistics to examine the mechanistic sources of bias. As ML systems continue to influence critical decisions affecting human lives, the development of robust, fair, and well-calibrated models becomes not just a technical challenge, but an ethical imperative.

Author Contributions

Conceptualization, M.N.; investigation, M.N.; methodology, M.N. and D.N.; project administration, M.N.; supervision, D.N., M.S., S.K. and D.S.; validation, M.S. and D.S.; writing—original draft, M.N. and D.N.; writing—review and editing, D.N., M.S. and S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been supported by the Ministry of Science, Technological Development and Innovation through Contract No. 451-03-136/2025-03/200156.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Russo, D.D.; Milella, F.; Di Felice, G. Fairness in Healthcare Services for Italian Older People: A Convolution-Based Evaluation to Support Policy Decision Makers. Mathematics 2025, 13, 1448. [Google Scholar] [CrossRef]
  2. Ueda, D.; Kakinuma, T.; Fujita, S.; Kamagata, K.; Fushimi, Y.; Ito, R.; Matsui, Y.; Nozaki, T.; Nakaura, T.; Fujima, N.; et al. Fairness of Artificial Intelligence in Healthcare: Review and Recommendations. Jpn. J. Radiol. 2024, 42, 3–15. [Google Scholar] [CrossRef] [PubMed]
  3. Das, S.; Donini, M.; Gelman, J.; Haas, K.; Hardt, M.; Katzman, J.; Kenthapadi, K.; Larroy, P.; Yilmaz, P.; Zafar, M.B. Fairness Measures for Machine Learning in Finance. J. Financ. Data Sci. 2021, 3, 33–64. [Google Scholar] [CrossRef]
  4. Akter, S.; Dwivedi, Y.K.; Sajib, S.; Biswas, K.; Bandara, R.J.; Michael, K. Algorithmic Bias in Machine Learning-Based Marketing Models. J. Bus. Res. 2022, 144, 201–216. [Google Scholar] [CrossRef]
  5. Rodolfa, T.K.; Lamba, H.; Ghani, R. Empirical Observation of Negligible Fairness-Accuracy Trade-Offs in Machine Learning for Public Policy. Nat. Mach. Intell. 2021, 3, 896–904. [Google Scholar] [CrossRef]
  6. Tien Dung, P.; Giudici, P. Sustainability, Accuracy, Fairness, and Explainability (SAFE) Machine Learning in Quantitative Trading. Mathematics 2025, 13, 442. [Google Scholar] [CrossRef]
  7. Goellner, S.; Tropmann-Frick, M.; Brumen, B. Responsible Artificial Intelligence: A Structured Literature Review. arXiv 2024, arXiv:2403.06910. [Google Scholar] [CrossRef]
  8. Fonseca, P.G.; Lopes, H.D. Calibration of Machine Learning Classifiers for Probability of Default Modelling. arXiv 2017, arXiv:1710.08901. [Google Scholar] [CrossRef]
  9. Ojeda, F.M.; Baker, S.G.; Ziegler, A. Calibrating Machine Learning Approaches for Probability Estimation: A Comprehensive Comparison. Stat. Med. 2023, 42, 4212–4215. [Google Scholar] [CrossRef]
  10. Pleiss, G.; Raghavan, M.; Wu, F.; Kleinberg, J.; Weinberger, K.Q. On Fairness and Calibration. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5680–5689. [Google Scholar]
  11. Brahmbhatt, A.; Rathore, V.; Singla, P. Towards Fair and Calibrated Models. arXiv 2023, arXiv:2310.10399. [Google Scholar]
  12. Chen, I.; Johansson, F.D.; Sontag, D. Why Is My Classifier Discriminatory? In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
  13. Zhang, Z.; Neill, D.B. Identifying Significant Predictive Bias in Classifiers. arXiv 2016, arXiv:1611.08292. [Google Scholar]
  14. Barocas, S.; Selbst, A.D. Big Data’s Disparate Impact. Calif. Law Rev. 2016, 104, 671–732. [Google Scholar] [CrossRef]
  15. O’Neil, C. Weapons of Math Destruction; Crown Publishing Group: New York, NY, USA, 2016. [Google Scholar]
  16. Hardt, M.; Price, E.; Srebro, N. Equality of Opportunity in Supervised Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; Volume 29, pp. 3315–3323. [Google Scholar]
  17. Buolamwini, J.; Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability, and Transparency (FAT*), New York, NY, USA, 23–24 February 2018; Volume 81, pp. 77–91. [Google Scholar]
  18. Najibi, A.; Shu, F.; Bouzerdoum, A. Bias and Fairness in Computer Vision Applications: A Survey. IEEE Access 2021, 9, 141119–141133. [Google Scholar] [CrossRef]
  19. Chouldechova, A. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data 2017, 5, 153–163. [Google Scholar] [CrossRef]
  20. Kleinberg, J.; Mullainathan, S.; Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. arXiv 2016, arXiv:1609.05807. [Google Scholar]
  21. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6–11 August 2017; pp. 1321–1330. [Google Scholar]
  22. Platt, J. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers; MIT Press: Cambridge, MA, USA, 1999; pp. 61–74. [Google Scholar]
  23. Naeini, M.P.; Cooper, G.F.; Hauskrecht, M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 2901–2907. [Google Scholar]
  24. Zadrozny, B.; Elkan, C. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. In Proceedings of the 18th International Conference on Machine Learning (ICML), San Francisco, CA, USA, 28 June–1 July 2001; pp. 609–616. [Google Scholar]
  25. Chen, Z.; Zhang, J.M.; Sarro, F.; Harman, M. A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers. ACM Trans. Softw. Eng. Methodol. 2023, 32, 106. [Google Scholar] [CrossRef]
  26. Raff, E.; Sylvester, J.; Mills, S. Fair Forests: Regularized Tree Induction to Minimize Model Bias. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA, 2–3 February 2018. [Google Scholar]
  27. Corbett-Davies, S.; Pierson, E.; Feller, A.; Goel, S.; Huq, A. Algorithmic Decision Making and the Cost of Fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017. [Google Scholar]
  28. Degordian. Creative and Digital Agency. 2025. Available online: https://degordian.com/ (accessed on 8 May 2025).
  29. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer Series in Statistics; Springer: New York, NY, USA, 2001. [Google Scholar]
  30. van den Goorbergh, R.; van Smeden, M.; Timmerman, D.; Van Calster, B. The Harm of Class Imbalance Corrections for Risk Prediction Models: Illustration and Simulation Using Logistic Regression. J. Am. Med. Inform. Assoc. 2022, 29, 1525–1534. [Google Scholar] [CrossRef]
  31. Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A Comparative Analysis of Gradient Boosting Algorithms. Artif. Intell. Rev. 2019, 54, 1937–1967. [Google Scholar] [CrossRef]
  32. Caplin, A.; Martin, D.; Marx, P. Calibrating for Class Weight by Modeling Machine Learning. arXiv 2022. [Google Scholar] [CrossRef]
  33. Phelps, N.; Lizotte, D.J.; Woolford, D.G. Using Platt’s Scaling for Calibration After Undersampling—Limitations and How to Address Them. arXiv 2024, arXiv:2410.18144. [Google Scholar] [CrossRef]
  34. Pozzolo, A.D.; Caelen, O.; Johnson, R.A.; Bontempi, G. Calibrating Probability with Undersampling for Unbalanced Classification. In Proceedings of the IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa, 7–10 December 2015; pp. 159–166. [Google Scholar]
  35. George, B.R.; Ke, J.X.C.; DhakshinaMurthy, A.; Branco, P. The Effect of Resampling Techniques on the Performance of Machine Learning Clinical Risk Prediction Models in the Setting of Severe Class Imbalance: Development and Internal Validation in a Retrospective. Discov. Artif. Intell. 2024, 4, 1049–1065. [Google Scholar] [CrossRef]
  36. Welvaars, K.; Oosterhoff, J.H.F.; van den Bekerom, M.P.J.; Doornberg, J.N.; van Haarst, E.P.; OLVG Urology Consortium; the Machine Learning Consortium. Implications of Resampling Data to Address the Class Imbalance Problem (IRCIP): An Evaluation of Impact on Performance Between Classification Algorithms in Medical Data. JAMIA Open 2023, 6, ooad033. [Google Scholar] [CrossRef] [PubMed]
  37. Fathony, R.; Kolter, J.Z. AP-Perf: Incorporating Generic Performance Metrics in Differentiable Learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Okinawa, Japan, 16–18 April 2019. [Google Scholar]
  38. Marcílio, W.E.; Eler, D.M. From Explanations to Feature Selection: Assessing SHAP Values as Feature Selection Mechanism. In Proceedings of the 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Porto de Galinhas, Brazil, 7–10 November 2020; pp. 340–347. [Google Scholar]
  39. Rodríguez-Pérez, R.; Bajorath, J. Interpretation of Machine Learning Models Using Shapley Values: Application to Compound Potency and Multi-Target Activity Predictions. J. Comput.-Aided Mol. Des. 2020, 34, 1013–1026. [Google Scholar] [CrossRef]
  40. Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994. [Google Scholar]
  41. Bouthillier, X.; Delaunay, P.; Bronzi, M.; Trofimov, A.; Nichyporuk, B.; Szeto, J.; Sepah, N.; Raff, E.; Madan, K.; Voleti, V.; et al. Accounting for Variance in Machine Learning Benchmarks. In Proceedings of the Machine Learning and Systems, New York, NY, USA, 26 April 2021; Volume 3, pp. 747–769. [Google Scholar]
  42. DiCiccio, T.J.; Efron, B. Bootstrap Confidence Intervals. Stat. Sci. 1996, 11, 189–228. [Google Scholar] [CrossRef]
  43. Vaicenavicius, J.; Widmann, D.; Andersson, C.; Lindsten, F.; Roll, J.; Schön, T.B. Evaluating Model Calibration in Classification. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Okinawa, Japan, 16–18 April 2019; Volume 89, pp. 3459–3467. [Google Scholar]
  44. Hendrycks, D.; Dietterich, T. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  45. Frénay, B.; Verleysen, M. Classification in the Presence of Label Noise: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 845–869. [Google Scholar] [CrossRef]
  46. Rakin, A.S.; He, Z.; Fan, D. Parametric Noise Injection: Trainable Randomness to Improve Deep Neural Network Robustness against Adversarial Attack. arXiv 2018, arXiv:1811.09310. [Google Scholar]
  47. Zanotto, S.; Aroyehun, S. Human Variability vs. Machine Consistency: A Linguistic Analysis of Texts Generated by Humans and Large Language Models. arXiv 2024, arXiv:2412.03025. [Google Scholar]
  48. Ovadia, Y.; Fertig, E.; Ren, J.; Nado, Z.; Sculley, D.; Nowozin, S.; Dillon, J.V.; Lakshminarayanan, B.; Snoek, J. Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. In Proceedings of the Advances in Neural Information Processing Systems, Red Hook, NY, USA, 8–14 December 2019. [Google Scholar]
  49. Sáez, J.A.; Galar, M.; Luengo, J.; Herrera, F. Analyzing the Presence of Noise in Multi-Class Problems: Alleviating Its Influence with the One-vs-One Decomposition. Knowl. Inf. Syst. 2014, 38, 179–206. [Google Scholar] [CrossRef]
  50. Huang, Y.; Gupta, S. Stable and Fair Classification. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
  51. Corbett-Davies, S.; Goel, S. The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. arXiv 2018, arXiv:1808.00023. [Google Scholar]
  52. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Red Hook, NY, USA, 4–9 December 2017; Volume 30, pp. 4765–4774. [Google Scholar]
  53. Carlini, N.; Wagner, D. Towards Evaluating the Robustness of Neural Networks. In Proceedings of the IEEE Symposium on Security and Privacy, San Jose, CA, USA, 22–24 May 2017. [Google Scholar]
  54. Papernot, N.; McDaniel, P.; Sinha, A.; Wellman, M.P. SoK: Security and Privacy in Machine Learning. In Proceedings of the IEEE European Symposium on Security and Privacy, London, UK, 24–26 April 2018; pp. 399–414. [Google Scholar]
  55. Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K.; Tewari, A. Learning with Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; Volume 26, pp. 1196–1204. [Google Scholar]
  56. Xiao, H.; Xiao, H.; Eckert, C. Adversarial Label Flips Attack on Support Vector Machines. In Proceedings of the European Conference on Artificial Intelligence, Amsterdam, The Netherlands, 27–31 August 2012; pp. 870–875. [Google Scholar]
  57. Biggio, B.; Nelson, B.; Laskov, P. Poisoning Attacks against Support Vector Machines. In Proceedings of the International Conference on Machine Learning, Madison, WI, USA, 26 June–1 July 2012; pp. 1467–1474. [Google Scholar]
  58. Niculescu-Mizil, A.; Caruana, R. Predicting Good Probabilities with Supervised Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 7–11 August 2005. [Google Scholar]
  59. Kull, M.; Silva Filho, T.; Flach, P. Beta Calibration: A Well-Founded and Easily Implemented Improvement on Logistic Calibration for Binary Classifiers. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
  60. Kumar, A.; Liang, P.S.; Ma, T. Verified Uncertainty Calibration. In Proceedings of the Advances in Neural Information Processing Systems, Red Hook, NY, USA, 8 December 2019. [Google Scholar]
  61. Verma, S.; Rubin, J. Fairness Definitions Explained. In Proceedings of the IEEE/ACM International Workshop on Software Fairness, New York, NY, USA, 29 May 2018. [Google Scholar]
  62. Mitchell, S.; Potash, E.; Barocas, S.; D’Amour, A.; Lum, K. Algorithmic Fairness: Choices, Assumptions, and Definitions. Annu. Rev. Stat. Its Appl. 2021, 8, 141–163. [Google Scholar] [CrossRef]
Figure 1. Research framework.
Figure 2. Models’ evaluation on training, cross-validation, and test sets.
Figure 3. XGBoost’s (base model) evaluation metrics.
Figure 4. Platt scaling-calibrated base model’s evaluation metrics.
Figure 5. Temperature scaling-calibrated base model’s evaluation metrics.
Figure 6. Isotonic regression-calibrated base model’s evaluation metrics.
Figure 7. Isotonic regression-calibrated baseline model vs. XGBoost.
Table 1. Summary statistics for ratio predictors.
Statistic | days_to_contact (days) | logins (count) | number_of_visits (count)
Min. | 0 | 0 | 3
1st Quartile | 0 | 0 | 3
Median | 0 | 1 | 3
Mean | 49 | 7 | 8.99
3rd Quartile | 37 | 3 | 7
Max. | 225 | 247 | 46
Table 2. Hyperparameter search space for chosen algorithms.
Algorithm | Hyperparameter | Values Tested
Binary Logistic Regression | L2 regularization | 0, 0.1, 0.2, 0.7
Polynomial Logistic Regression | Polynomial degree * | 2, 3, 4
Polynomial Logistic Regression | L2 regularization | 0, 0.1, 0.2, 0.7
Random Forest | Number of trees | 50, 100, 300, 500
Random Forest | Maximum depth | 3, 5, 6, 10
Random Forest | Min. entropy reduction threshold | 0, 0.05, 0.1
XGBoost | Number of trees | 50, 100, 300, 500, 700, 800
XGBoost | L2 regularization | 0, 0.1, 0.2, 0.7
* Polynomial degree is a feature-engineering choice, not a logistic regression hyperparameter; it is listed only to document the values explored during tuning.
Table 3. Mean marginal contributions of predictors (XGBoost).
Predictor | $\overline{\Delta P_{FP}}$ | $\overline{\Delta P_{FN}}$
form | 0.01 | −0.02
country_Germany | 0.02 | −0.03
country_Croatia | 0.03 | −0.04
country_Serbia | 0.14 | −0.11
country_Montenegro | 0.12 | −0.13
country_Slovenia | 0.04 | −0.05
country_Italy | 0.02 | −0.03
projects | 0.05 | −0.06
logins | 0.03 | −0.01
email (business = 1) | 0.01 | −0.02
days_to_contact | 0.02 | −0.04
number_of_visits | 0.02 | −0.03
Table 4. Mean marginal contributions of predictors (isotonic regression).
Predictor | $\overline{\Delta P_{FP}}$ | $\overline{\Delta P_{FN}}$
form | 0.01 | −0.02
country_Germany | 0.02 | −0.03
country_Croatia | 0.03 | −0.04
country_Serbia | 0.05 | −0.05
country_Montenegro | 0.04 | −0.06
country_Slovenia | 0.03 | −0.04
country_Italy | 0.02 | −0.03
projects | 0.05 | −0.06
logins | 0.03 | −0.01
email (business = 1) | 0.01 | −0.02
days_to_contact | 0.02 | −0.04
number_of_visits | 0.02 | −0.03
Table 5. Robustness analysis under noise injection, with performance metrics.
Noise Level | ECE | BS(+) | BS(−) | PPV (Germany) | PPV (Croatia) | PPV (Slovenia) | PPV (Italy) | PPV (Serbia) | PPV (Montenegro) | F2
0% | 0.06 | 0.05 | 0.05 | 0.80 | 0.77 | 0.75 | 0.73 | 0.45 | 0.42 | 0.85
10% | 0.06 | 0.05 | 0.05 | 0.79 | 0.75 | 0.74 | 0.73 | 0.44 | 0.42 | 0.83
20% | 0.07 | 0.06 | 0.06 | 0.79 | 0.75 | 0.74 | 0.72 | 0.42 | 0.41 | 0.83
30% | 0.08 | 0.06 | 0.06 | 0.77 | 0.74 | 0.73 | 0.70 | 0.42 | 0.41 | 0.82
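For reference, the sketch below shows one plausible way to compute the metrics reported in Table 5 (and, analogously, Table 7). BS(+) and BS(−) are interpreted here as class-conditional Brier scores over the positive and negative class, respectively, and all names (`y_true`, `p_pred`, `countries`, `threshold`) are illustrative rather than taken from the study's implementation.

```python
import numpy as np
from sklearn.metrics import precision_score, fbeta_score

def expected_calibration_error(y_true, p_pred, n_bins=10):
    # Equal-width-bin ECE: weighted mean |observed rate - mean confidence| per bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.digitize(p_pred, edges)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(np.mean(y_true[mask]) - np.mean(p_pred[mask]))
    return ece

def flip_labels(y, rate, rng):
    # Symmetric label noise: flip a `rate` fraction of binary labels.
    y_noisy = np.asarray(y).copy()
    flip = rng.random(len(y_noisy)) < rate
    y_noisy[flip] = 1 - y_noisy[flip]
    return y_noisy

def robustness_metrics(y_true, p_pred, countries, threshold=0.5):
    y_hat = (p_pred >= threshold).astype(int)
    metrics = {
        "ECE": expected_calibration_error(y_true, p_pred),
        "BS(+)": np.mean((p_pred[y_true == 1] - 1.0) ** 2),  # class-conditional Brier
        "BS(-)": np.mean((p_pred[y_true == 0] - 0.0) ** 2),
        "F2": fbeta_score(y_true, y_hat, beta=2),
    }
    for c in np.unique(countries):
        mask = countries == c
        metrics[f"PPV ({c})"] = precision_score(y_true[mask], y_hat[mask], zero_division=0)
    return metrics

# Usage (illustrative): for each noise level, refit and recalibrate on
# flip_labels(y_train, level, rng), then score the test-set probabilities:
#   robustness_metrics(y_test, calibrated_model.predict_proba(X_test)[:, 1], country_test)
```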
Table 6. Robustness analysis under noise injection, with features’ marginal contributions.
Predictor | $\overline{\Delta P_{FP}}$ / $\overline{\Delta P_{FN}}$ (10% Noise) | $\overline{\Delta P_{FP}}$ / $\overline{\Delta P_{FN}}$ (20% Noise) | $\overline{\Delta P_{FP}}$ / $\overline{\Delta P_{FN}}$ (30% Noise)
form | 0.02/−0.02 | 0.05/−0.09 | 0.04/−0.07
country_Germany | 0.04/−0.06 | 0.05/−0.02 | 0.06/−0.05
country_Croatia | 0.04/−0.04 | 0.01/−0.08 | 0.06/−0.07
country_Serbia | 0.04/0.07 | 0.04/0.03 | 0.08/0.03
country_Montenegro | 0.03/−0.05 | 0.09/−0.09 | 0.08/−0.09
country_Slovenia | 0.03/−0.07 | 0.04/−0.05 | 0.06/−0.09
country_Italy | 0.02/−0.07 | 0.08/−0.05 | 0.03/−0.06
projects | 0.03/−0.04 | 0.08/−0.09 | 0.05/−0.07
logins | 0.07/−0.02 | 0.07/−0.01 | 0.07/−0.03
email (business = 1) | 0.01/−0.04 | 0.03/−0.05 | 0.04/−0.05
days_to_contact | 0.05/−0.09 | 0.05/−0.05 | 0.06/−0.07
number_of_visits | 0.07/−0.05 | 0.01/−0.03 | 0.04/−0.04
Table 7. Robustness analysis under feature perturbation, with performance metrics.
Feature Perturbation Level | ECE | BS(+) | BS(−) | PPV (Germany) | PPV (Croatia) | PPV (Slovenia) | PPV (Italy) | PPV (Serbia) | PPV (Montenegro) | F2
0% | 0.06 | 0.05 | 0.05 | 0.80 | 0.77 | 0.75 | 0.73 | 0.45 | 0.42 | 0.85
10% | 0.06 | 0.05 | 0.05 | 0.79 | 0.75 | 0.74 | 0.73 | 0.44 | 0.42 | 0.84
20% | 0.07 | 0.07 | 0.07 | 0.78 | 0.74 | 0.73 | 0.72 | 0.43 | 0.41 | 0.83
30% | 0.07 | 0.07 | 0.07 | 0.77 | 0.73 | 0.72 | 0.71 | 0.42 | 0.40 | 0.81
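The feature-perturbation levels in Tables 7 and 8 can be implemented in several ways; the sketch below is one assumed variant rather than the study's exact procedure, scaling zero-mean Gaussian noise to a fraction of each ratio predictor's standard deviation while leaving the one-hot indicators untouched.

```python
import numpy as np

def perturb_features(X, level, numeric_cols, rng):
    # Add zero-mean Gaussian noise with standard deviation `level` times each
    # numeric column's standard deviation; indicator columns are not modified.
    X_pert = X.copy()
    for col in numeric_cols:
        sigma = X[col].std()
        X_pert[col] = X[col] + rng.normal(0.0, level * sigma, size=len(X))
    return X_pert

rng = np.random.default_rng(0)
numeric_cols = ["days_to_contact", "logins", "number_of_visits"]  # ratio predictors from Table 1
for level in (0.1, 0.2, 0.3):
    X_pert = perturb_features(X_test, level, numeric_cols, rng)
    p_pert = calibrated_model.predict_proba(X_pert)[:, 1]
    # Re-use robustness_metrics() from the previous sketch to reproduce the rows of Table 7.
```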
Table 8. Robustness analysis under feature perturbation, with features’ marginal contributions.
Predictor | $\overline{\Delta P_{FP}}$ / $\overline{\Delta P_{FN}}$ (10% Perturbation) | $\overline{\Delta P_{FP}}$ / $\overline{\Delta P_{FN}}$ (20% Perturbation) | $\overline{\Delta P_{FP}}$ / $\overline{\Delta P_{FN}}$ (30% Perturbation)
form | 0.03/−0.03 | 0.06/−0.08 | 0.08/−0.10
country_Germany | 0.05/−0.05 | 0.07/−0.06 | 0.09/−0.08
country_Croatia | 0.04/−0.05 | 0.06/−0.09 | 0.09/−0.11
country_Serbia | 0.05/0.06 | 0.07/0.05 | 0.10/0.06
country_Montenegro | 0.04/−0.06 | 0.08/−0.10 | 0.11/−0.12
country_Slovenia | 0.04/−0.06 | 0.07/−0.08 | 0.09/−0.11
country_Italy | 0.03/−0.06 | 0.07/−0.08 | 0.08/−0.10
projects | 0.04/−0.05 | 0.08/−0.10 | 0.10/−0.11
logins | 0.06/−0.03 | 0.08/−0.04 | 0.10/−0.06
email (business = 1) | 0.03/−0.05 | 0.05/−0.07 | 0.07/−0.09
days_to_contact | 0.05/−0.08 | 0.08/−0.09 | 0.10/−0.12
number_of_visits | 0.06/−0.05 | 0.08/−0.07 | 0.10/−0.09
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
