Mitigating Algorithmic Bias Through Probability Calibration: A Case Study on Lead Generation Data
Abstract
1. Introduction
2. Literature Review
2.1. Societal Bias in Algorithmic Decision-Making
2.2. Fairness Metrics
- Demographic parity demands equal probability of positive outcomes across groups, irrespective of actual outcomes [19]. Formally, demographic parity is defined as: $P(\hat{Y}=1 \mid A=a) = P(\hat{Y}=1 \mid A=b)$ for all groups $a$ and $b$, where $\hat{Y}$ is the predicted label and $A$ is the sensitive attribute. (1)
- Equalized odds represent another prevalent fairness criterion that was introduced in [16]. It requires equal true-positive and false-positive rates across groups, formally expressed as: $P(\hat{Y}=1 \mid Y=y, A=a) = P(\hat{Y}=1 \mid Y=y, A=b)$ for $y \in \{0, 1\}$. (2)
- Predictive parity focuses on calibration fairness and mandates equal predictive accuracy across groups, thus ensuring that predictions reflect true likelihoods consistently across demographic groups: $P(Y=1 \mid \hat{Y}=1, A=a) = P(Y=1 \mid \hat{Y}=1, A=b)$, i.e., equal positive predictive value across groups. (3)
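The three criteria above can all be read off per-group rates of a binary classifier. Below is a minimal numpy sketch (the function name `group_rates` and the toy arrays are illustrative, not from the paper) that returns, per group, the selection rate (demographic parity), TPR/FPR (equalized odds), and PPV (predictive parity):

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Per-group quantities behind the three fairness criteria.

    Returns {g: (selection_rate, tpr, fpr, ppv)}: selection_rate tests
    demographic parity, (tpr, fpr) equalized odds, ppv predictive parity.
    """
    out = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        sel = yp.mean()                                          # P(Yhat=1 | A=g)
        tpr = yp[yt == 1].mean() if (yt == 1).any() else np.nan  # P(Yhat=1 | Y=1, A=g)
        fpr = yp[yt == 0].mean() if (yt == 0).any() else np.nan  # P(Yhat=1 | Y=0, A=g)
        ppv = yt[yp == 1].mean() if (yp == 1).any() else np.nan  # P(Y=1 | Yhat=1, A=g)
        out[g] = (sel, tpr, fpr, ppv)
    return out

# Toy data: two groups of four prospects each.
y_true = np.array([1, 0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
rates = group_rates(y_true, y_pred, group)
```

Comparing `rates["a"]` with `rates["b"]` element-wise shows which (if any) of the three criteria a classifier satisfies on this data.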
2.3. Calibration Techniques in Binary Classification
2.3.1. Calibration Algorithms
- Platt scaling, proposed by [22], is a parametric method that transforms a model’s raw output scores (logits) into calibrated probabilities using a logistic regression. It models the relationship between a logit $z$ and the true probability as a sigmoid function: $\hat{p} = \frac{1}{1 + \exp(Az + B)}$, where the parameters $A$ and $B$ are fitted by maximum likelihood on a held-out calibration set.
- Isotonic regression, introduced by [24], is a non-parametric calibration method that fits a piecewise-constant, monotonically increasing function $f$ to transform predicted scores into calibrated probabilities. It minimizes the squared error between the calibrated probabilities and the true outcomes, subject to a monotonicity constraint: $\min_f \sum_i \big(f(z_i) - y_i\big)^2$ subject to $f(z_i) \le f(z_j)$ whenever $z_i \le z_j$.
- Temperature scaling, popularized by [21] in the context of deep learning, is a parametric method that adjusts the logits using a single scalar parameter $T > 0$ before applying the softmax function. Hence, $\hat{p} = \mathrm{softmax}(z / T)$, which in the binary case reduces to $\sigma(z / T)$; $T$ is fitted by minimizing the negative log-likelihood on a validation set.
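The three calibrators can be sketched in a few lines. The code below is illustrative only: the synthetic miscalibrated scores, the grid search for $T$, and the use of scikit-learn's `LogisticRegression` as a stand-in for Platt's fitting procedure (scikit-learn applies mild L2 regularization by default) are assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Synthetic, deliberately miscalibrated calibration split: true P(y=1|score) = score^2.
rng = np.random.default_rng(0)
scores = rng.uniform(0.01, 0.99, size=2000)
y = (rng.uniform(size=2000) < scores ** 2).astype(int)

# Platt scaling: logistic regression on the raw score, i.e. sigmoid(A*z + B).
platt = LogisticRegression().fit(scores.reshape(-1, 1), y)
p_platt = platt.predict_proba(scores.reshape(-1, 1))[:, 1]

# Isotonic regression: monotone piecewise-constant mapping, fit by least squares.
iso = IsotonicRegression(out_of_bounds="clip").fit(scores, y)
p_iso = iso.predict(scores)

# Temperature scaling (binary case): divide the logit by T, pick T minimizing NLL.
logits = np.log(scores / (1 - scores))
def nll(T):
    p = np.clip(1 / (1 + np.exp(-logits / T)), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
T = min(np.linspace(0.2, 5.0, 100), key=nll)  # simple grid search over T
```

Because the identity map is itself monotone, the isotonic fit can never have a worse squared error (Brier score) than the raw scores on the calibration data.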
2.3.2. Evaluation Metrics for Probability Calibration
2.4. Impact of Calibration on Feature Bias
3. Methodology
3.1. Dataset Characteristics
- Independent variables:
- form (nominal variable) indicates the channel through which the prospect contacted the agency, with possible values including email or contact form.
- country_of_origin (nominal variable) captures the prospect’s self-reported location, restricted to one of six countries: Serbia, Slovenia, Montenegro, Croatia, Germany, or Italy.
- email (nominal variable) contains the prospect’s email address;
- days_to_contact (ratio variable) specifies the number of days elapsed between the prospect’s first recorded visit to the agency’s website and the moment they initiated contact (via a form or email). Zero indicates a same-day visit and a contact.
- logins (ratio variable) represents how many times the prospect logged into a free account on the agency’s platform before making a formal contact (through email or contact form). A value of zero implies that the user did not log in at all.
- number_of_visits (ratio variable) specifies the number of visits to the agency’s website before making a formal contact.
- projects (nominal variable) denotes potential projects or services (e.g., SEO, web analytics, online advertising) for collaboration that the prospect stated in the contact form or email.
- Dependent variable:
- collaboration (binary variable): This variable has binary labels that reflect historical outcomes linking a prospective contact to a successful partnership. The dataset is highly imbalanced, with successful collaborations present in only 1.4% of observations.
3.2. Data Pre-Processing
- form was binary encoded as 1 when it was an email and 0 otherwise.
- email categories were classified as business (1) or private (0) based on their domain, where emails from known consumer service providers (Gmail, Outlook, Yahoo, Hotmail, AOL, iCloud, and other free email services) were coded as private (0). All other domains, including corporate, educational, government, and organizational domains, were coded as business (1). This classification was performed by extracting the domain portion of each email address and comparing it against a predefined list of consumer email providers.
- The country_of_origin variable was encoded using OneHot encoding, which created binary dummy variables for each unique country in the dataset. This transformation converted the categorical country variable into k-1 binary columns (where k represents the number of unique countries), with each column indicating the presence (1) or absence (0) of a specific country in the dataset. One country category was dropped as the reference group to avoid multicollinearity in the regression analysis [29].
- The projects variable was encoded with cardinality encoding by transforming each row’s multi-valued category set into a single integer that represents the total number of distinct categories in that row (range: 0 to 41). Cardinality encoding was chosen for the projects variable to avoid the curse of dimensionality that OneHot encoding would create (41 columns plus their combinations for multi-label data) while maintaining better interpretability than binary encoding. Binary encoding would transform project combinations into binary numbers (e.g., 101101), making it difficult to interpret the relationship between encoded values and actual project selections. Cardinality encoding preserves a meaningful business interpretation—the count of selected projects directly represents the scope and complexity of the prospect’s needs, where higher values indicate more comprehensive service requirements.
- days_to_contact (range: 0 to 225);
- logins (range: 0 to 24);
- number_of_visits (range: 3 to 746).
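The encoding steps above can be sketched with pandas as follows. The toy rows, the semicolon delimiter for multi-valued projects, and the exact consumer-domain list are illustrative assumptions, not the paper's data:

```python
import pandas as pd

# Illustrative rows only; column names follow the paper, values are made up.
df = pd.DataFrame({
    "form": ["email", "contact_form", "email"],
    "email": ["ana@gmail.com", "marko@firma.rs", "lea@outlook.com"],
    "country_of_origin": ["Serbia", "Germany", "Croatia"],
    "projects": ["SEO;web analytics", "", "SEO"],
})

# form: binary encoded as 1 for email, 0 otherwise.
df["form"] = (df["form"] == "email").astype(int)

# email: business (1) vs private (0) by matching the domain against a
# predefined consumer-provider list (abbreviated here).
CONSUMER = {"gmail.com", "outlook.com", "yahoo.com", "hotmail.com", "aol.com", "icloud.com"}
df["email"] = (~df["email"].str.split("@").str[1].isin(CONSUMER)).astype(int)

# country_of_origin: one-hot with k-1 dummies (reference category dropped).
df = pd.get_dummies(df, columns=["country_of_origin"], drop_first=True)

# projects: cardinality encoding -- count of distinct selected projects per row.
df["projects"] = df["projects"].apply(lambda s: len({p for p in s.split(";") if p}))
```

Note that `drop_first=True` drops the alphabetically first country as the reference group, matching the k-1 dummy scheme described above.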
3.3. Pipeline Evaluation Design
3.3.1. Model Selection and Hyperparameter Tuning
- Binary Logistic Regression is a linear model that applies the sigmoid function and outputs the probability of the positive class. Values 0, 0.1, 0.2, and 0.7 were tested for the regularization hyperparameter $\lambda$. Binary cross-entropy was used as the cost function, with the regularization introducing a penalty term: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big] + \lambda \sum_{j} \theta_j^2$.
- Polynomial Logistic Regression extends the linear boundary to higher-degree polynomials. Two- to four-degree polynomials were tested while keeping the values the same as in the Binary Logistic Regression.
- Random Forest is a tree-based bootstrap aggregation ensemble algorithm that makes predictions based on the hard voting of multiple decision trees [30]. The number of trees (50, 100, 300, 500), maximum depth (3, 5, 6, 10), and minimum entropy reduction threshold (0, 0.05, 0.1) were varied in the hyperparameter fine-tuning process.
- eXtreme Gradient Boosting (XGBoost) is a tree-based gradient boosting ensemble algorithm. It differs from Random Forest in two key ways: (i) it trains decision trees sequentially, minimizing the cost at each step, and (ii) instead of sampling observations with replacement under a uniform selection probability, it weights misclassified observations from previous trees more heavily. Each new tree works on residual improvements: it is trained to predict the errors (residuals) left by all previous trees combined. For example, if the ensemble so far predicts a value of 7 where the actual value is 10, the residual is 3, and the next tree is trained to predict that 3, progressively reducing the overall prediction error with each sequential tree [31]. In the hyperparameter fine-tuning process, the number of trees (50, 100, 300, 500, 700, and 800) and the L2 regularization strength (0, 0.1, 0.2, and 0.7) were tested.
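The residual-improvement idea can be illustrated with a hand-rolled boosting loop. This is a didactic sketch using scikit-learn regression trees on made-up data, not the actual XGBoost implementation: each tree is fit to the residuals left by the ensemble so far, so the training error shrinks sequentially.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)  # toy regression target

pred = np.zeros_like(y)   # the ensemble starts from a zero prediction
learning_rate = 0.5
for _ in range(20):
    residual = y - pred                                  # errors left by all previous trees
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)              # each new tree corrects those errors

mse_start = np.mean(y ** 2)          # error of the initial (zero) prediction
mse_end = np.mean((y - pred) ** 2)   # error after 20 residual-correcting trees
```

Each iteration projects the current residual onto a shallow tree, so the training error is non-increasing; XGBoost adds gradient/Hessian weighting and regularization on top of this scheme.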
3.3.2. Evaluation Metrics
- In the model selection phase, all models were assessed using the F2 score, which emphasizes recall to minimize false negatives in prospective collaborations while still considering precision: $F_\beta = (1+\beta^2)\cdot\frac{\text{precision}\cdot\text{recall}}{\beta^2\cdot\text{precision}+\text{recall}}$, with $\beta = 2$. Recent work has shown that non-decomposable metrics such as $F_\beta$ can be embedded directly into a differentiable objective, eliminating the need for surrogate losses or heuristic resampling [37].
- Probability calibration of the best-performing model from the first phase was evaluated with the expected calibration error (ECE) and stratified Brier scores (BS) for the positive and negative classes.
- In the last phase, the performance of the best-performing model from the first phase was compared with the performance of its calibrated counterpart. This was carried out using four evaluation metrics: (i) the F2 score, (ii) probability calibration metrics (ECE and stratified Brier scores), (iii) the marginal contribution of a sensitive variable to the model’s predictions, and (iv) predictive parity. Metrics (iii) and (iv) were used to evaluate the fairness of the model.
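The headline metrics above can be sketched as plain numpy functions. The equal-width binning for ECE and the function names are assumptions for illustration; the paper's exact binning scheme is not restated here.

```python
import numpy as np

def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score; beta=2 weights recall higher than precision."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def ece(y_true, p, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        m = bins == b
        if m.any():  # |avg confidence - observed frequency|, weighted by bin mass
            err += m.mean() * abs(p[m].mean() - y_true[m].mean())
    return err

def stratified_brier(y_true, p):
    """Brier score computed separately on the positive and negative classes."""
    return {"BS+": np.mean((p[y_true == 1] - 1.0) ** 2),
            "BS-": np.mean(p[y_true == 0] ** 2)}

# Tiny sanity check: 2 TP-ish predictions out of 4.
y = np.array([1, 1, 0, 0])
y_hat = np.array([1, 0, 1, 0])
f2 = f_beta(y, y_hat)  # precision = recall = 0.5, so F2 = 0.5
```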
3.3.3. Sensitive Variable Marginal Contribution
4. Results
4.1. Model Performance
- Binary Logistic Regression: It achieved moderate F2 scores across the training (0.56), cross-validation (0.53), and test (0.52) sets. Despite its simplicity, it served as a benchmark for comparing more advanced models.
- Polynomial Logistic Regression with second-degree polynomials: Introducing second-degree polynomial terms improved the F2 score to 0.58 (training set), 0.57 (cross-validation set), and 0.54 (test set). The additional terms helped capture non-linear relationships but slightly increased the risk of overfitting.
- Polynomial Logistic Regression with third-degree polynomials: With third-degree polynomial terms, the model reached an average training F2 score of 0.58, a cross-validation score of 0.56, and a test score of 0.54.
- Polynomial Logistic Regression with fourth-degree polynomials: Further increasing the polynomial complexity yielded a training score of 0.57, a cross-validation score of 0.55, and a test score of 0.51.
- Random Forest (best hyperparameters: number of trees = 100, maximum depth = 3, minimum entropy reduction = 0): Random Forest provided higher F2 scores with slightly higher variance, achieving 0.81 on the training set, 0.70 on the cross-validation set, and 0.67 on the test set. The ensemble nature of the algorithm helped stabilize predictions, although increasing the forest size or depth did not consistently yield higher scores. The stronger regularization partially mitigated the overfitting but did not fully close the gap.
- eXtreme Gradient Boosting (best hyperparameters: number of trees = 300): XGBoost emerged as the top-performing algorithm, with the highest F2 scores across all three datasets: 0.87 (training set), 0.82 (cross-validation set), and 0.80 (test set). Its sequential tree-building process appears better suited to capturing the complex patterns in the unbalanced dataset, while moderate L2 regularization prevented severe overfitting. Furthermore, XGBoost exhibited stable performance with narrow 95% confidence intervals, indicating low variance across cross-validation folds.
4.2. Base Model Evaluation
- Calibration quality: ECE and stratified Brier scores showed that, although the model discriminates well (high F2 score), its probability outputs remain overconfident. This is a common problem in gradient boosting ensemble algorithms [10]. Both class-specific Brier scores sit far above the irreducible noise for this prevalence, confirming that raw scores cannot be interpreted as reliable probabilities.
- Fairness implications: The predictive parity values shown in Figure 3 were computed as P(Y = 1 | Ŷ = 1, A = country of origin), which equals the precision (positive predictive value) for each country-of-origin subgroup as defined in Equation (3) of Section 2.2. Perfect predictive parity would require these values to be equal across all countries of origin, yet the observed values range from 0.78 (Germany) to 0.28 (Montenegro). This 0.78 → 0.28 spread reveals a systematic underperformance for Serbian and Montenegrin prospects. Two factors compounded this:
- Calibration deficit: over-confident probabilities inflate false positives disproportionately in low-base-rate countries.
- Sample-size imbalance: Serbia (372) and Montenegro (326) contribute <25% of the German sample (1123), raising variance in posterior estimates.
4.3. Post Hoc Model Calibration
4.4. Fairness and Marginal Contribution Insights
- The test set was resampled with replacement while maintaining class proportions;
- ECE and Brier scores were calculated for both uncalibrated and calibrated models;
- The difference in metrics between calibrated and uncalibrated versions was computed.
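The resampling procedure above can be sketched as follows. The function name, the within-class stratification detail, and the 95% percentile interval are assumptions consistent with the described steps, applied here to a toy example rather than the paper's data:

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_delta(y, p_uncal, p_cal, metric, n_boot=500):
    """Class-stratified bootstrap of metric(calibrated) - metric(uncalibrated)."""
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        # Resample with replacement within each class to keep class proportions.
        idx = np.concatenate([rng.choice(pos, len(pos)), rng.choice(neg, len(neg))])
        deltas[i] = metric(y[idx], p_cal[idx]) - metric(y[idx], p_uncal[idx])
    return deltas.mean(), np.percentile(deltas, [2.5, 97.5])

# Toy demonstration: a sharply calibrated model vs a constant 0.5 predictor.
brier = lambda yy, pp: np.mean((pp - yy) ** 2)
y = np.array([1] * 20 + [0] * 180)          # ~10% positive class
p_cal = np.where(y == 1, 0.9, 0.1)          # Brier 0.01 on every resample
p_uncal = np.full_like(p_cal, 0.5)          # Brier 0.25 on every resample
mean_delta, ci = bootstrap_delta(y, p_uncal, p_cal, brier, n_boot=200)
```

A negative delta with a confidence interval entirely below zero indicates the calibrated model improves the metric robustly across resamples.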
4.5. Robustness Analysis
4.5.1. Noise Injection Analysis
4.5.2. Feature Perturbation Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Russo, D.D.; Milella, F.; Di Felice, G. Fairness in Healthcare Services for Italian Older People: A Convolution-Based Evaluation to Support Policy Decision Makers. Mathematics 2025, 13, 1448. [Google Scholar] [CrossRef]
- Ueda, D.; Kakinuma, T.; Fujita, S.; Kamagata, K.; Fushimi, Y.; Ito, R.; Matsui, Y.; Nozaki, T.; Nakaura, T.; Fujima, N.; et al. Fairness of Artificial Intelligence in Healthcare: Review and Recommendations. Jpn. J. Radiol. 2024, 42, 3–15. [Google Scholar] [CrossRef] [PubMed]
- Das, S.; Donini, M.; Gelman, J.; Haas, K.; Hardt, M.; Katzman, J.; Kenthapadi, K.; Larroy, P.; Yilmaz, P.; Zafar, M.B. Fairness Measures for Machine Learning in Finance. J. Financ. Data Sci. 2021, 3, 33–64. [Google Scholar] [CrossRef]
- Akter, S.; Dwivedi, Y.K.; Sajib, S.; Biswas, K.; Bandara, R.J.; Michael, K. Algorithmic Bias in Machine Learning-Based Marketing Models. J. Bus. Res. 2022, 144, 201–216. [Google Scholar] [CrossRef]
- Rodolfa, T.K.; Lamba, H.; Ghani, R. Empirical Observation of Negligible Fairness-Accuracy Trade-Offs in Machine Learning for Public Policy. Nat. Mach. Intell. 2021, 3, 896–904. [Google Scholar] [CrossRef]
- Tien Dung, P.; Giudici, P. Sustainability, Accuracy, Fairness, and Explainability (SAFE) Machine Learning in Quantitative Trading. Mathematics 2025, 13, 442. [Google Scholar] [CrossRef]
- Goellner, S.; Tropmann-Frick, M.; Brumen, B. Responsible Artificial Intelligence: A Structured Literature Review. arXiv 2024, arXiv:2403.06910. [Google Scholar] [CrossRef]
- Fonseca, P.G.; Lopes, H.D. Calibration of Machine Learning Classifiers for Probability of Default Modelling. arXiv 2017, arXiv:1710.08901. [Google Scholar] [CrossRef]
- Ojeda, F.M.; Baker, S.G.; Ziegler, A. Calibrating Machine Learning Approaches for Probability Estimation: A Comprehensive Comparison. Stat. Med. 2023, 42, 4212–4215. [Google Scholar] [CrossRef]
- Pleiss, G.; Raghavan, M.; Wu, F.; Kleinberg, J.; Weinberger, K.Q. On Fairness and Calibration. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; Volume 30, pp. 5680–5689. [Google Scholar]
- Brahmbhatt, A.; Rathore, V.; Singla, P. Towards Fair and Calibrated Models. arXiv 2023, arXiv:2310.10399. [Google Scholar]
- Chen, I.; Johansson, F.D.; Sontag, D. Why Is My Classifier Discriminatory? In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
- Zhang, Z.; Neill, D.B. Identifying Significant Predictive Bias in Classifiers. arXiv 2016, arXiv:1611.08292. [Google Scholar]
- Barocas, S.; Selbst, A.D. Big Data’s Disparate Impact. Calif. Law Rev. 2016, 104, 671–732. [Google Scholar] [CrossRef]
- O’Neil, C. Weapons of Math Destruction; Crown Publishing Group: New York, NY, USA, 2016. [Google Scholar]
- Hardt, M.; Price, E.; Srebro, N. Equality of Opportunity in Supervised Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; Volume 29, pp. 3315–3323. [Google Scholar]
- Buolamwini, J.; Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability, and Transparency (FAT*), New York, NY, USA, 23–24 February 2018; Volume 81, pp. 77–91. [Google Scholar]
- Najibi, A.; Shu, F.; Bouzerdoum, A. Bias and Fairness in Computer Vision Applications: A Survey. IEEE Access 2021, 9, 141119–141133. [Google Scholar] [CrossRef]
- Chouldechova, A. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data 2017, 5, 153–163. [Google Scholar] [CrossRef]
- Kleinberg, J.; Mullainathan, S.; Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. arXiv 2016, arXiv:1609.05807. [Google Scholar]
- Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6–11 August 2017; pp. 1321–1330. [Google Scholar]
- Platt, J. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers; MIT Press: Cambridge, MA, USA, 1999; pp. 61–74. [Google Scholar]
- Naeini, M.P.; Cooper, G.F.; Hauskrecht, M. Obtaining Well Calibrated Probabilities Using Bayesian Binning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; pp. 2901–2907. [Google Scholar]
- Zadrozny, B.; Elkan, C. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. In Proceedings of the 18th International Conference on Machine Learning (ICML), San Francisco, CA, USA, 28 June–1 July 2001; pp. 609–616. [Google Scholar]
- Chen, Z.; Zhang, J.M.; Sarro, F.; Harman, M. A Comprehensive Empirical Study of Bias Mitigation Methods for Machine Learning Classifiers. ACM Trans. Softw. Eng. Methodol. 2023, 32, 106. [Google Scholar] [CrossRef]
- Raff, E.; Sylvester, J.; Mills, S. Fair Forests: Regularized Tree Induction to Minimize Model Bias. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA, 2–3 February 2018. [Google Scholar]
- Corbett-Davies, S.; Pierson, E.; Feller, A.; Goel, S.; Huq, A. Algorithmic Decision Making and the Cost of Fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017. [Google Scholar]
- Degordian. Creative and Digital Agency. 2025. Available online: https://degordian.com/ (accessed on 8 May 2025).
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer Series in Statistics; Springer: New York, NY, USA, 2001. [Google Scholar]
- van den Goorbergh, R.; van Smeden, M.; Timmerman, D.; Van Calster, B. The Harm of Class Imbalance Corrections for Risk Prediction Models: Illustration and Simulation Using Logistic Regression. J. Am. Med. Inform. Assoc. 2022, 29, 1525–1534. [Google Scholar] [CrossRef]
- Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A Comparative Analysis of Gradient Boosting Algorithms. Artif. Intell. Rev. 2019, 54, 1937–1967. [Google Scholar] [CrossRef]
- Caplin, A.; Martin, D.; Marx, P. Calibrating for Class Weight by Modeling Machine Learning. arXiv 2022. [Google Scholar] [CrossRef]
- Phelps, N.; Lizotte, D.J.; Woolford, D.G. Using Platt’s Scaling for Calibration After Undersampling—Limitations and How to Address Them. arXiv 2024, arXiv:2410.18144. [Google Scholar] [CrossRef]
- Pozzolo, A.D.; Caelen, O.; Johnson, R.A.; Bontempi, G. Calibrating Probability with Undersampling for Unbalanced Classification. In Proceedings of the IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa, 7–10 December 2015; pp. 159–166. [Google Scholar]
- George, B.R.; Ke, J.X.C.; DhakshinaMurthy, A.; Branco, P. The Effect of Resampling Techniques on the Performance of Machine Learning Clinical Risk Prediction Models in the Setting of Severe Class Imbalance: Development and Internal Validation in a Retrospective. Discov. Artif. Intell. 2024, 4, 1049–1065. [Google Scholar] [CrossRef]
- Welvaars, K.; Oosterhoff, J.H.F.; van den Bekerom, M.P.J.; Doornberg, J.N.; van Haarst, E.P.; OLVG Urology Consortium; the Machine Learning Consortium. Implications of Resampling Data to Address the Class Imbalance Problem (IRCIP): An Evaluation of Impact on Performance Between Classification Algorithms in Medical Data. JAMIA Open 2023, 6, ooad033. [Google Scholar] [CrossRef] [PubMed]
- Fathony, R.; Kolter, J.Z. AP-Perf: Incorporating Generic Performance Metrics in Differentiable Learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Okinawa, Japan, 16–18 April 2019. [Google Scholar]
- Marcílio, W.E.; Eler, D.M. From Explanations to Feature Selection: Assessing SHAP Values as Feature Selection Mechanism. In Proceedings of the 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Porto de Galinhas, Brazil, 7–10 November 2020; pp. 340–347. [Google Scholar]
- Rodríguez-Pérez, R.; Bajorath, J. Interpretation of Machine Learning Models Using Shapley Values: Application to Compound Potency and Multi-Target Activity Predictions. J. Comput.-Aided Mol. Des. 2020, 34, 1013–1026. [Google Scholar] [CrossRef]
- Efron, B.; Tibshirani, R.J. An Introduction to the Bootstrap; CRC Press: Boca Raton, FL, USA, 1994. [Google Scholar]
- Bouthillier, X.; Delaunay, P.; Bronzi, M.; Trofimov, A.; Nichyporuk, B.; Szeto, J.; Sepah, N.; Raff, E.; Madan, K.; Voleti, V.; et al. Accounting for Variance in Machine Learning Benchmarks. In Proceedings of the Machine Learning and Systems, New York, NY, USA, 26 April 2021; Volume 3, pp. 747–769. [Google Scholar]
- DiCiccio, T.J.; Efron, B. Bootstrap Confidence Intervals. Stat. Sci. 1996, 11, 189–228. [Google Scholar] [CrossRef]
- Vaicenavicius, J.; Widmann, D.; Andersson, C.; Lindsten, F.; Roll, J.; Schön, T.B. Evaluating Model Calibration in Classification. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), Okinawa, Japan, 16–18 April 2019; Volume 89, pp. 3459–3467. [Google Scholar]
- Hendrycks, D.; Dietterich, T. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Frénay, B.; Verleysen, M. Classification in the Presence of Label Noise: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 845–869. [Google Scholar] [CrossRef]
- Rakin, A.S.; He, Z.; Fan, D. Parametric Noise Injection: Trainable Randomness to Improve Deep Neural Network Robustness against Adversarial Attack. arXiv 2018, arXiv:1811.09310. [Google Scholar]
- Zanotto, S.; Aroyehun, S. Human Variability vs. Machine Consistency: A Linguistic Analysis of Texts Generated by Humans and Large Language Models. arXiv 2024, arXiv:2412.03025. [Google Scholar]
- Ovadia, Y.; Fertig, E.; Ren, J.; Nado, Z.; Sculley, D.; Nowozin, S.; Dillon, J.V.; Lakshminarayanan, B.; Snoek, J. Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. In Proceedings of the Advances in Neural Information Processing Systems, Red Hook, NY, USA, 8–14 December 2019. [Google Scholar]
- Sáez, J.A.; Galar, M.; Luengo, J.; Herrera, F. Analyzing the Presence of Noise in Multi-Class Problems: Alleviating Its Influence with the One-vs-One Decomposition. Knowl. Inf. Syst. 2014, 38, 179–206. [Google Scholar] [CrossRef]
- Huang, Y.; Gupta, S. Stable and Fair Classification. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
- Corbett-Davies, S.; Goel, S. The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning. arXiv 2018, arXiv:1808.00023. [Google Scholar]
- Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Red Hook, NY, USA, 4–9 December 2017; Volume 30, pp. 4765–4774. [Google Scholar]
- Carlini, N.; Wagner, D. Towards Evaluating the Robustness of Neural Networks. In Proceedings of the IEEE Symposium on Security and Privacy, San Jose, CA, USA, 22–24 May 2017. [Google Scholar]
- Papernot, N.; McDaniel, P.; Sinha, A.; Wellman, M.P. SoK: Security and Privacy in Machine Learning. In Proceedings of the IEEE European Symposium on Security and Privacy, London, UK, 24–26 April 2018; pp. 399–414. [Google Scholar]
- Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K.; Tewari, A. Learning with Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; Volume 26, pp. 1196–1204. [Google Scholar]
- Xiao, H.; Xiao, H.; Eckert, C. Adversarial Label Flips Attack on Support Vector Machines. In Proceedings of the European Conference on Artificial Intelligence, Amsterdam, The Netherlands, 27–31 August 2012; pp. 870–875. [Google Scholar]
- Biggio, B.; Nelson, B.; Laskov, P. Poisoning Attacks against Support Vector Machines. In Proceedings of the International Conference on Machine Learning, Madison, WI, USA, 26 June–1 July 2012; pp. 1467–1474. [Google Scholar]
- Niculescu-Mizil, A.; Caruana, R. Predicting Good Probabilities with Supervised Learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 7–11 August 2005. [Google Scholar]
- Kull, M.; Silva Filho, T.; Flach, P. Beta Calibration: A Well-Founded and Easily Implemented Improvement on Logistic Calibration for Binary Classifiers. In Proceedings of the Artificial Intelligence and Statistics, Lauderdale, FL, USA, 20–22 April 2017. [Google Scholar]
- Kumar, A.; Liang, P.S.; Ma, T. Verified Uncertainty Calibration. In Proceedings of the Advances in Neural Information Processing Systems, Red Hook, NY, USA, 8 December 2019. [Google Scholar]
- Verma, S.; Rubin, J. Fairness Definitions Explained. In Proceedings of the IEEE/ACM International Workshop on Software Fairness, New York, NY, USA, 29 May 2018. [Google Scholar]
- Mitchell, S.; Potash, E.; Barocas, S.; D’Amour, A.; Lum, K. Algorithmic Fairness: Choices, Assumptions, and Definitions. Annu. Rev. Stat. Its Appl. 2021, 8, 141–163. [Google Scholar] [CrossRef]
Statistic | days_to_contact (Days) | logins (Count) | number_of_visits (Count)
---|---|---|---
Min. | 0 | 0 | 3
1st Quartile | 0 | 0 | 3
Median | 0 | 1 | 3
Mean | 49 | 7 | 8.99
3rd Quartile | 37 | 3 | 7
Max | 225 | 24 | 746
Algorithm | Hyperparameter | Values Tested
---|---|---
Binary Logistic Regression | L2 regularization | 0, 0.1, 0.2, 0.7
Polynomial Logistic Regression | Polynomial degree * | 2, 3, 4
 | L2 regularization | 0, 0.1, 0.2, 0.7
Random Forest | Number of trees | 50, 100, 300, 500
 | Maximum depth | 3, 5, 6, 10
 | Min. entropy reduction threshold | 0, 0.05, 0.1
XGBoost | Number of trees | 50, 100, 300, 500, 700, 800
 | L2 regularization | 0, 0.1, 0.2, 0.7
Predictor | ||
---|---|---|
form | 0.01 | –0.02 |
country_Germany | 0.02 | –0.03 |
country_Croatia | 0.03 | –0.04 |
country_Serbia | 0.14 | –0.11 |
country_Montenegro | 0.12 | –0.13 |
country_Slovenia | 0.04 | –0.05 |
country_Italy | 0.02 | –0.03 |
projects | 0.05 | –0.06 |
logins | 0.03 | –0.01 |
email (business = 1) | 0.01 | –0.02 |
days_to_contact | 0.02 | –0.04 |
number_of_visits | 0.02 | –0.03 |
Predictor | ||
---|---|---|
form | 0.01 | –0.02 |
country_Germany | 0.02 | –0.03 |
country_Croatia | 0.03 | –0.04 |
country_Serbia | 0.05 | –0.05 |
country_Montenegro | 0.04 | –0.06 |
country_Slovenia | 0.03 | –0.04 |
country_Italy | 0.02 | –0.03 |
projects | 0.05 | –0.06 |
logins | 0.03 | –0.01 |
email (business = 1) | 0.01 | –0.02 |
days_to_contact | 0.02 | –0.04 |
number_of_visits | 0.02 | –0.03 |
Noise Level | ECE | BS(+) | BS(−) | PPV (Germany) | PPV (Croatia) | PPV (Slovenia) | PPV (Italy) | PPV (Serbia) | PPV (Montenegro) | F2
---|---|---|---|---|---|---|---|---|---|---|
0% | 0.06 | 0.05 | 0.05 | 0.80 | 0.77 | 0.75 | 0.73 | 0.45 | 0.42 | 0.85 |
10% | 0.06 | 0.05 | 0.05 | 0.79 | 0.75 | 0.74 | 0.73 | 0.44 | 0.42 | 0.83 |
20% | 0.07 | 0.06 | 0.06 | 0.79 | 0.75 | 0.74 | 0.72 | 0.42 | 0.41 | 0.83 |
30% | 0.08 | 0.06 | 0.06 | 0.77 | 0.74 | 0.73 | 0.70 | 0.42 | 0.41 | 0.82 |
Predictor | (10% Noise) | (20% Noise) | (30% Noise) |
---|---|---|---|
form | 0.02/–0.02 | 0.05/–0.09 | 0.04/–0.07 |
country_Germany | 0.04/−0.06 | 0.05/−0.02 | 0.06/−0.05 |
country_Croatia | 0.04/−0.04 | 0.01/−0.08 | 0.06/−0.07 |
country_Serbia | 0.04/0.07 | 0.04/0.03 | 0.08/0.03 |
country_Montenegro | 0.03/−0.05 | 0.09/−0.09 | 0.08/−0.09 |
country_Slovenia | 0.03/−0.07 | 0.04/−0.05 | 0.06/−0.09 |
country_Italy | 0.02/−0.07 | 0.08/−0.05 | 0.03/−0.06 |
projects | 0.03/−0.04 | 0.08/−0.09 | 0.05/−0.07 |
logins | 0.07/−0.02 | 0.07/−0.01 | 0.07/−0.03 |
email (business = 1) | 0.01/−0.04 | 0.03/−0.05 | 0.04/−0.05 |
days_to_contact | 0.05/−0.09 | 0.05/−0.05 | 0.06/−0.07 |
number_of_visits | 0.07/−0.05 | 0.01/−0.03 | 0.04/−0.04 |
Feature Perturbation Level | ECE | BS(+) | BS(−) | PPV (Germany) | PPV (Croatia) | PPV (Slovenia) | PPV (Italy) | PPV (Serbia) | PPV (Montenegro) | F2 |
---|---|---|---|---|---|---|---|---|---|---|
0% | 0.06 | 0.05 | 0.05 | 0.80 | 0.77 | 0.75 | 0.73 | 0.45 | 0.42 | 0.85 |
10% | 0.06 | 0.05 | 0.05 | 0.79 | 0.75 | 0.74 | 0.73 | 0.44 | 0.42 | 0.84 |
20% | 0.07 | 0.07 | 0.07 | 0.78 | 0.74 | 0.73 | 0.72 | 0.43 | 0.41 | 0.83 |
30% | 0.07 | 0.07 | 0.07 | 0.77 | 0.73 | 0.72 | 0.71 | 0.42 | 0.40 | 0.81 |
Predictor | (10% Noise) | (20% Noise) | (30% Noise) |
---|---|---|---|
form | 0.03/–0.03 | 0.06/–0.08 | 0.08/–0.10 |
country_Germany | 0.05/−0.05 | 0.07/−0.06 | 0.09/−0.08 |
country_Croatia | 0.04/−0.05 | 0.06/−0.09 | 0.09/−0.11 |
country_Serbia | 0.05/0.06 | 0.07/0.05 | 0.10/0.06 |
country_Montenegro | 0.04/−0.06 | 0.08/−0.10 | 0.11/−0.12 |
country_Slovenia | 0.04/−0.06 | 0.07/−0.08 | 0.09/−0.11 |
country_Italy | 0.03/−0.06 | 0.07/−0.08 | 0.08/−0.10 |
projects | 0.04/−0.05 | 0.08/−0.10 | 0.10/−0.11 |
logins | 0.06/−0.03 | 0.08/−0.04 | 0.10/−0.06 |
email (business = 1) | 0.03/−0.05 | 0.05/−0.07 | 0.07/−0.09 |
days_to_contact | 0.05/−0.08 | 0.08/−0.09 | 0.10/−0.12 |
number_of_visits | 0.06/−0.05 | 0.08/−0.07 | 0.10/−0.09 |
Nikolić, M.; Nikolić, D.; Stefanović, M.; Koprivica, S.; Stefanović, D. Mitigating Algorithmic Bias Through Probability Calibration: A Case Study on Lead Generation Data. Mathematics 2025, 13, 2183. https://doi.org/10.3390/math13132183