Article

A Comparison of Bias Mitigation Techniques for Educational Classification Tasks Using Supervised Machine Learning

1 Measurement, Evaluation, and Data Science, University of Alberta, Edmonton, AB T6G 2G5, Canada
2 Centre for Research in Applied Measurement and Evaluation, University of Alberta, Edmonton, AB T6G 2G5, Canada
* Author to whom correspondence should be addressed.
Information 2024, 15(6), 326; https://doi.org/10.3390/info15060326
Submission received: 14 May 2024 / Revised: 31 May 2024 / Accepted: 31 May 2024 / Published: 4 June 2024
(This article belongs to the Special Issue Real-World Applications of Machine Learning Techniques)

Abstract

Machine learning (ML) has become integral in educational decision-making through technologies such as learning analytics and educational data mining. However, the adoption of machine learning-driven tools without scrutiny risks perpetuating biases. Despite ongoing efforts to tackle fairness issues, their application to educational datasets remains limited. To address this gap in the literature, this research evaluates the effectiveness of four bias mitigation techniques on an educational dataset aimed at predicting students’ dropout status. The overarching research question is: “How effective are the techniques of reweighting, resampling, and Reject Option-based Classification (ROC) pivoting in mitigating the predictive bias associated with high school dropout rates in the HSLS:09 dataset?” The effectiveness of these techniques was assessed based on performance metrics including false positive rate (FPR), accuracy, and F1 score. The study focused on the biological sex of students as the protected attribute. The reweighting technique was found to be ineffective, showing results identical to the baseline condition. Both uniform and preferential resampling techniques significantly reduced predictive bias, especially in the FPR metric, but at the cost of reduced accuracy and F1 scores. The ROC pivot technique marginally reduced predictive bias while maintaining the original performance of the classifier, emerging as the optimal method for the HSLS:09 dataset. This research extends the understanding of bias mitigation in educational contexts, demonstrating practical applications of various techniques and providing insights for educators and policymakers. By focusing on an educational dataset, it contributes novel insights beyond the commonly studied datasets, highlighting the importance of context-specific approaches in bias mitigation.


1. Introduction

Over the last decade, machine learning (ML) has been pivotal in shaping human decisions across diverse domains, ranging from job applicant screening and financial credit evaluation to critical tasks like cancer screening [1]. With ML technology catering to a broad spectrum of users with different characteristics such as race, gender, and socio-economic status (SES), ensuring equal service provision has become imperative [2]. However, this equality objective is not always achieved in practice. Instances of fairness issues have emerged, such as an ML algorithm suggesting higher re-offending chances for black individuals compared to white individuals, and algorithms favoring male workers over female workers in technical job placements [3]. When a predicted outcome becomes associated with individual attributes (e.g., race, class, gender, and sexuality) that are not relevant to the prediction task, it is deemed biased. The training data may be inherently skewed, or important underlying factors such as resource access may not have been considered in the algorithm [2,4].
Similar challenges extend to the educational domain, where data-driven technologies, including learning analytics and educational data mining, are employed to assess students’ performance and adjust their learning materials [5,6]. For example, bias in educational datasets can manifest when historical data reflect existing inequalities. If a dataset used to train a model predominantly includes students from well-funded schools, the model might unfairly disadvantage students from underfunded schools by failing to account for differences in resource availability and support structures [7]. This can lead to inaccurate predictions about students’ capabilities and needs. Similarly, dropout forecasting models can exhibit bias if they rely on data that do not adequately represent students from minority ethnic groups, such as students who are Black, Indigenous, and people of color; such models may not perform equally well for these students and may lead to insufficient support and intervention for this group. Without careful consideration from educators, the use of ML-powered systems poses the risk of perpetuating biased decisions, particularly against students from marginalized groups, such as children from low SES families, Indigenous peoples, and people with visible or invisible disabilities [8].
Continued efforts to address the aforementioned fairness issues are crucial, given the pervasive influence of ML on our daily decision-making. This topic has garnered public attention, as seen in Crawford’s keynote on different sources of bias in artificial intelligence [2]. Key methodological areas aimed at promoting fairness in ML include subgroup analysis, regularization, and calibration for classification tasks, as well as fair regression for regression tasks. Much of the research on fairness in ML utilizes the same datasets, such as the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) dataset or the German financial credit dataset [9,10,11]. However, the use of datasets from the educational context remains limited. As noted by Caton and Haas [12], there is an extensive body of literature on approaches to promote fairness and mitigate bias in ML (e.g., Mehrabi et al. [10], Chouldechova [13]), but its complexity and technicality could pose challenges for non-technical audiences in education. Various toolkits for fair ML are available, but they lack the accessibility of other ML toolkits developed for researchers and practitioners with little or no coding experience.
To address the gap in the literature and practice, this study assesses the effectiveness of bias mitigation techniques using an educational dataset—specifically, the High School Longitudinal Study of 2009 (HSLS:09)—in predicting students’ high school dropout status. The evaluation primarily leverages bias mitigation tools available in the moDel Agnostic Language for Exploration and eXplanation (DALEX) package in Python [14,15,16], specifically focusing on the reweighting [16], resampling, and Reject Option-based Classification (ROC) pivot techniques [17]. Our study aims to demonstrate a practical application of these bias mitigation techniques to guide a non-technical readership (e.g., researchers, educators, and other practitioners). The overarching research question is “How effective are the reweighting, resampling, and ROC pivoting techniques in mitigating the predictive bias associated with high school dropout rates in the HSLS:09 dataset?”
This study offers the following novelties: First, while most research in bias mitigation relies on datasets from the criminal justice [9] or financial sectors [18], our focus on educational data provides novel insights into how these techniques can be adapted and applied to predict high school dropout rates, a critical issue with far-reaching implications. Second, this study offers a comprehensive evaluation and comparison of bias mitigation techniques. By applying the three techniques of reweighting, resampling, and ROC pivoting under controlled conditions (i.e., using the same dataset and predictive algorithm), we enable researchers to identify the strengths and weaknesses of each technique within a real-world educational context, thereby advancing the field of ML fairness. Third, our work addresses the accessibility challenges in ML fairness modules. Current source code and documentation are often unsuitable for non-technical audiences [10]. By translating complex ML fairness methodologies into practical guidelines for non-technical stakeholders, such as educators and policymakers, our study makes these methodologies more accessible. We provide an application-focused evaluation of the bias mitigation tools available in the DALEX package, emphasizing their applicability and effectiveness in an educational setting. This practical orientation not only bridges the gap between technical research and practical implementation but also empowers educators to make informed decisions to enhance fairness in educational outcomes.
The contribution of our study is twofold. From a methodological perspective, this study showcases the capability of the three bias mitigation techniques on a relatively novel dataset. On the practical side, our findings will inform educators and ML researchers about the effectiveness of the three curated techniques and guide their choice of bias mitigation tools.

2. Literature Review

2.1. Fairness in Machine Learning: Definition, Causes of Bias, and Dilemmas

2.1.1. Definition

As a sub-branch of ML, supervised ML refers to the use of predictive algorithms to learn the relationship between features and the target variable from labeled training data. For supervised ML models, fairness refers to the absence of systematic bias in predictive algorithms that renders their result more or less sensitive to specific subpopulations [19]. In educational settings, biased predictions influenced by an unfair ML model can lead to harmful consequences such as denial of school admission or undeserved placement in remedial lessons [1,19]. The definition of fairness in ML is nuanced, with variations based on different perspectives. Here, we will focus on three perspectives: legal, mathematical, and data.
From a legal standpoint, fairness entails the absence of disparate treatment (direct discrimination) and the prevention of disparate impact resulting from seemingly neutral actions (indirect discrimination) to individuals based on their protected characteristics such as race or gender [11]. From a mathematical standpoint, fairness is characterized by the equality of the percentage of positive predictions across demographic subgroups, emphasizing demographic parity. This ensures that the distribution of predictions closely aligns with the demographics of the intended population [4,20]. From the data standpoint, fairness involves achieving parity in the distribution of observed features (i.e., information representations), outcomes (i.e., predictions), scores (i.e., prediction performance), and decisions (i.e., consequences) [21].
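As a concrete illustration of the mathematical standpoint, demographic parity for a binary classifier can be stated as requiring the rate of positive predictions to be equal across subgroups of a protected attribute. The formulation below is a standard textbook expression of this idea rather than a definition taken verbatim from the cited sources, with prediction $\hat{Y}$ and protected attribute $A$:

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b) \quad \text{for all subgroups } a, b$$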
We selected these three standpoints because they collectively address the entirety of the ML process. The definitions of fairness from mathematical and data standpoints encompass the data collection, preprocessing, analysis, and prediction stages. Simultaneously, the legal standpoint addresses fairness in terms of the consequences and applications of machine learning results. This holistic approach ensures a comprehensive understanding and consideration of fairness throughout the entire ML life cycle. Drawing upon these three standpoints, fairness in the context of this study refers to the parity of representations, predictions, and decisions across different demographic groups. Ideally, a fair ML model should achieve both between-group fairness (i.e., group level) and within-group fairness (i.e., individual level) by ensuring independence between the performance of the model and irrelevant characteristics of individuals [4,13,22].

2.1.2. Causes of Bias

Causes of systematic bias in ML are rooted throughout all components of the ML life cycle [10]. In the data production and management phase, the use of biased data may lead to biased predictions, analogous to the “garbage in, garbage out” concept, where poor-quality data produce poor-quality output [10,23]. Specifically, bias may be introduced during the data collection process, resulting in training data that do not reflect the characteristics of the intended population (i.e., population bias) [10]. Sampling bias, rating bias (e.g., conformity in answering survey items), or language barriers encountered when collecting data from populations that speak different languages could contribute to population bias. Even with proper data acquisition, bias could be introduced if data are inappropriately handled prior to analysis, especially for missing data. Vulnerable groups may be more reluctant to provide information that is sensitive to them, which causes missing values in the dataset used for training and testing ML algorithms [24]. Inappropriately deleting or imputing such values may improve the model’s performance at the expense of potential unfairness, as information or the lack thereof is “imposed” onto such populations [22,24,25].
During the prediction phase, algorithmic bias emerges from sources such as the choice of a predictive model, algorithm optimization, or feature selection [7,10]. The choice of a model can affect its interpretations and uses, and an inappropriate choice could lead to biased results. Rather than focusing solely on model performance, the decision for model selection should be guided by factors such as the nature of the dataset (i.e., size and quality), computational resources, and the purpose of prediction [26]. For example, when the dataset is not large enough, using a neural network predictor may yield less accurate results than the random forest algorithm [27]. When researchers want to prioritize interpretability over accuracy, a decision tree algorithm may be a better choice compared to a complex algorithm like random forest [28]. After choosing a model to use, researchers need to appropriately optimize their model via hyperparameter tuning to build an effective model; otherwise, the model may underperform and yield biased results [7,29]. Subsequently, researchers need to use appropriate feature selection methods to filter redundant variables from the model and consequently improve its performance [30]. Important variables identified from the feature selection method could aid in the interpretation of the results, which in turn inform decisions made by users and stakeholders of the model [30].
The last phase of the ML life cycle is user interaction. Here, a new form of bias may arise: emergent bias [10,31]. This kind of bias occurs from the emergence of new knowledge and the mismatch between users and the predictive system [31]. Decisions made from ML results are dynamic, meaning that they change upon human interaction with the environment [32]. Specifically, human observation of historical data in the environment guides their decision, which subsequently affects the understanding and knowledge of others in that environment, creating a feedback loop [32]. ML models serve as a component in the environment. When new knowledge not accounted for in the predictive model emerges (e.g., COVID-19), decisions made from the model may lose credibility and produce bias in human understanding, thereby necessitating a new predictive model [33]. Mismatches between users and predictive systems can also cause misunderstanding and unintended consequences, particularly when the prediction results are used by unintended populations with different expertise or values [31]. For example, predictive results from a model trained with student data from North American countries may not apply to students from Asia, thus reducing the credibility of the interpretations derived from the model. As bias from the user interaction phase cannot be fully controlled, this paper deals with sources of bias from the data and algorithm aspects.

2.1.3. Dilemmas

Caton and Haas [12] pointed out five key dilemmas in addressing bias and fairness issues in the real-world context. The first two dilemmas concern the technical aspects of the ML pipeline, while the remaining three concern the implementation of algorithmic fairness in real-world contexts.
First, there is a trade-off between fairness and the performance of an ML model [22,25]. Addressing disparity in an ML model involves altering the structure of the data or the mechanism of the model, which may negatively affect the performance metric of the model [34]. As the fairness solution becomes more involved (e.g., adding weights to the model while concurrently resampling data of the minority group for model training), the performance metrics of the model could considerably decrease [34]. However, a reduction in the performance metric may be acceptable if the original performance relied on statistical disparity in the first place [22].
Second, the notion of fairness at the between-group level (i.e., group fairness) and within-group level (i.e., individual fairness) is different, and prioritizing fairness at one level may worsen it at the other level [12]. Despite having an equal representation of all population subgroups in the data, individuals with outliers or unique characteristics may still be wrongly classified [22]. This challenge could be attributed to the difference between the emphasis on the social norm and the emphasis on individual differences in the same way as the bias–variance trade-off; that is, choosing the norm may regard outliers as noise, while prioritizing individual-level data may result in overfitting as the model becomes too sensitive [22,35,36].
Third, a context-free, universally fair model is challenging to achieve, given the dynamic nature of human context [19,37]. For example, imagine an ML model designed to predict creditworthiness for loan approvals. If it is a context-free, universally fair model, it should make fair decisions across different demographic groups without favoring or discriminating against any particular background. However, this is inherently challenging to achieve because certain groups, such as refugees or minority populations, may have historically faced economic disadvantages, such as limited access to education and job opportunities.
Fourth, despite the increasing accessibility of ML to non-experts, the availability of accessible fairness tools has yet to catch up and still demands a higher level of technical expertise [9,38]. This disparity in development may entail the unintentional insensitive use of ML technologies as non-expert developers implement ML into the decision-making process without paying adequate attention to the fairness issue [12].
Finally, achieving robust fairness in real-world applications is challenging due to the uniqueness of real-world data. Much research in algorithmic fairness relies on the same datasets, such as the COMPAS dataset [9,11]. However, each real-world dataset is unique, so algorithms proven robust on these datasets may become less effective when implemented on other datasets. Ideally, one should exhaustively try as many combinations as possible, but that would be impractical [39]. Consequently, performance discrepancy should be expected.

2.2. Potential Harms and Consequences from Lack of Fairness in ML

When researchers’ oversight leads to implementing a biased predictive model, several types of harm can happen to stakeholders associated with decisions made from the model. Barocas et al. [1] and Crawford [2] outline the potential harms from biased prediction results as follows: (1) allocation harms—the harm that occurs when resources are withheld from specific population groups; (2) quality-of-service harms—the harm that occurs when the system does not perform equally well across population groups; (3) stereotyping harms—the harm that occurs when the system provides suggestions that stem from stereotyped ideas; and (4) erasure harms—the harm that occurs when the system regards outlier information of minority populations as trivial or non-existent. These types of biases and harms are most prevalent among vulnerable and underrepresented populations [40].
Aside from the aforementioned harms, some pitfalls could occur when ML is applied to solve social problems as outlined by Selbst et al. [41]. First, the solutionism trap could occur when researchers believe in technology-driven solutions and fail to account for context, such as rural communities with low internet accessibility [42]. Second, the ripple effect trap occurs when users fail to consider unintended consequences when using ML to inform their decisions. For example, a teacher may rely on learning analytics results without discussing their class performance with students [41]. Third, the formalism trap could occur when researchers rely on formal performance metrics such as statistical parity and fail to recognize contextual problems [41]. For example, English scores between students from English-speaking and non-English-speaking countries may not be directly comparable. Fourth, the portability trap could occur when researchers reuse a model trained with data from one context to a population in a different context [43]. For example, an ML algorithm trained with data from education students may not be appropriate for use with engineering students. Finally, the framing trap could occur when researchers fail to capture the big picture of the social problem in their prediction [41]. For example, in assessing recidivism in datasets such as COMPAS, the mental health or SES of the population should be considered in addition to criminal history [43].
Researchers’ or users’ oversight regarding the aforementioned harms and traps could lead to unintended consequences throughout the machine learning life cycle. For example, while researchers may not directly collect discriminative data such as race, religion, and gender, they may collect closely correlated information such as medical history, immigration history, or disability status [44]. Such variables can still be inferred back to the discriminative variables and ultimately lead to biases and harm. In cases where sensitive data are withheld by individuals, imputing such data for completion may exacerbate the unfair treatment of underprivileged groups, as they may receive higher likelihoods of unfavorable predictions [25]. Examples of unintended consequences in the operational phase of ML algorithms include the 2019 case of racial bias in an algorithm predicting health care risks, where the prediction favored white patients over black patients in recommending additional health care options [45], and the case of the COMPAS algorithm, where black defendants were more likely to receive false-positive predictions of recidivism than white defendants, while white defendants received false-negative predictions more frequently [46]. The discussed examples show that predictive algorithms could have dire consequences if fairness is overlooked.

2.3. The Fairness Module in DALEX

To ensure the fairness of an ML model, we need to ensure that our model performs classification tasks equally well across subgroups. This study relies on algorithms from the fairness module in the DALEX package in Python [14], including a fairness check algorithm and three different bias mitigation algorithms. Here, we will provide a technical overview of how these algorithms work by using an example with two classification classes (denoted with class values of 0 and 1). In the classification model, the target variable (denoted as y) represents the ground truth, and the predicted value (denoted as ŷ) represents the probability that a case will be predicted to belong to class 1.

2.3.1. Fairness Check

The “fairness_check” algorithm in the DALEX package is employed to examine model-level fairness. This algorithm requires researchers to convert a classification model, features, and target variable into a fairness object to extract the values of y and ŷ [47]. Then, researchers need to specify several parameters as follows: (1) a protected attribute such as race or gender (denoted as “protected”), (2) its privileged subgroup in the protected attribute such as white or male (denoted as “privileged”), (3) a cutoff threshold for the probabilistic output of a classifier (denoted as “cutoff”), and (4) a parameter to define an acceptable range of the fairness score (denoted as “epsilon” or ϵ). The epsilon parameter ranges from 0 to 1; the higher the value, the more stringent the fairness criterion.
The “fairness_check” algorithm compares five performance metrics of each subgroup against those of the privileged group, namely true positive rate (TPR), false positive rate (FPR), accuracy (ACC), positive predictive value (PPV), and statistical parity (STP), against the acceptable fairness score (ϵ) range [48,49]. Specifically, as shown in Equation (1), it examines whether the ratio between the metric score of a subgroup i (metric_i) and the metric score of the privileged subgroup (metric_privileged) falls within the acceptable range of ϵ and 1/ϵ:
$$\epsilon < \frac{metric_i}{metric_{privileged}} < \frac{1}{\epsilon} \qquad \text{(1)}$$
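For readers who wish to reproduce this step, the snippet below is a minimal sketch of running the fairness check with DALEX in Python. It assumes a fitted scikit-learn classifier clf, test data X_test and y_test, and a vector sex_test holding the protected attribute; these names are illustrative rather than taken from the study’s code.

```python
import dalex as dx

# Wrap the fitted classifier and test data in a DALEX explainer,
# then build a fairness object for the protected attribute.
explainer = dx.Explainer(clf, X_test, y_test, label="decision tree")
fobject = explainer.model_fairness(protected=sex_test, privileged="male")

# Check whether each subgroup-to-privileged metric ratio lies in (epsilon, 1/epsilon).
fobject.fairness_check(epsilon=0.8)

# Visualize TPR, ACC, PPV, FPR, and STP across subgroups.
fobject.plot()
```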
After obtaining the fairness check results, the five fairness metrics across subgroups can be visualized to provide a comprehensive understanding of the results. Values of the five metrics can be compared across multiple subgroups (e.g., young-male, old-female, or black-female). Alternatively, metric differences between subgroups can be summarized into a single metric parity loss value, which indicates the degree of metric differences across subgroups. This metric parity loss value can be calculated as in Equation (2), summing over all subgroups i:
$$metric_{parity\ loss} = \sum_{i \in \{a, b, \ldots, z\}} \left| \log\!\left( \frac{metric_i}{metric_{privileged}} \right) \right| \qquad \text{(2)}$$
This mathematical function accommodates both positive and negative metric differences before summing the log ratios to form a single value for comprehensive representation. The more significant the difference in metrics between subgroups, the higher the parity loss value will be. Researchers can also compare the parity loss value with model performance to examine the fairness and performance trade-offs.
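To make Equation (2) concrete, the toy computation below sums the absolute log ratios of a single metric (FPR) for two hypothetical subgroups against a privileged group; the numbers are invented purely for illustration.

```python
import numpy as np

# Hypothetical FPR values for the privileged group and two other subgroups.
fpr = {"privileged": 0.10, "group_a": 0.17, "group_b": 0.08}

# Equation (2): sum the absolute log ratios of each subgroup to the privileged group.
fpr_parity_loss = sum(
    abs(np.log(fpr[g] / fpr["privileged"])) for g in fpr if g != "privileged"
)
print(round(fpr_parity_loss, 3))  # larger values indicate larger disparities
```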
An important point worth noting is that the default ϵ value to define the acceptable range in Equation (1) is 0.8, which adheres to the four-fifths rule that is used to determine bias in the personnel selection process [14]. This value is widely adopted in assessing impact across population subgroups [50,51]. However, the justification for this rule of thumb is unclear, and it has been subject to scrutiny [50,52,53]. Bobko and Roth [50] noted that the adverse impact of subgroup differences is a function of several factors, such as sample size and effect size, thus making the four-fifths rule of thumb too deterministic. In certain scenarios, the interplay of factors such as data availability, the nature of selected features, and the computational capacity of the researcher may result in the model metrics falling outside the acceptable range calculated from the four-fifths rule. Consequently, relying solely on this rule might incorrectly indicate the model’s lack of fairness without due consideration of other influential elements [53]. Researchers can adjust the cutoff point based on other contextual evidence such as predictive validity evidence (e.g., accuracy, recall, area under curve [AUC]), differential validity evidence (e.g., class imbalance), or relative fairness among study conditions [52]; for example, if the model reports moderate accuracy, researchers could consider lowering the threshold to accommodate the performance gap [52]. In other words, if the chosen cutoff is overly idealistic, there is a risk of erroneously labeling models as unfair without considering their context.
To address the four-fifths cutoff issue, we have developed a Python function that systematically assesses the fairness of a model across a range of ϵ values rather than relying on a single threshold. This function serves as an extension of the “fairness_check” function in DALEX, incorporating additional parameters such as the starting and ending values for ϵ, along with an increment parameter that facilitates step-wise adjustments of the ϵ value during the fairness evaluation. For instance, if provided with a starting ϵ value of 0.71, an ending ϵ value of 0.8, and an increment of 0.01, the function iteratively conducts fairness checks with ϵ values ranging from 0.71 to 0.80 and returns the results in ten batches. This function aims to determine the initial ϵ value within a specified range that identifies the model as unfair through an exhaustive search. This approach allows researchers to identify a practical cutoff for the metric disparity, thereby enhancing the model’s fairness at the threshold where unfairness emerges. The rationale is that relying on a rule-of-thumb value may lead to prematurely labeling a moderately performing model as unfair and discarding it, overlooking the potential for improvement with a more realistic standard. Incremental enhancements to such a model can serve as a stepping stone toward a more refined version, allowing practical utilization while avoiding unrealistic expectations for a perfect model that may not be attainable.
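A minimal sketch of such a step-wise search is shown below. It assumes the fairness object’s result attribute holds the subgroup-to-privileged metric ratios, as in recent DALEX versions; the function name and details are illustrative rather than the authors’ exact implementation.

```python
import numpy as np

def stepwise_fairness_check(fobject, start, end, increment):
    """Return the first epsilon in [start, end] at which any metric ratio
    falls outside the acceptable range (epsilon, 1/epsilon)."""
    # Assumed: fobject.result holds metric ratios relative to the privileged group.
    ratios = fobject.result[["TPR", "ACC", "PPV", "FPR", "STP"]]
    for eps in np.round(np.arange(start, end + increment / 2, increment), 4):
        flagged = (ratios < eps) | (ratios > 1 / eps)
        if flagged.to_numpy().any():
            return eps  # the model is first labeled unfair at this epsilon
    return None  # the model passes the check across the whole range

# Example: sweep epsilon from 0.71 to 0.80 in steps of 0.01.
threshold = stepwise_fairness_check(fobject, 0.71, 0.80, 0.01)
```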

2.3.2. Bias Mitigation Techniques

Once bias is identified within the model, various strategies can be employed to mitigate it. This study focuses on bias mitigation techniques tailored for classification models. Specifically, we introduce three techniques from the fairness submodule within the DALEX package: reweighting, resampling, and ROC Pivot [14]. These three techniques involve adjusting the predictive algorithm, refining the dataset, and modifying the prediction results.

Reweighting

The reweighting method calculates weights for each combination of protected subgroups and predicted classes (e.g., female-positive, female-negative, male-positive, and male-negative) to mitigate statistical disparity in model training [14,54]. Sample weight, in general, is a parameter that modifies decision boundaries in a classification task for equal representation among subgroups [55]. Underrepresented samples are likely to receive more weight, while over-represented samples are likely to receive less weight [55]. This technique is incorporated into an ML pipeline as a data preprocessing procedure. The reweighting function requires inputs of protected variables and the target variable. The assumption behind this technique is that the protected variables are independent of the target variable, meaning that bias in the results happened because of the under- or over-representation of specific subgroups [54]. By assigning weights to each combination of protected subgroup and predicted class in the model training process, metric discrepancies among subgroups can be mitigated [14,54]. The steps in computing weights for each mentioned combination are illustrated in Figure 1 [14].
For each unique subgroup and class combination in the target variable, the weight W_sc is calculated by applying Equation (3) as follows:
$$W_{sc} = \frac{X_s \cdot X_c}{count_y \cdot X_{sc}} \qquad \text{(3)}$$
where X_s is the number of observations in the subgroup, X_c is the number of observations in the class, X_sc is the number of observations where both the subgroup and class conditions are met, and count_y is the total number of observations in the target variable y.
The classifier then applies the sample weights to mitigate bias by factoring them into training. This technique applies only to classifiers that can incorporate sample weights. To summarize, the weight of an observation is calculated as the anticipated probability of encountering an instance with its protected variable value and assigned class, assuming no correlation between the two variables, divided by its observed probability [54]. When the expected probability matches the observed probability within a specific subgroup and class combination, the representation of that class is deemed unbiased [54].
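Concretely, the weights can be obtained with the reweight function from the DALEX fairness submodule and passed to any classifier that accepts sample weights. The sketch below assumes training arrays X_train and y_train and a protected vector sex_train; these names are illustrative.

```python
from dalex.fairness import reweight
from sklearn.tree import DecisionTreeClassifier

# Compute one weight per observation from its subgroup-class combination (Equation (3)).
weights = reweight(sex_train, y_train)

# Retrain a classifier that supports sample weights, factoring the weights into training.
clf_reweighted = DecisionTreeClassifier(random_state=42)
clf_reweighted.fit(X_train, y_train, sample_weight=weights)
```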

Resampling

In a scenario where applying sample weights to the classifier is not possible, the resampling function can be used instead to modify the representativeness of each subgroup and predicted class combination in the dataset [14,54]. This method is also a part of the data-preprocessing pipeline. Like the reweighting function, the protected and target variables are included as input and assumed to be independent. In addition, the resampling function requires the specification of the resampling strategy in the “type” argument. It computes the desired number of observations through resampling with replacement to achieve a sample size that may be numerically non-discriminatory [54]. This method applies to all classifiers because it does not involve modifying the classifying algorithm itself.
The steps in modifying the sample size of the dataset are described in Figure 2 as follows [14]: First, the weight for each unique subgroup and predicted class combination is computed using the reweighting function. This is a crucial step, as it determines the proportion of each subgroup and class in the resampled data. Next, the expected sample size for each subgroup and class combination is calculated by multiplying the observed sample size by the weight and rounding the result to the nearest integer.
Once the expected sample size is obtained, we compare it with the actual (or observed) sample size of that particular subgroup and class combination. This comparison leads to three possible scenarios. First, if the expected size equals the observed sample size, we include all observations corresponding to that subgroup and class in the final resampled sample pool. This means that the original sample size was already optimal for that subgroup and class. Second, suppose the expected size is less than the observed sample size. In that case, the algorithm performs undersampling, which helps to prevent the overrepresentation of specific subgroups or classes in the resampled data. Third, if the expected size exceeds the observed sample size, the algorithm performs oversampling to ensure that underrepresented subgroups or classes are adequately represented in the resampled data. The type of undersampling or oversampling depends on the resampling strategy specified. Through this process, we ensure that the resampled data are balanced and representative of the original data, thereby mitigating the predictive bias of the ML model.
The resampling function offers two strategies: uniform random resampling and preferential resampling [54]. For uniform random resampling, the function randomly resamples cases from the current combination with replacement until the expected size is reached. In the preferential resampling process, the function begins by sorting the cases in each subgroup based on their ŷ values. The sorting is performed in ascending order for cases predicted to belong to class 0 (i.e., a negative prediction). Conversely, the sorting is performed in descending order for cases predicted to belong to class 1 (i.e., a positive prediction).
Following the sorting process, the function compares the expected sample size with the observed sample size, leading to two possible scenarios: First, if the expected sample size is less than the observed sample size, the function performs preferential undersampling. Within the two classes, it retains the first ‘expected number’ of cases from the sorted list and discards the rest once the desired sample size is reached. Due to how the cases were sorted, the function will retain the cases with higher probabilities of belonging to the class of interest. For example, it will retain cases with the smallest ŷ values in class 0 and the largest ŷ values in class 1. In essence, the function works to keep fewer samples, with a preference towards ones with higher probabilities of belonging to the class of interest.
Second, if the expected sample size exceeds the observed sample size, the function performs preferential oversampling. For the class value of 0, it duplicates the sorted cases from the top until the required sample size (expected) is filled in the resampled dataset. For the class value of 1, it duplicates the sorted cases from the bottom until the required sample size (expected) is filled in the resampled dataset. In summary, this function adjusts the sample size based on the computed weights and the configured sampling strategy to achieve the desired representation of each subgroup and predicted class combination to mitigate bias within the model. In essence, the function works to duplicate more samples, favoring ones with higher probabilities of belonging to the class of interest.
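The sketch below illustrates how both resampling strategies might be invoked through the resample function referenced above, assuming a training set X_train and y_train, a protected vector sex_train, and a fitted DALEX explainer whose y_hat attribute holds the predicted probabilities; exact argument names may vary across DALEX versions.

```python
from dalex.fairness import resample
from sklearn.tree import DecisionTreeClassifier

# Uniform random resampling: indices drawn with replacement so that each
# subgroup-class combination reaches its expected size.
idx_uniform = resample(sex_train, y_train, type="uniform")

# Preferential resampling: additionally uses the predicted probabilities, so
# cases predicted most confidently into their class are kept or duplicated first.
idx_pref = resample(sex_train, y_train, type="preferential", probs=explainer.y_hat)

# Retrain the classifier on each resampled dataset.
clf_uniform = DecisionTreeClassifier(random_state=42).fit(
    X_train.iloc[idx_uniform], y_train.iloc[idx_uniform])
clf_pref = DecisionTreeClassifier(random_state=42).fit(
    X_train.iloc[idx_pref], y_train.iloc[idx_pref])
```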

ROC Pivot

Unlike the reweighting or resampling methods performed during the preprocessing phase, the ROC Pivot method is a post-processing bias mitigation method. This method modifies the predicted class of borderline cases in a classification model based on the protected subgroup variable [14,54]. As a result, the function provides a new model with modified ŷ probabilities. Specifically, the function shifts the predicted class of privileged and protected cases that lie close to the cutoff from positive to negative and vice versa [56]. This function assumes that the target variable’s positive class (1) indicates a favorable outcome. In scenarios where the positive class carries a negative interpretation (e.g., cancer, dropout, and recidivism), the interpretation of the results may be reversed. As a result, researchers may need to adjust the configuration of the function accordingly in a classification task where the positive class indicates a negative outcome. The rationale behind this function is to reduce the false prediction rate that may occur with borderline cases by pivoting their classified label to the other side; in this case, the classification border is considered a low-confidence region.
This method operates similarly to the cost-sensitive classification method, where the consequences of misclassifying protected groups are much more detrimental than misclassifying privileged groups [57]. For example, students with unfavorable backgrounds may face harsher consequences than students with favorable backgrounds when they are predicted to have a high likelihood of dropping out and are treated as such. These repercussions could include unnecessary remedial classes, exclusion from advanced programs, and the burden of additional tuition costs. This situation could also lead to self-doubt about their subject knowledge. Conversely, students from more favorable backgrounds, with moderate to high resources and academic performance, may not experience these challenges to the same extent. To create a modified model with modified ŷ probabilities, this function needs the following inputs: the original model, the protected variable, the privileged subgroup in the protected variable, and the pivoting threshold in the “cutoff” and “theta” arguments [14,54]. The cutoff value is the probabilistic threshold of the classifier, while the theta value indicates the radius of the low-confidence region around the probabilistic threshold.
The steps in pivoting the predicted classes of observations are described in Figure 3 as follows [14]: First, the function determines the predicted probabilities (ŷ) in the original ML model. Next, a low-confidence region is established using the cutoff and theta values. For instance, with a cutoff of 0.5 and a theta of 0.01, the low-confidence interval would be 0.5 ± 0.01, or (0.49, 0.51). This interval represents the range of probabilities considered uncertain or low confidence. The function then examines each case. For cases with a privileged attribute and a favorable prediction (class 1), it pivots the probability to the other side of the cutoff, so these cases are now predicted as class 0 despite initially being predicted as class 1. Similarly, for cases with an underprivileged attribute and an unfavorable prediction (class 0), it pivots the probability to the other side of the cutoff, so these cases are now predicted as class 1 despite initially being predicted as class 0. Finally, the function returns a model with modified ŷ probabilities, reflecting the pivoted predictions. This process ensures that the model’s predictions are more balanced and less biased towards a particular class or attribute.
It is important to note that although the default values for the cutoff (0.5) and theta (0.01) are provided, there are no predefined guidelines for selecting these parameters. A smaller theta narrows the pivot region and potentially leads to more targeted adjustments, while a larger theta makes the pivot region broader and may cover more low-confidence predictions. The choice of these values depends on the characteristics of the dataset and the requirements of the researcher. In summary, this function adjusts the ŷ of a classification model to mitigate its bias within the low-confidence region.
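The sketch below shows how the ROC pivot might be applied to a fitted DALEX explainer using the roc_pivot function referenced above; the privileged label and threshold values are chosen for illustration, and a copy of the explainer is pivoted so the original predictions remain available for comparison.

```python
import copy
from dalex.fairness import roc_pivot

# Pivot low-confidence predictions (within cutoff ± theta) across the
# decision threshold for the relevant subgroup-prediction combinations.
explainer_pivoted = roc_pivot(
    copy.deepcopy(explainer),  # keep the original explainer intact
    protected=sex_test,
    privileged="male",
    cutoff=0.5,
    theta=0.01,
)
# explainer_pivoted now carries the modified predicted probabilities (y_hat).
```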
Table 1 presents a critical comparison of the three selected bias mitigation techniques. The reweighting technique operates by applying sample weights to mitigate predictive bias in an ML algorithm [14]. Given its operating principle, the reweighting technique requires minimal changes to an ML workflow, as sample weights can be added through additional arguments when fitting a predictive model. However, different ML algorithms require different arguments to apply the technique, making it challenging to implement uniformly across a project that involves several predictive models [55]. The resampling technique addresses this challenge by modifying the dataset to increase the representativeness of the underrepresented subgroup. This operating principle allows the technique to be applied consistently across different machine learning algorithms. However, modifying the dataset through resampling may result in a dataset that is not representative of the actual population. The third technique, ROC Pivot, operates by modifying the predicted class of borderline cases to mitigate predictive bias in cases that may be false positives or false negatives. This technique is minimally invasive, as it alters only the predictions of borderline cases rather than the predictive algorithm or the dataset. However, there are no predefined guidelines for selecting the low-confidence region (set by cutoff and theta), so applying the technique effectively relies on trial and error.

3. Present Study

This study broadens the understanding of fairness in ML by applying it to the field of education. Specifically, we focus on the HSLS:09 dataset, demonstrating the application of three bias mitigation techniques: reweighting, resampling, and ROC pivot to a classification algorithm that predicts high school dropout rates. The overarching research question is: “How effective are the reweighting, resampling, and ROC pivoting techniques in mitigating the predictive bias associated with high school dropout rates in the HSLS:09 dataset?” To answer this question, we will augment the dataset to mitigate the class imbalance between dropout and non-dropout students. An ML classifier will be tuned for optimal prediction performance, and predictive bias will be assessed based on the biological sex of the students (i.e., male and female). We categorize the bias mitigation techniques into four conditions: (1) uniform resampling, (2) preferential resampling, (3) reweighting, and (4) ROC pivot. These conditions, including the baseline model, will be compared based on their performance metrics (i.e., TPR, PPV, FPR, ACC, and STP) and metric parity loss. In addition, we will compare the classifiers from all conditions in terms of performance to examine the trade-off between fairness and performance. This study aims to highlight the effectiveness of bias mitigation techniques employed in an educational dataset, providing insights for educators and ML researchers in their practice.

4. Methods

4.1. Dataset and Data Preprocessing

The HSLS dataset utilized in this study is derived from a longitudinal examination of 9th-grade students in the United States [58]. We utilized the Python programming language as the primary tool for data preprocessing, classification model training–testing, and bias detection and mitigation [59]. Out of the extensive HSLS:09 dataset featuring 4014 variables, we identified 15 predictors based on Tinto’s theoretical dropout model [60]. These predictors have been empirically validated as being influential in predicting high school dropout rates [61]. Additionally, students’ sex was included as the grouping variable for predictive bias assessment, resulting in a total of 16 predictors and 1 target variable as outlined in Appendix A. The initial sample comprised 16,136 individuals (N = 14,132 non-dropout students and N = 2004 dropout students). To address this class imbalance, we employed a hybrid resampling technique. Specifically, we implemented the Synthetic Minority Oversampling Technique for nominal and categorical data (SMOTE-NC) and random undersampling (RUS) to increase minority cases and decrease majority cases [62,63]. The SMOTE-NC algorithm oversampled the minority (dropout) class to 80% of the majority class size, and the RUS algorithm then undersampled the majority class to match the resulting number of minority cases. Consequently, the final sample size after data augmentation was N = 22,610, with N = 11,305 for each class of the outcome variable.
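The sketch below shows one way to implement this hybrid augmentation with the imbalanced-learn package, assuming X and y hold the selected predictors and the dropout indicator and cat_idx lists the positions of the categorical predictors; the sampling ratios are assumptions chosen to be consistent with the class sizes reported above.

```python
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler

# Oversample the dropout (minority) class to 80% of the majority class size.
smote_nc = SMOTENC(categorical_features=cat_idx, sampling_strategy=0.8,
                   random_state=42)
X_over, y_over = smote_nc.fit_resample(X, y)

# Undersample the non-dropout (majority) class to match the minority class.
rus = RandomUnderSampler(sampling_strategy=1.0, random_state=42)
X_balanced, y_balanced = rus.fit_resample(X_over, y_over)
```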
The train–test split was performed, allocating 50% of the data to each of the training and test datasets. This equal proportion is essential because the DALEX explainer, which is necessary for assessing the predictive bias of a classification model, is constructed with the testing dataset [14]. However, specific bias mitigation techniques, such as reweighting and resampling, involve retraining the model with weights and samples computed using ŷ probabilities derived from the testing dataset during the pre-mitigation phase [14,54]. If the training dataset does not match the testing dataset, discrepancies may arise between the number of ŷ values and the number of cases in the training data, potentially confusing the algorithm. Therefore, the equal train–test split proportion must be maintained to ensure the successful operation of the bias mitigation algorithms. Consequently, applying data augmentation solely to the training dataset is not feasible. Although leaving the testing dataset imbalanced would better mirror the real-world context of the data, exact equality between the training and testing data must be maintained for the reason stated earlier.
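A minimal sketch of this split, assuming the augmented feature matrix X_balanced and labels y_balanced from the previous step:

```python
from sklearn.model_selection import train_test_split

# Allocate 50% of the augmented data to each of the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.5, random_state=42)
```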

4.2. Classification Model

We employed the decision tree classifier from the scikit-learn package [55] to develop a predictive algorithm for bias assessment. The decision to utilize this classifier was driven by its lightweight nature and compatibility with all three bias mitigation algorithms, making it practical for our study, which involves retraining the model. Specifically, the decision tree algorithm allows the manual input of sample weights with the “sample_weight” argument, while at the same time not requiring substantial computational resources for multiple iterations of training and testing the classifier [55].
To optimize the classification algorithm, we utilized the GridSearchCV function to systematically explore various combinations of hyperparameter values, aiming to identify the set that maximizes predictive accuracy through five-fold cross-validation. The hyperparameters considered included the following: maximum tree depth (max_depth) with values of 3, 5, and 7; minimum number of samples required to split an internal node (min_samples_split) with options of 2, 5, and 10; minimum number of samples required to be at a leaf node (min_samples_leaf) with choices of 1, 2, and 4; number of features to consider when searching for the best split (max_features) with options of None, sqrt, and log2; and the criterion for measuring the quality of a split (criterion), either the Gini index or entropy. Subsequently, the tuned classification model underwent training, testing, and validation using 10-fold cross-validation to account for fluctuations in predictive performance. Finally, we extracted the mean and standard deviation (SD) of key performance metrics, including accuracy, precision, recall, F1 score, mean squared error, and AUC, from 10 model fit iterations.
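A minimal sketch of this tuning and validation setup is given below, assuming augmented training data X_train and y_train; the grid mirrors the hyperparameter values listed above.

```python
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [None, "sqrt", "log2"],
    "criterion": ["gini", "entropy"],
}

# Five-fold cross-validated grid search for the most accurate decision tree.
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# Ten-fold cross-validation of the tuned model on several performance metrics.
scores = cross_validate(search.best_estimator_, X_train, y_train, cv=10,
                        scoring=["accuracy", "precision", "recall", "f1", "roc_auc"])
```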

4.3. Bias Detection and Mitigation

For the bias detection and mitigation segment, this study relies on the fairness module within the DALEX package in Python [14]. To evaluate predictive bias in the developed classification model, we utilized the “model_fairness” function, employing students’ sex as the protected variable. Specifically, male students (coded as 0) were designated as the privileged group, while female students (coded as 1) served as the protected subgroup for the bias assessment. The sample comprised 11,892 (52.6%) male and 10,718 (47.4%) female students. We employed a step-wise ϵ search method as described earlier to determine an appropriate ϵ cutoff value for bias detection. Beginning with an ϵ of 0.45 and ending with 0.65, with a stepwise increment of 0.01, the algorithm assessed bias at increments from 0.45 to 0.65, resulting in a total of 20 ϵ candidates. Five metrics (TPR, PPV, FPR, ACC, and STP) were used as absolute indices of the classifier’s predictive bias. Additionally, the parity loss of the five metrics was consulted as a relative index of predictive bias.
Following the identification of predictive bias, four types of bias mitigation methods were employed for comparison. The first condition involved the reweighting method, which applies sample weights to balance the representativeness between the under-represented and over-represented populations. Sample weights were calculated using the reweight function based on the protected variable and cases in the training dataset, and the classifier was then retrained based on these weights.
The second and third conditions involved uniform resampling and preferential resampling methods. The uniform resampling method randomly oversamples the under-represented population and undersamples the over-represented population. In contrast, the preferential resampling method selectively adds or removes cases based on the strength of their predictive results to balance the representation between the under-represented and over-represented populations. The resample function was utilized for both methods by specifying the protected variable and the original training dataset. For uniform resampling, the function was used in its default settings. However, for preferential resampling, the type argument of the resample function was set to preferential. Additionally, the resampling probability was determined by ŷ, which was derived from the DALEX explainer of the model. The resampling method produced a dataset to retrain the classification model and mitigate its predictive bias.
The fourth condition involved the ROC pivot method, which modifies the prediction results of borderline cases based on a predetermined low-confidence region. Cases within this low-confidence region are assumed to be potentially false predictions (i.e., false positives or false negatives) due to their borderline characteristics. The roc_pivot function was utilized to modify the predicted class of borderline cases. This function operated based on the original ŷ from the DALEX explainer, the specified protected variable, and cutoff and theta values. A cutoff of 0.5 and a theta of 0.05 were used to establish the low-confidence region for modifying the predicted class of cases. The conditions were compared across the five metrics and metric parity loss. Additionally, the model performance (accuracy and F1 score) of each condition was examined to assess the fairness–performance trade-off. AUC was not included in the model performance comparisons, as it is not sensitive to cutoff modification from the bias mitigation algorithms [14].
The selection of these four bias mitigation algorithms is due to their availability under the DALEX package, which allows the workflow to be streamlined under the same environment for ease of implementation. For example, results from the baseline condition and the reweighting condition can be compared by using the functions available under the DALEX ecosystem [14]. This workflow is best implemented under the DALEX environment due to their technical compatibility.

5. Results

5.1. Classification Results

The hyperparameter tuning process yielded the optimal combination of hyperparameters as 'criterion' = 'entropy', 'max_depth' = 5, 'max_features' = None, 'min_samples_leaf' = 1, and 'min_samples_split' = 2. The mean and SD of classification results from 10 models are as follows: mean accuracy = 0.8429 (SD = 0.0084); mean precision = 0.9038 (SD = 0.0077); mean recall = 0.7641 (SD = 0.0151); mean F1 score = 0.8281 (SD = 0.0104); mean AUC = 0.9042 (SD = 0.0069). These classification metrics indicated satisfactory results, suggesting that the model can reliably predict students’ dropout history. High accuracy, precision, F1 score, and AUC indicated the classifier’s ability to accurately predict the outcome variable classes in students, particularly in true-positive instances [64]. However, the model exhibited lower recall than other metrics, implying that, despite its high precision value, it may struggle to identify all dropout cases. In other words, while the model may not capture every dropout case, its predictions of dropout instances are reliable [55,64].

5.2. Bias Detection and Mitigation

5.2.1. Bias Detection Results

The step-wise ϵ search algorithm identified ϵ = 0.53 as the threshold for detecting bias within the decision tree model. Hence, the five predictive bias metrics must fall within the range of (0.53 to 1.887) to indicate an unbiased model. Table 2 presents the metrics for all conditions, including the baseline model. Upon implementing the bias detection algorithm, it was discovered that the baseline classification model exhibited bias in the FPR metric (0.529412). However, the remaining four metrics of the baseline model were considered unbiased, with TPR = 0.8955, ACC = 0.9928, PPV = 1.0256, and STP = 0.7682. Figure 4 visually represents the scores of these five bias metrics for the baseline model, highlighting their dynamics within the condition. The figure shows that FPR stands out as the most biased metric, followed by STP. The bias detection results of the baseline model indicate a higher likelihood of incorrectly predicting female cases as dropout students compared to their male counterparts as evidenced by the high FPR [64]. This interpretation is consistent with the high STP value, suggesting that the model predicts a higher dropout rate for females than males, albeit not reaching the critical region of predictive bias [52]. Furthermore, the parity loss metric reveals a similar dynamic among the five metrics. Table 3 and Figure 5 illustrate that the parity loss of FPR (0.636) is the highest, followed by the parity loss of STP (0.264) in the baseline model.

5.2.2. Bias Mitigation Results

Four bias mitigation methods were employed to address the identified bias in the baseline model. The reweighting method computed weights for each combination of protected subgroups and predicted classes as outlined in Table 4. The assigned weights for each combination are as follows: the female-negative group has a sample weight of 0.9942; the female-positive group has a sample weight of 0.9937; the male-positive group has a sample weight of 1.0057; and the male-negative group has a sample weight of 1.0065. As indicated in Table 2, the bias detection results revealed bias in the FPR (0.529412) of the reweighted model. In fact, this condition’s metric ratio and parity loss values mirror those of the baseline model, suggesting that the reweighting method did not produce significant changes in predictive bias. This outcome may be due to the minimal differences in weights assigned (i.e., a 0.01 difference in weight between the privileged and protected subgroups), indicating that the method had only a marginal impact on the model’s predictions. This observation is further supported by Figure 6 and Figure 7, showing that the model’s metrics and metric parity loss exhibited identical patterns to those of the baseline model.
In the second condition, uniform resampling successfully mitigated predictive bias in the classification model. The model, re-trained with the uniformly resampled dataset, exhibited the following metrics: TPR = 1.0247, ACC = 0.9934, PPV = 0.8571, FPR = 1.0566, and STP = 1.0546. All five metrics remained within the (0.53, 1.887) range, indicating no predictive bias. Compared to the baseline condition, the uniform resampling method drastically reduced predictive bias, particularly in FPR, as illustrated in Figure 6. The figure indicates that the FPR ratio of this condition exceeded the 1.0 boundary, while the FPR ratio of the baseline model remained below 1. The metric parity loss, shown in Table 3 and Figure 7, revealed substantially smaller parity loss in TPR (0.024), FPR (0.055), and STP (0.053). However, there was an increase in the PPV parity loss (0.154) compared to the baseline model, suggesting a relatively higher number of true positive cases in the female subgroup. This condition's ACC parity loss (0.007) remained unchanged from the baseline model.
The third condition, preferential resampling, also reduced predictive bias in the classification model. The model, re-trained with the preferentially resampled dataset, exhibited the following metrics: TPR = 0.9698, ACC = 0.8833, PPV = 0.8652, FPR = 0.9944, and STP = 0.9864. As with uniform resampling, all five metrics remained within the ϵ range and exhibited drastically less predictive bias than the baseline model. This trend was also reflected in the parity loss, with smaller values in TPR (0.031), FPR (0.006), and STP (0.014). However, this condition showed higher PPV parity loss (0.145) and ACC parity loss (0.124) compared to the baseline model, potentially indicating a higher accuracy in predicting the protected subgroup due to an increased proportion of female cases in the resampled dataset.
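As a sketch of the resampling step [54], the expected size of each (subgroup, class) cell equals n_group × n_class / n; uniform resampling then randomly duplicates or removes cases until each cell matches its expected size, whereas preferential resampling selects the cases to duplicate or remove according to the classifier's predicted probabilities near the decision boundary. The function below illustrates the uniform variant only, with placeholder column names; it is not the exact routine used in the study.

```python
# Illustrative uniform resampling: each (subgroup, class) cell is over- or
# under-sampled toward its expected size n_group * n_class / n.
import pandas as pd

def uniform_resample(df, group_col="sex", label_col="dropout", seed=42):
    n = len(df)
    pieces = []
    for (g, c), cell in df.groupby([group_col, label_col]):
        n_g = (df[group_col] == g).sum()
        n_c = (df[label_col] == c).sum()
        expected = int(round(n_g * n_c / n))
        if expected <= len(cell):
            # undersample: keep a random subset of the cell
            pieces.append(cell.sample(n=expected, random_state=seed))
        else:
            # oversample: duplicate randomly chosen cases until the expected size is reached
            extra = cell.sample(n=expected - len(cell), replace=True, random_state=seed)
            pieces.append(pd.concat([cell, extra]))
    return pd.concat(pieces).sample(frac=1, random_state=seed)  # shuffle rows
```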
In the fourth condition, ROC pivot, predictive bias in the classification model was also reduced. After the algorithm modified the predicted class of low-confidence cases, the post-pivot model exhibited the following metrics: TPR = 0.9031, ACC = 0.9952, PPV = 1.0234, FPR = 0.5490, and STP = 0.7768, all of which fell within the non-bias range of (0.53, 1.887). As shown in Figure 6, this condition raised the FPR ratio of the baseline model from 0.529 to 0.549, just enough to clear the ϵ threshold and be considered unbiased. The pattern of metric parity loss was similar to that of the baseline model (see Table 3 and Figure 7), with TPR = 0.102, ACC = 0.005, PPV = 0.023, FPR = 0.600, and STP = 0.253. This indicates that the ROC pivot condition reduced the predictive bias of the baseline model to a level just sufficient to pass the ϵ threshold.
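A simplified sketch of the ROC pivot idea [56] is shown below: predictions whose probability lies within θ of the cutoff form the low-confidence region, and within that region protected-group cases are pivoted toward the favorable class while privileged-group cases are pivoted away from it. The group labels, cutoff, and θ values are illustrative; the dalex implementation used in this study operates on the explainer object rather than on raw predictions.

```python
# Illustrative Reject Option-based Classification (ROC) pivot on raw predictions.
import numpy as np

def roc_pivot(proba, protected, privileged="male", cutoff=0.5, theta=0.1):
    """proba: predicted probability of the positive class; protected: group labels."""
    y_pred = (proba >= cutoff).astype(int)
    low_confidence = np.abs(proba - cutoff) < theta   # cases near the decision boundary
    y_pred[low_confidence & (protected != privileged)] = 1  # pivot protected group to favorable class
    y_pred[low_confidence & (protected == privileged)] = 0  # pivot privileged group to unfavorable class
    return y_pred
```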

5.2.3. Performance Comparison between Bias Mitigation Conditions

The findings demonstrate that the uniform resampling, preferential resampling, and ROC pivot methods effectively mitigated predictive bias in the developed classifier. However, a critical consideration arises regarding the fairness–performance trade-off, wherein adjustments to the model or data may compromise predictive power [22,34]. To evaluate this trade-off, Table 5 and Figure 8 juxtapose the accuracy and F1 score of all conditions alongside the FPR parity loss (the metric initially identified as biased) as the critical comparison criterion. This comparison aims to identify the optimal bias mitigation method within the scope of this study. The reweighting condition is excluded from consideration because its results mirror those of the baseline model and it therefore remains biased. Although the uniform and preferential resampling conditions effectively reduced bias in the FPR metric, they exhibited diminished predictive performance. The former achieved an accuracy of 0.459 and an F1 score of 0.428, while the latter achieved an accuracy of 0.469 and an F1 score of 0.613 (Table 5). These outcomes suggest that the resampling methods sacrifice substantial model performance to alleviate predictive bias. Conversely, the ROC pivot method reduced model bias to just within the ϵ threshold while maintaining performance comparable to the baseline condition. Specifically, the ROC pivot condition achieved an accuracy of 0.842 and an F1 score of 0.827, which aligns with the baseline model's performance but without detected bias. This comparison indicates that the ROC pivot method stands as the optimal approach to mitigating bias within the context of this study.

6. Discussion and Conclusions

This study assessed the effectiveness of four bias mitigation techniques (reweighting, uniform resampling, preferential resampling, and ROC pivot) using the HSLS:09 dataset. Specifically, these techniques were applied to mitigate bias in a decision tree classifier predicting high school students' dropout rates. To answer the research question, the reweighting technique was found to be ineffective in reducing predictive bias in the utilized dataset, as its bias and performance metrics mirrored those of the baseline condition. This outcome may be attributed to the minimal weight differences assigned to each combination of protected subgroup and predicted class. Both uniform and preferential resampling techniques significantly reduced predictive bias, particularly in the FPR metric of the baseline model. However, these methods also diminished the classifier's predictive power, as indicated by reduced accuracy and F1 scores. In contrast, the ROC pivot technique marginally reduced the predictive bias of the baseline model to just within the acceptable ϵ range while maintaining its original performance. Hence, it emerged as the optimal technique for the studied HSLS:09 dataset. This study is notable for applying bias mitigation techniques to an educational dataset, HSLS:09, thereby contributing to the topic beyond commonly studied datasets such as COMPAS [9,10,11].
Further, this research extends our understanding of bias mitigation in an educational scenario of high school dropout, emphasizing the importance of context-specific approaches. For example, a university’s admissions office could use the ROC pivot method to refine their predictive model for identifying at-risk students, ensuring that the model remains accurate while fairly representing students from diverse backgrounds. This could help the admissions team provide targeted support to those students more effectively. Additionally, high school counselors could apply the preferential resampling technique to their predictive models for student performance. By identifying and mitigating biases in these models, counselors can ensure that interventions are equitably provided, thereby supporting students who might otherwise be overlooked.
From a technical standpoint, this study showcases the usage, advantages, and disadvantages of the selected bias mitigation techniques. While all techniques were successfully applied, the ease of implementation varied. The decision tree classifier was straightforward to implement as documented in the scikit-learn package [55]. However, techniques such as reweighting may pose challenges: the point at which sample weights are supplied differs across libraries, so weights that scikit-learn's decision tree accepts through its fit method may need to be passed differently in packages such as XGBoost [65] or Keras [66]. Depending on the algorithm employed, adjustments to the code may be necessary, requiring additional considerations during implementation. In contrast, resampling methods are more straightforward, involving the creation of a new training dataset without modifying the classification algorithm itself. However, as observed in this study, they may adversely impact model performance. To mitigate such effects, ensemble classifiers such as model averaging or stacking could be employed [67]. Despite its effectiveness, the ROC pivot method presents its own challenges: researchers must determine suitable cutoff and theta combinations, requiring careful exploration to minimize the low-confidence region while still mitigating predictive bias. Nevertheless, while each technique has its complexities, all are generally straightforward to use. Researchers are encouraged to experiment with different techniques to identify the most suitable ones for their specific study context. A minimal example of passing reweighting output to scikit-learn is shown below.
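As a concrete illustration of the point about sample weights, the snippet below shows weights being passed to scikit-learn's decision tree through the fit method rather than the constructor; the data and weight values are placeholders rather than the study's actual reweighting output.

```python
# Passing (placeholder) reweighting weights to a scikit-learn decision tree via fit().
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=42)
train_weights = np.ones(len(y_train))  # replace with per-observation weights from reweighting

clf = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=42)
clf.fit(X_train, y_train, sample_weight=train_weights)
```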
From a practical standpoint, the findings of this study offer insights for researchers, educators, and policymakers regarding the efficacy of each bias mitigation algorithm. In our study, the ROC pivot technique emerged as the most effective bias mitigation method in the utilized dataset by striking a balance between fairness and performance. However, it is important to note that it still exhibits a higher FPR disparity than other non-biased conditions. Educators relying on this classifier should be mindful of this discrepancy and factor it into their decision-making processes. For instance, educators could seek additional evidence to support or refute the classification result when a student is predicted as a dropout by the ROC pivot condition. This approach strengthens the rationale behind their decisions and acknowledges potential flaws in predictive ML models rather than dismissing them outright.
The mentioned flaw of high FPR raises ethical considerations that researchers or educators should consider when utilizing bias mitigation techniques in actual implementation. For example, a model with a high false positive rate could be offset by implementing additional support measures for students flagged as at risk. Rather than immediately acting on the prediction, educators could conduct further assessments, such as interviews or performance reviews, to confirm the accuracy of the prediction before making any significant decisions. This approach helps prevent the misallocation of resources and the potential stigmatization of students who may have been incorrectly identified as at risk.
Educators can leverage the benefits of predictive ML models while exercising caution by focusing on specific aspects of prediction, such as TPR or FPR. Furthermore, our findings underscore the trade-off between fairness and performance. Adhering strictly to the four-fifths rule as recommended in the literature would limit effective bias mitigation to the resampling family of techniques, given their ability to drastically reduce predictive bias. However, these techniques can come at the cost of decreased model performance, making them impractical in specific contexts like ours. Given the variability in dataset contexts and modeling approaches across studies, adopting a universal cutoff may not be the most pragmatic approach. Instead, researchers should tailor their bias mitigation strategies to suit the specific characteristics of their dataset and model. This flexible approach ensures that bias is effectively addressed without compromising model performance unnecessarily.

Limitations and Future Directions

There are several noteworthy limitations in this study. First, an unconventional data-splitting proportion was used: a train–test split ratio of 50/50 rather than more common ratios such as 70/30 or 80/20. While larger training datasets generally lead to more accurate predictions, a conventional split ratio was not feasible due to the algorithms used. Second, to maintain equal class proportions between the training and testing datasets, the testing dataset was augmented along with the training dataset. This approach may distort the representation of the testing dataset relative to real-world scenarios. In the context of this study, outcome variables such as high school dropout are typically imbalanced because dropout is a rare event [68,69]. However, augmenting only the training dataset would make its structure differ from that of the testing dataset, because the data augmentation algorithm alters the structure of the original dataset [62]. Because the bias mitigation procedure in the DALEX package streamlines training the original model on the training dataset, assessing its bias on the testing dataset, and re-training the model on the training dataset with additional bias mitigation features, such as sample weights (from the reweighting technique) or a resampled dataset (from the resampling technique), the testing dataset cannot be left structurally different from the training dataset without confusing the algorithm [14]. This limitation implies that the bias mitigation algorithms could not handle an imbalanced dataset with the same effectiveness as a balanced one, a point that could inform future improvements to the fairness module in DALEX. Third, the classifier employed in this study is basic, lacking the sophistication of more powerful predictive algorithms such as random forests [27]. While the chosen classifier served the study's purpose, employing more advanced models could yield superior predictive results. These limitations underscore the need for further research and improvement in bias mitigation techniques. Addressing these challenges could enhance the effectiveness and applicability of bias mitigation algorithms.
While acknowledging these limitations, it is essential to note that this study's primary focus was not on creating a realistic dataset by preserving the original distribution of the testing data, or on utilizing high-performance classifiers such as boosted trees or random forests to achieve the highest possible predictive accuracy. Instead, the study aims to demonstrate the application of bias mitigation techniques within a specific context. However, we recognize these limitations to guide future research. It would be beneficial for future studies to explore the consequences of employing recommendations from bias-mitigated ML models to inform real-world decisions. Since bias is defined mathematically in this study, it would be intriguing to investigate how individuals integrate information derived from bias mitigation algorithms into their decision-making processes to reduce bias in practical contexts. Moreover, future research could compare the efficacy of traditional cutoffs, such as the four-fifths rule, with customized cutoffs, as demonstrated in this study. This comparative analysis could shed light on which approach is more effective in practice, thereby informing best practices in bias mitigation. It is essential to recognize that every dataset is unique, and the effectiveness of bias mitigation techniques may vary depending on the specific context. Therefore, we encourage future research to apply the bias mitigation techniques utilized in this study, as well as other available techniques, to datasets from other domains. This study does not universally discredit the reweighting method or assert that resampling always drastically reduces classifier performance. Instead, it aims to illustrate the functioning of each method. Therefore, researchers are encouraged to use a trial-and-error approach to identify the most suitable method for their particular context, considering the unique characteristics of their dataset and model. This iterative process will contribute to advancing our understanding of bias mitigation techniques and their practical implementation.

Author Contributions

Conceptualization, T.W. and O.B.; methodology, T.W.; data curation, T.W. and O.B.; writing—original draft preparation, T.W., J.X.L., E.M. and O.B.; writing—review and editing, T.W. and O.B.; visualization, T.W.; supervision, O.B.; resources, T.W., J.X.L., E.M. and O.B. All authors have read and agreed to the published version of the manuscript.

Funding

This study received no funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available through the National Center for Education Statistics website (accessed on 6 January 2023): https://nces.ed.gov/surveys/hsls09/. The corresponding author can make the codes available upon request.

Conflicts of Interest

The authors declare no conflicts of interest in the writing of the manuscript.

Abbreviations

The following abbreviations are used in this manuscript:
ACC: Accuracy
AUC: Area Under Curve
COMPAS: Correctional Offender Management Profiling for Alternative Sanctions
DALEX: Model Agnostic Language for Exploration and Explanation
FPR: False Positive Rate
HSLS:09: High School Longitudinal Study of 2009
ML: Machine Learning
PPV: Positive Predictive Value
ROC: Reject Option-based Classification
RUS: Random Undersampling
SD: Standard Deviation
SES: Socio-economic Status
SMOTE-NC: Synthetic Minority Oversampling Technique for nominal and categorical data
STP: Statistical Parity
TPR: True Positive Rate

Appendix A. List of Utilized Variables

Type | Variable Code | Variable Name
Continuous | X1SES | Students' socio-economic status composite score
Continuous | X1MTHEFF | Students' mathematics self-efficacy
Continuous | X1MTHINT | Students' interest in fall 2009 math course
Continuous | X1SCIUTI | Students' perception of science utility
Continuous | X1SCIEFF | Students' science self-efficacy
Continuous | X1SCIINT | Students' interest in fall 2009 science course
Continuous | X1SCHOOLBEL | Students' sense of school belonging
Continuous | X1SCHOOLENG | Students' school engagement
Continuous | X1SCHOOLCLI | Scale of school climate assessment
Continuous | X1COUPERTEA | Scale of counselor's perceptions of teacher's expectations
Continuous | X1COUPERCOU | Scale of counselor's perceptions of counselor's expectations
Continuous | X1COUPERPRI | Scale of counselor's perceptions of principal's expectations
Continuous | X3TGPA9TH | Students' GPA in ninth grade
Categorical | S1HROTHHOMWK | Hours spent on homework/studying on a typical school day
Categorical | X1MOMEDU | Mother's/female guardian's highest level of education
Categorical | X1DADEDU | Father's/male guardian's highest level of education
Categorical | X1SEX | What is [your 9th grader name]'s sex?

References

  1. Barocas, S.; Hardt, M.; Narayanan, A. Fairness and Machine Learning: Limitations and Opportunities; MIT Press: Cambridge, MA, USA, 2023. [Google Scholar]
  2. Crawford, K. The Trouble with Bias. 2017. Available online: https://www.youtube.com/watch?v=fMym_BKWQzk (accessed on 13 May 2024).
  3. Shin, T. Real-Life Examples of Discriminating Artificial Intelligence. Towards Data Science. 2020. Available online: https://towardsdatascience.com/real-life-examples-of-discriminating-artificial-intelligence-cae395a90070 (accessed on 13 May 2024).
  4. Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, Cambridge, MA, USA, 8–10 January 2012; pp. 214–226. [Google Scholar] [CrossRef]
  5. Chen, G.; Rolim, V.; Mello, R.F.; Gašević, D. Let’s shine together!: A comparative study between learning analytics and educational data mining. In Proceedings of the Tenth International Conference on Learning Analytics & Knowledge, Frankfurt, Germany, 23–27 March 2020; pp. 544–553. [Google Scholar] [CrossRef]
  6. Gardner, J.; O’Leary, M.; Yuan, L. Artificial intelligence in educational assessment: ‘Breakthrough? or buncombe and ballyhoo?’. J. Comput. Assist. Learn. 2021, 37, 1207–1216. [Google Scholar] [CrossRef]
  7. Baker, R.S.; Hawn, A. Algorithmic Bias in Education. Int. J. Artif. Intell. Educ. 2022, 32, 1052–1092. [Google Scholar] [CrossRef]
  8. Akgun, S.; Greenhow, C. Artificial intelligence in education: Addressing ethical challenges in K-12 settings. AI Ethics 2022, 2, 431–440. [Google Scholar] [CrossRef] [PubMed]
  9. Lepri, B.; Oliver, N.; Letouzé, E.; Pentland, A.; Vinck, P. Fair, transparent, and accountable algorithmic decision-making processes: The premise, the proposed solutions, and the open challenges. Philos. Technol. 2018, 31, 611–627. [Google Scholar] [CrossRef]
  10. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. 2022, 54, 1–35. [Google Scholar] [CrossRef]
  11. Pessach, D.; Shmueli, E. Algorithmic fairness. In Machine Learning for Data Science Handbook; Rokach, L., Maimon, O., Shmueli, E., Eds.; Springer International Publishing: Cham, Switzerland, 2023; pp. 867–886. [Google Scholar] [CrossRef]
  12. Caton, S.; Haas, C. Fairness in machine learning: A survey. ACM Comput. Surv. 2023, 56, 1–38. [Google Scholar] [CrossRef]
  13. Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 2017, 5, 153–163. [Google Scholar] [CrossRef]
  14. Baniecki, H.; Kretowicz, W.; Piatyszek, P.; Wisniewski, J.; Biecek, P. dalex: Responsible machine learning with interactive explainability and fairness in Python. J. Mach. Learn. Res. 2021, 22, 1–7. [Google Scholar]
  15. Wiśniewski, J.; Biecek, P. Hey, ML Engineer! Is Your Model Fair? Available online: https://docs.mlinpl.org/virtual-event/2020/posters/11-Hey_ML_engineer_Is_your_model_fair.pdf (accessed on 30 May 2024).
  16. Mashhadi, A.; Zolyomi, A.; Quedado, J. A Case Study of Integrating Fairness Visualization Tools in Machine Learning Education. In Proceedings of the Extended Abstracts of the 2022 CHI Conference on Human Factors in Computing Systems (CHI EA ′22), New York, NY, USA, 30 April–5 May 2022. [Google Scholar] [CrossRef]
  17. Baniecki, H.; Kretowicz, W.; Piatyszek, P.; Wisniewski, J.; Biecek, P. Module Dalex.Fairness. 2021. Available online: https://dalex.drwhy.ai/python/api/fairness/ (accessed on 30 May 2024).
  18. Mohanty, P.K.; Das, P.; Roy, D.S. Predicting daily household energy usages by using Model Agnostic Language for Exploration and Explanation. In Proceedings of the 2022 OITS International Conference on Information Technology (OCIT), Bhubaneswar, India, 14–16 December 2022; pp. 543–547. [Google Scholar] [CrossRef]
  19. Binns, R. Fairness in machine learning: Lessons from political philosophy. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency; Friedler, S.A., Wilson, C., Eds.; Proceedings of Machine Learning Research (PMLR): New York, NY, USA, 2018; Volume 81, pp. 149–159. Available online: https://proceedings.mlr.press/v81/binns18a.html (accessed on 30 May 2024).
  20. Srivastava, M.; Heidari, H.; Krause, A. Mathematical notions vs. Human perception of fairness: A descriptive approach to fairness for machine learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2459–2468. [Google Scholar] [CrossRef]
  21. Mitchell, S.; Potash, E.; Barocas, S.; D’Amour, A.; Lum, K. Algorithmic fairness: Choices, assumptions, and definitions. Annu. Rev. Stat. Its Appl. 2021, 8, 141–163. [Google Scholar] [CrossRef]
  22. Feldman, M.; Friedler, S.A.; Moeller, J.; Scheidegger, C.; Venkatasubramanian, S. Certifying and removing disparate impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; pp. 259–268. [Google Scholar] [CrossRef]
  23. Kilkenny, M.F.; Robinson, K.M. Data quality: “Garbage in–garbage out”. Health Inf. Manag. J. 2018, 47, 103–105. [Google Scholar] [CrossRef]
  24. Fernando, M.; Cèsar, F.; David, N.; José, H. Missing the missing values: The ugly duckling of fairness in machine learning. Int. J. Intell. Syst. 2021, 36, 3217–3258. [Google Scholar] [CrossRef]
  25. Caton, S.; Malisetty, S.; Haas, C. Impact of imputation strategies on fairness in machine learning. J. Artif. Intell. Res. 2022, 74, 1011–1035. [Google Scholar] [CrossRef]
  26. Mahesh, B. Machine learning algorithms—A review. Int. J. Sci. Res. (IJSR) 2020, 9, 381–386. [Google Scholar] [CrossRef]
  27. Roßbach, P. Neural Networks vs. Random Forests—Does It Always Have to Be Deep Learning? 2018. Available online: https://blog.frankfurt-school.de/wp-content/uploads/2018/10/Neural-Networks-vs-Random-Forests.pdf (accessed on 13 May 2024).
  28. Li, H. Which machine learning algorithm should I use? SAS Blogs, 9 December 2017. [Google Scholar]
  29. Yang, L.; Shami, A. On hyperparameter optimization of machine learning algorithms: Theory and practice. Neurocomputing 2020, 415, 295–316. [Google Scholar] [CrossRef]
  30. Visalakshi, S.; Radha, V. A literature review of feature selection techniques and applications: Review of feature selection in data mining. In Proceedings of the 2014 IEEE International Conference on Computational Intelligence and Computing Research, Coimbatore, India, 18–20 December 2014; pp. 1–6. [Google Scholar] [CrossRef]
  31. Friedman, B.; Nissenbaum, H. Bias in computer systems. ACM Trans. Inf. Syst. 1996, 14, 330–347. [Google Scholar] [CrossRef]
  32. Dobbe, R.; Dean, S.; Gilbert, T.; Kohli, N. A broader view on bias in automated decision-making: Reflecting on epistemology and dynamics. arXiv 2018. [Google Scholar] [CrossRef]
  33. Prakash, K.B. Analysis, prediction and evaluation of COVID-19 datasets using machine learning algorithms. Int. J. Emerg. Trends Eng. Res. 2020, 8, 2199–2204. [Google Scholar] [CrossRef]
  34. Haas, C. The price of fairness—A framework to explore trade-offs in algorithmic fairness. In Proceedings of the International Conference on Information Systems (ICIS), Munich, Germany, 15–18 December 2019. [Google Scholar]
  35. Briscoe, E.; Feldman, J. Conceptual complexity and the bias/variance tradeoff. Cognition 2011, 118, 2–16. [Google Scholar] [CrossRef]
  36. Speicher, T.; Heidari, H.; Grgic-Hlaca, N.; Gummadi, K.P.; Singla, A.; Weller, A.; Zafar, M.B. A unified approach to quantifying algorithmic unfairness: Measuring individual & group unfairness via inequality indices. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 2239–2248. [Google Scholar] [CrossRef]
  37. Corbett-Davies, S.; Pierson, E.; Feller, A.; Goel, S.; Huq, A. Algorithmic decision making and the cost of fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ′17), Halifax, NS, Canada, 13–17 August 2017; pp. 797–806. [Google Scholar] [CrossRef]
  38. Veale, M.; Van Kleek, M.; Binns, R. Fairness and accountability design needs for algorithmic support in high-stakes public sector decision-making. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, Montreal, QC, Canada, 21–26 April 2018; pp. 1–14. [Google Scholar] [CrossRef]
  39. Fernández-Delgado, M.; Cernadas, E.; Barro, S.; Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 2014, 15, 3133–3181. [Google Scholar]
  40. Noble, S.U. Algorithms of Oppression: How Search Engines Reinforce Racism; New York University Press: New York, NY, USA, 2018. [Google Scholar]
  41. Selbst, A.D.; Boyd, D.; Friedler, S.A.; Venkatasubramanian, S.; Vertesi, J. Fairness and abstraction in sociotechnical systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019; pp. 59–68. [Google Scholar] [CrossRef]
  42. Morozov, E. To Save Everything, Click Here: The Folly of Technological Solutionism, 1st ed.; PublicAffairs: New York, NY, USA, 2013. [Google Scholar]
  43. Weerts, H.; Dudík, M.; Edgar, R.; Jalali, A.; Lutz, R.; Madaio, M. Fairlearn: Assessing and improving fairness of AI systems. J. Mach. Learn. Res. 2023, 24, 1–8. [Google Scholar]
  44. Veale, M.; Binns, R. Fairer machine learning in the real world: Mitigating discrimination without collecting sensitive data. Big Data Soc. 2017, 4, 205395171774353. [Google Scholar] [CrossRef]
  45. Vartan, S. Racial bias found in a major health care risk algorithm. Scientific American, 24 October 2019. [Google Scholar]
  46. Larson, J.; Mattu, S.; Kirchner, L.; Angwin, J. How we analyzed the COMPAS recidivism algorithm. ProPublica, 23 May 2016. [Google Scholar]
  47. Biecek, P.; Burzykowski, T. Explanatory Model Analysis; Chapman and Hall/CRC: New York, NY, USA, 2021. [Google Scholar]
  48. Hardt, M.; Price, E.; Srebro, N. Equality of opportunity in supervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16), Barcelona, Spain, 5–10 December 2016; pp. 3323–3331. [Google Scholar] [CrossRef]
  49. Zafar, M.B.; Valera, I.; Gomez Rodriguez, M.; Gummadi, K.P. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of the 26th International Conference on World Wide Web, Perth, Australia, 3–7 April 2017; pp. 1171–1180. [Google Scholar] [CrossRef]
  50. Bobko, P.; Roth, P.L. The four-fifths rule for assessing adverse impact: An arithmetic, intuitive, and logical analysis of the rule and implications for future research and practice. In Research in Personnel and Human Resources Management; Emerald Group Publishing Limited: Bingley, UK, 2004; pp. 177–198. [Google Scholar]
  51. Hobson, C.J.; Szostek, J.; Griffin, A. Adverse impact in black student 6-year college graduation rates. Res. High. Educ. 2021, 39, 1–15. [Google Scholar]
  52. Raghavan, M.; Kim, P.T. Limitations of the “four-fifths rule” and statistical parity tests for measuring fairness. Georget. Law Technol. Rev. 2023, 8. Available online: https://ssrn.com/abstract=4624571 (accessed on 29 May 2024).
  53. Watkins, E.A.; McKenna, M.; Chen, J. The four-fifths rule is not disparate impact: A woeful tale of epistemic trespassing in algorithmic fairness. arXiv 2022, arXiv:2202.09519. [Google Scholar]
  54. Kamiran, F.; Calders, T. Data preprocessing techniques for classification without discrimination. Knowl. Inf. Syst. 2012, 33, 1–33. [Google Scholar] [CrossRef]
  55. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  56. Kamiran, F.; Karim, A.; Zhang, X. Decision theory for discrimination-aware classification. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining, Brussels, Belgium, 10–12 December 2012; pp. 924–929. [Google Scholar] [CrossRef]
  57. Fernández, A.; García, S.; Galar, M.; Prati, R.C.; Krawczyk, B.; Herrera, F. Cost-sensitive learning. In Learning from Imbalanced Data Sets; Springer International Publishing: Cham, Switzerland, 2018; pp. 63–78. [Google Scholar] [CrossRef]
  58. National Center for Educational Statistics [NCES]. High School Longitudinal Study of 2009; NCES: Washington, DC, USA, 2016. [Google Scholar]
  59. Van Rossum, G.; Drake, F.L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, USA, 2009. [Google Scholar]
  60. Nicoletti, M.d.C. Revisiting the Tinto’s theoretical dropout model. High. Educ. Stud. 2019, 9, 52–64. [Google Scholar] [CrossRef]
  61. Bulut, O.; Wongvorachan, T.; He, S. Enhancing High-School Dropout Identification: A Collaborative Approach Integrating Human and Machine Insights. Manuscript Submitted for Publication. 2024. Available online: https://www.researchsquare.com/article/rs-3871667/v1 (accessed on 30 May 2024).
  62. He, H.; Ma, Y. (Eds.) Imbalanced learning: Foundations, Algorithms, and Applications; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2013. [Google Scholar]
  63. Islahulhaq, W.W.; Ratih, I.D. Classification of non-performing financing using logistic regression and synthetic minority over-sampling technique-nominal continuous (SMOTE-NC). Int. J. Adv. Soft Comput. Its Appl. 2021, 13, 116–128. [Google Scholar] [CrossRef]
  64. Canbek, G.; Sagiroglu, S.; Temizel, T.T.; Baykal, N. Binary classification performance measures/metrics: A comprehensive visualized roadmap to gain new insights. In Proceedings of the 2017 International Conference on Computer Science and Engineering (UBMK), Antalya, Turkey, 5–8 October 2017; pp. 821–826. [Google Scholar] [CrossRef]
  65. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16), San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  66. Chollet, F. Keras. 2015. Available online: https://keras.io (accessed on 30 May 2024).
  67. Sun, Y.; Li, Z.; Li, X.; Zhang, J. Classifier selection and ensemble model for multi-class imbalance learning in education grants prediction. Appl. Artif. Intell. 2021, 35, 290–303. [Google Scholar] [CrossRef]
  68. Barros, T.M.; SouzaNeto, P.A.; Silva, I.; Guedes, L.A. Predictive models for imbalanced data: A school dropout perspective. Educ. Sci. 2019, 9, 275. [Google Scholar] [CrossRef]
  69. Márquez-Vera, C.; Cano, A.; Romero, C.; Ventura, S. Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Appl. Intell. 2013, 38, 315–330. [Google Scholar] [CrossRef]
Figure 1. The reweighting process. Note. The reweighting process operates by applying Equation (3) to each unique combination of the subgroup and predicted class. Then, the model is re-trained with sample weights to potentially balance the representativeness of each sample subgroup.
Figure 2. The resampling process. Note. ŷ denotes the probability of the predicted value. The expected sample size for each subgroup and class combination is calculated from sample weights. Then, the algorithm duplicates or removes cases based on the difference between the expected and the actual sample size.
Figure 3. The ROC Pivot process. (Note: Each data point represents one case. This function assumes that the target variable's positive class (1) indicates a favorable outcome. Asterisk (*) indicates cells to be pivoted in the low-confidence region).
Figure 4. The predictive bias plot of the baseline model. (Note: The bar charts represent TPR, PPV, FPR, ACC, and STP, respectively, from top to bottom. The red area indicates the critical zone where predictive bias is detected. The vertical line indicates the 1.0 boundary).
Figure 5. Radar chart of the parity loss metrics of the baseline model.
Figure 6. The predictive bias plot of all conditions. (Note: The bar charts represent TPR, PPV, FPR, ACC, and STP, respectively, from top to bottom. The red area indicates the critical zone where predictive bias is detected. The vertical line indicates the 1.0 boundary).
Figure 7. Radar chart of the parity loss metrics of all conditions. (Note: The reweighting condition has identical results to the baseline condition).
Figure 8. The visualization of model performance of all bias mitigation conditions. (Note: The reweighting condition has identical results to the baseline condition).
Table 1. Critical comparison of the three bias mitigation techniques.

Method | Advantages | Disadvantages
Reweighting | Involves merely adding weights to the machine learning algorithm; minimal changes to the existing ML workflow. | Different ML algorithms require different arguments to apply the technique.
Resampling | Can be applied consistently across different machine learning algorithms. | Potential data representativeness issue due to dataset alteration.
ROC Pivot | Does not alter the algorithm or the dataset, only the results. | There are no predefined guidelines for selecting the low-confidence region.
Table 2. The metric ratio of all bias mitigation conditions.

Condition | TPR | ACC | PPV | FPR | STP
Baseline | 0.895 | 0.992 | 1.025 | 0.529 | 0.768
Reweighting | 0.895 | 0.992 | 1.025 | 0.529 | 0.768
Uniform Resampling * | 1.024 | 0.993 | 0.857 | 1.056 | 1.054
Preferential Resampling * | 0.969 | 0.883 | 0.865 | 0.994 | 0.986
ROC Pivot * | 0.903 | 0.995 | 1.023 | 0.549 | 0.776
Note. Parameter ϵ was set to 0.53; therefore, metrics should fall within (0.53, 1.887). The reweighting condition did not produce changes from the baseline condition. Asterisk (*) indicates conditions with no detected bias.
Table 3. Parity loss of all bias mitigation conditions.

Condition | TPR | ACC | PPV | FPR | STP
Baseline | 0.110 | 0.007 | 0.025 | 0.636 | 0.264
Reweighting | 0.110 | 0.007 | 0.025 | 0.636 | 0.264
Uniform Resampling * | 0.024 | 0.007 | 0.154 | 0.055 | 0.053
Preferential Resampling * | 0.031 | 0.124 | 0.145 | 0.006 | 0.014
ROC Pivot * | 0.102 | 0.005 | 0.023 | 0.600 | 0.253
Note. Asterisk (*) indicates conditions with no detected bias.
Table 4. Assigned sample weights from the reweighting method.

Sex (Code) | Predicted Class (Code) | Sample Weight
Female (1) | Non-dropout (0) | 0.9942
Male (0) | Dropout (1) | 1.0057
Female (1) | Dropout (1) | 0.9937
Male (0) | Non-dropout (0) | 1.0065
Table 5. The model performance of all bias mitigation conditions.

Condition | FPR Parity Loss | Accuracy | F1 Score
Baseline | 0.636 | 0.841 | 0.826
Reweighting | 0.636 | 0.841 | 0.826
Uniform Resampling * | 0.055 | 0.459 | 0.428
Preferential Resampling * | 0.006 | 0.469 | 0.613
ROC Pivot * | 0.600 | 0.842 | 0.827
Note. Asterisk (*) indicates conditions with no detected bias.
