1. Introduction
When a data set contains an excess of zero counts relative to what a standard statistical distribution would predict, a zero-inflated model helps to analyze it. In many scenarios, count data have two sources of zeros: structural zeros and sampling zeros. Structural zeros arise when the response variable cannot take positive values because of inherent constraints or conditions, whereas sampling zeros result from random chance. The idea behind the zero-inflated model is to account for the two sources of zeros using a mixture of two separate distributions: one component characterizes the probability of structural zeros, and the other describes the count distribution, which can still generate zeros at random (the sampling zeros). This mixture mechanism allows the zero-inflated model to better represent the complexities of a data set with excess zeros. The zero-inflated Poisson (ZIP) model and the zero-inflated negative binomial (ZINB) model are two major zero-inflated models.
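As a concrete illustration of this mixture mechanism, consider the standard ZIP probability mass function, written here with the usual notation $\pi$ for the structural-zero probability and $\lambda$ for the Poisson mean (symbols introduced only for exposition):
$$
P(Y = y) =
\begin{cases}
\pi + (1 - \pi)\, e^{-\lambda}, & y = 0,\\[4pt]
(1 - \pi)\, \dfrac{\lambda^{y} e^{-\lambda}}{y!}, & y = 1, 2, \ldots,
\end{cases}
$$
so the zero class mixes structural zeros (with probability $\pi$) with sampling zeros generated by the Poisson component.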
Imbalanced data refers to a data set in which the distribution of categories is highly disproportionate. For example, in a data set with two classes, one class has significantly more instances than the other, resulting in an unequal representation of the two categories. Data sets can be inherently imbalanced in real-world scenarios for various reasons. One example of imbalanced data in medical applications is diagnosing a rare disease, where only a tiny percentage of patients have the condition. In this case, the class size of the rare event is much smaller than that of the common event.
Traditional machine learning algorithms and statistical models can struggle to handle imbalanced data. The prediction results can be biased toward the majority class, leading to poor performance on the minority class. However, the minority class is often of greater interest in many applications. The well-known drawbacks of machine learning algorithms caused by imbalanced data sets are model bias, poor generalization, and misleading evaluation metrics. Several techniques can be used to address the issue of imbalanced data, including resampling, synthetic data generation, ensemble methods, and more.
The artificial neural network (ANN) has been widely used to implement categorical classification. Rosenblatt [1] proposed a theory of perception about a hypothetical nervous system that can sense, remember, and recognize information from the physical world. Rumelhart et al. [2] discussed a learning algorithm for backpropagation ANNs. The algorithm adjusts the weights of the connections in the network to minimize the error between the actual output and the desired output. Hirose et al. [3] modified the backpropagation algorithm by varying the number of hidden units in the ANN. Their algorithm reduces the probability of becoming trapped in local minima and speeds up convergence compared to the conventional backpropagation ANN. Sietsma and Dow [4] studied the relationship between the size and structure of an ANN and its ability to generalize from a training set. Using a complex classification task, they investigated the effects of pruning and noise training on network performance.
In the previous decade, Devi et al. [5] emphasized the classification of cervical cancer using ANN methods to develop a strategy for accurately categorizing cervical cancer cases. Sze et al. [6] provided a comprehensive tutorial and survey on the efficient processing of deep neural networks. Abiodun et al. [7] surveyed ANN applications to provide a taxonomy of ANN models and give readers knowledge of current and emerging trends in ANN research. Nasser and Abu-Naser [8] addressed the detection of lung cancer using an ANN method to explore the feasibility of employing ANN technology for identifying lung cancer, a significant advancement in medical diagnostics. Muhammad et al. [9] delved into predicting pancreatic cancer using an ANN method. Umar [10] predicted student academic performance using ANN methods to forecast students' academic achievements. Shen et al. [11] introduced a novel ensemble classification model for imbalanced credit risk evaluation. The model integrates neural networks with a classifier optimization technique. The study addressed the challenge of imbalanced data sets in credit risk assessment by leveraging the strengths of neural networks and advanced optimization techniques.
Lambert [12] proposed a ZIP regression model to process count data with excess zeros. The author applied the ZIP regression model to a soldering-defects data set on printed wiring boards and compared the performance of the proposed model with other models. Hall [13] presented a case study of ZIP and zero-inflated binomial (ZIB) regression with random effects. The author showed that zero-inflated models are useful when the data contain an excess of zeros and that random effects can be used to account for the correlation between observations within clusters. Cheung [14] discussed the theoretical underpinnings of zero-inflated models, provided examples of their application, and discussed the implications of using these models for accurate inference in the context of growth and development studies. Gelfand and Citron-Pousty [15] presented how to use zero-inflated models for spatial count data within the field of environmental and ecological statistics. They explored how zero-inflated models can account for both the spatial correlation and the excess zero counts often found in environmental and ecological applications. Rodrigues [16] discussed how Bayesian techniques can be applied to estimate the parameters of zero-inflated models and provided practical examples to demonstrate the approach.
Ghosh et al. [17] provided a Bayesian analysis of zero-inflated regression models. Their Bayesian approach offered a coherent way to handle the challenges of excess zeros in count data, providing insights into parameter estimation and uncertainty quantification. Harris and Zhao [18] introduced the zero-inflated ordered probit model to handle ordered response data with excess zeros and applied the proposed method to analyze tobacco consumption data. Loeys et al. [19] explored the analysis of zero-inflated count data beyond the scope of ZIP regression. Diop et al. [20] studied the maximum likelihood estimation method in the logistic regression model with a cure fraction. The authors introduced the concept of cure-fraction models, which analyze data in which many individuals do not experience the event of interest. Staub and Winkelmann [21] studied the consistent estimation of zero-inflated count models. He et al. [22] discussed the concept of structural zeros and its relevance to zero-inflated models. Diop et al. [23] studied simulation-based inference in a zero-inflated Bernoulli regression model. This research is valuable for practitioners working with binary data exhibiting excess zeros.
Alsabti et al. [24] presented a novel decision tree classifier called CLOUDS for classification with big data. The authors showed that techniques such as discretization and data set sampling can be used to scale decision tree classifiers up to large data sets; however, both methods can cause a significant loss of accuracy. Friedman [25] introduced gradient boosting machines (GBMs), a powerful technique for building predictive models by iteratively combining weak learners. GBMs are known for their effectiveness in handling complex, high-dimensional data with robustness against overfitting. Jin and Agrawal [26] addressed the challenge of constructing decision trees efficiently in a parallel computing environment. They improved communication and memory efficiency during the parallel construction of decision trees.
Chen and Guestrin [27] developed XGBoost, a scalable tree-boosting system. XGBoost is a powerful and efficient algorithm for boosting decision trees. Ke et al. [28] introduced LightGBM, a highly efficient gradient boosting decision tree algorithm. This tool contributes substantially to machine learning, particularly boosting techniques. LightGBM has gained attention for its speed and scalability, making it suitable for handling large data sets and complex problems. Wang et al. [29] applied LightGBM to miRNA classification in breast cancer patients. Ma et al. [30] used multi-observation and multi-dimensional data cleaning methods together with the LightGBM and XGBoost algorithms for prediction. The authors concluded that the LightGBM algorithm based on the multiple-observation data set gives the best classification prediction results. Machado et al. [31] described LightGBM as an excellent decision tree gradient boosting method and applied it to predict customer loyalty within the finance industry. Minstireanu and Mesnita [32] employed the LightGBM algorithm for online click fraud detection. Daoud [33] compared three popular gradient boosting algorithms, XGBoost, LightGBM, and CatBoost, to evaluate the strengths and weaknesses of each algorithm in the context of credit risk assessment and related tasks.
Some open challenges in implementing imbalanced data analysis have been comprehensively discussed by Krawczyk [34] and Nalepa and Kawulok [35]. The two works provide insights into imbalanced data modeling. Krawczyk [34] pointed out open challenges including binary and multi-class classification problems, multi-label and multi-instance learning, and unsupervised and semi-supervised handling of imbalanced data sets. Further challenges are performing regression on skewed cases, designing well-founded procedures for learning from imbalanced data streams under stationary and drifting environments, and extending such models to large-scale and big-data cases. Nalepa and Kawulok [35] reviewed the issue of selecting training sets for support vector machines and conducted an extensive survey of existing methods for training support vector machines in big-data applications. They separated these techniques into several categories, which helps users understand the underlying ideas behind the algorithms.
Although a ZIBer regression model has been proposed in the literature, how to obtain reliable maximum likelihood estimates (MLEs) of the ZIBer regression model parameters remains a gap in this topic. We are also interested in studying the performance of the ZIBer regression model compared with machine learning classifiers, for example, the LightGBM and ANN methods. In this study, we propose an EM algorithm to obtain reliable MLEs of the ZIBer regression model parameters based on zero-inflated, imbalanced binary data. The logistic model links the event or non-event probability with explanatory variables. The innovation is the theoretical development of the EM algorithm to obtain reliable MLEs for the ZIBer regression model using imbalanced data sets. Monte Carlo simulations were conducted to show the predictive performance of the ZIBer, LightGBM, and ANN methods for binary classification under imbalanced data.
The rest of this article is organized as follows. Section 2 presents the ZIBer regression model and the proposed EM procedure that produces reliable MLEs of the model parameters. In Section 3, two examples are used to illustrate the application of the proposed EM algorithm to the ZIBer regression model. In Section 4, Monte Carlo simulations are conducted to compare the classification performance of the ZIBer model, LightGBM, and ANN methods in terms of three measures; the design of the Monte Carlo simulations and their implementation are described in this section. Some concluding remarks are given in Section 5.
2. Zero-Inflated Bernoulli Regression Model and EM Algorithm
Consider a special case about the infection status of a specific disease. Let $Y_i$ denote the infection status for the specific disease of the $i$-th individual in a sample of size $n$; $Y_i = 1$ if the individual is infected and $Y_i = 0$ otherwise. The conditional distribution of $Y_i$ can be a Bernoulli distribution, denoted by $Y_i \mid \mathbf{x}_i \sim \mathrm{Bernoulli}(p_i)$, where $p_i$ is the conditional probability of $Y_i = 1$, given a $k \times 1$ vector of explanatory variables, $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{ik})^T$. The logistic function to link $p_i$ and $\mathbf{x}_i$ can be expressed by
$$p_i = \frac{\exp(\mathbf{x}_i^T \boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i^T \boldsymbol{\beta})}, \quad i = 1, 2, \ldots, n, \qquad (1)$$
where $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_k)^T$ is the coefficient vector. Failure to account for the extra zeros in a zero-inflated model can result in biased estimates and inferences.
Let $\omega_i$ control the probability of $Y_i$ being a structural zero. $\omega_i$ can be an unconditional probability of $Y_i$ being a structural zero controlled by an unobserved factor, say $Z_i$. When $Z_i = 1$, $Y_i = 0$; otherwise, $Y_i$ can be 0 or 1, determined by the model $Y_i \mid Z_i = 0 \sim \mathrm{Bernoulli}(p_i)$, where $p_i$ is given in Equation (1). It can be shown that the unobserved factor $Z_i \sim \mathrm{Bernoulli}(\omega_i)$ for $i = 1, 2, \ldots, n$. Two situations result in $Y_i = 0$; the first situation is that $Y_i$ is a structural zero with probability $\omega_i$, and the other is the situation that $Y_i$ is not a structural zero but has the probability $1 - p_i$ to be zero. The unconditional probability of $Y_i = 0$ can be presented by
$$P(Y_i = 0) = \omega_i + (1 - \omega_i)(1 - p_i), \qquad (2)$$
where $i = 1, 2, \ldots, n$. The unconditional probability of $Y_i = 1$ can be presented by
$$P(Y_i = 1) = (1 - \omega_i)\, p_i. \qquad (3)$$
We can obtain the unconditional distribution of $Y_i$ by $P(Y_i = y_i) = \left[\omega_i + (1 - \omega_i)(1 - p_i)\right]^{1 - y_i}\left[(1 - \omega_i)\, p_i\right]^{y_i}$, $y_i = 0, 1$.
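As a quick numerical illustration (not part of the original derivation), the unconditional probability can be evaluated in R; the function name ziber_prob and the example values are ours.
```r
# Minimal sketch: unconditional ZIBer probability P(Y = y)
# y: observed response (0 or 1); p: Bernoulli probability from Equation (1);
# omega: structural-zero probability
ziber_prob <- function(y, p, omega) {
  (omega + (1 - omega) * (1 - p))^(1 - y) * ((1 - omega) * p)^y
}

ziber_prob(0, p = 0.6, omega = 0.3)  # 0.3 + 0.7 * 0.4 = 0.58
ziber_prob(1, p = 0.6, omega = 0.3)  # 0.7 * 0.6 = 0.42
```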
Using a second logistic function to link $\omega_i$ and the $m \times 1$ covariate vector of $\mathbf{w}_i = (w_{i1}, w_{i2}, \ldots, w_{im})^T$, we obtain
$$\omega_i = \frac{\exp(\mathbf{w}_i^T \boldsymbol{\gamma})}{1 + \exp(\mathbf{w}_i^T \boldsymbol{\gamma})}, \quad i = 1, 2, \ldots, n, \qquad (4)$$
where $\boldsymbol{\gamma} = (\gamma_1, \gamma_2, \ldots, \gamma_m)^T$ is the vector of model parameters. After algebraic formulation, we can show that
$$P(Y_i = 1 \mid \mathbf{x}_i, \mathbf{w}_i) = \frac{\exp(\mathbf{x}_i^T \boldsymbol{\beta})}{\left[1 + \exp(\mathbf{w}_i^T \boldsymbol{\gamma})\right]\left[1 + \exp(\mathbf{x}_i^T \boldsymbol{\beta})\right]}. \qquad (5)$$
Assume that the sample $\{(y_i, \mathbf{x}_i, \mathbf{w}_i),\ i = 1, 2, \ldots, n\}$ is available. The likelihood function can be constructed as follows:
$$L(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \prod_{i=1}^{n}\left[\omega_i + (1 - \omega_i)(1 - p_i)\right]^{1 - y_i}\left[(1 - \omega_i)\, p_i\right]^{y_i}. \qquad (6)$$
After algebraic formulation, we can obtain the log-likelihood equation,
$$\ell(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \sum_{i=1}^{n}\left\{(1 - y_i)\log\left[\omega_i + (1 - \omega_i)(1 - p_i)\right] + y_i\left[\log(1 - \omega_i) + \log p_i\right]\right\}. \qquad (7)$$
To obtain reliable maximum likelihood estimators, it could be challenging to directly maximize the log-likelihood function in Equation (7). So, we suggest an EM algorithm to maximize the log-likelihood function.
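For reference, the observed-data log-likelihood in Equation (7) can be coded directly. The sketch below is illustrative only; X and W denote assumed design matrices for the Bernoulli and structural-zero parts, and beta and gamma the corresponding coefficient vectors.
```r
# Sketch of the observed-data log-likelihood in Equation (7)
# y: binary responses; X, W: design matrices; beta, gamma: coefficient vectors
ziber_loglik <- function(beta, gamma, y, X, W) {
  p     <- plogis(X %*% beta)    # Equation (1)
  omega <- plogis(W %*% gamma)   # Equation (4)
  sum((1 - y) * log(omega + (1 - omega) * (1 - p)) + y * log((1 - omega) * p))
}
```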
Assume that an unobserved latent variable $z_i$ determines whether $Y_i$ is a structural zero or not; $z_i = 1$ implies that $Y_i$ is a structural zero, and $z_i = 0$ implies that the Bernoulli distribution determines whether the response $Y_i$ is 1 or 0. The likelihood function based on the complete data $\{(y_i, z_i, \mathbf{x}_i, \mathbf{w}_i),\ i = 1, 2, \ldots, n\}$ can be represented by
$$L_c(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \prod_{i=1}^{n} \omega_i^{z_i}\left[(1 - \omega_i)\, p_i^{y_i}(1 - p_i)^{1 - y_i}\right]^{1 - z_i}. \qquad (8)$$
The log-likelihood function can be represented by
$$\ell_c(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \sum_{i=1}^{n}\left\{z_i \log \omega_i + (1 - z_i)\left[\log(1 - \omega_i) + y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]\right\}, \qquad (9)$$
and the log-likelihood function can be decomposed as
$$\ell_c(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \ell_1(\boldsymbol{\gamma}) + \ell_2(\boldsymbol{\beta}), \qquad (10)$$
where
$$\ell_1(\boldsymbol{\gamma}) = \sum_{i=1}^{n}\left[z_i \log \omega_i + (1 - z_i)\log(1 - \omega_i)\right] \qquad (11)$$
and
$$\ell_2(\boldsymbol{\beta}) = \sum_{i=1}^{n}(1 - z_i)\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]. \qquad (12)$$
Equations (11) and (12) are used to implement the following E-step and M-step until convergence of the parameter estimates, $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\gamma}}$, is obtained.
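For later use in the M-step, the two components in Equations (11) and (12) can be coded as separate R functions. The sketch below follows the earlier log-likelihood sketch; z denotes the latent structural-zero indicators (or their current estimates), and the function names ell1 and ell2 are ours.
```r
# Equation (11): structural-zero component, depends on gamma only
ell1 <- function(gamma, z, W) {
  omega <- plogis(W %*% gamma)
  sum(z * log(omega) + (1 - z) * log(1 - omega))
}

# Equation (12): Bernoulli component, depends on beta only
ell2 <- function(beta, z, y, X) {
  p <- plogis(X %*% beta)
  sum((1 - z) * (y * log(p) + (1 - y) * log(1 - p)))
}
```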
- E-step:
Iteration $h$: Estimate $z_i$ by its posterior mean $\hat{z}_i^{(h)}$ given the estimates $\hat{\boldsymbol{\beta}}^{(h-1)}$ and $\hat{\boldsymbol{\gamma}}^{(h-1)}$. If $y_i = 1$, then $\hat{z}_i^{(h)} = 0$; otherwise,
$$\hat{z}_i^{(h)} = \frac{P(A)}{P(A) + P(B)} = \frac{\hat{\omega}_i^{(h-1)}}{\hat{\omega}_i^{(h-1)} + \left(1 - \hat{\omega}_i^{(h-1)}\right)\left(1 - \hat{p}_i^{(h-1)}\right)},$$
where $A$ and $B$ denote the events that $Y_i$ has a structural zero and that $Y_i$ is from a Bernoulli distribution, respectively, and $\hat{p}_i^{(h-1)}$ and $\hat{\omega}_i^{(h-1)}$ are obtained from Equations (1) and (4) evaluated at $\hat{\boldsymbol{\beta}}^{(h-1)}$ and $\hat{\boldsymbol{\gamma}}^{(h-1)}$.
- M-step:
Maximizing $\ell_1(\boldsymbol{\gamma})$, with $z_i$ replaced by $\hat{z}_i^{(h)}$, to obtain $\hat{\boldsymbol{\gamma}}^{(h)}$, and maximizing $\ell_2(\boldsymbol{\beta})$, with $z_i$ replaced by $\hat{z}_i^{(h)}$, to obtain $\hat{\boldsymbol{\beta}}^{(h)}$.
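Putting the two steps together, a hedged R sketch of the EM iteration is given below. It reuses the ell1 and ell2 sketches above; the zero starting values, the BFGS optimizer, and the convergence tolerance are our implementation choices rather than prescriptions from the paper.
```r
# Illustrative EM iteration for the ZIBer regression model (sketch only)

# E-step: posterior mean of z_i given the current estimates
e_step <- function(beta_hat, gamma_hat, y, X, W) {
  p     <- plogis(X %*% beta_hat)
  omega <- plogis(W %*% gamma_hat)
  z_hat <- omega / (omega + (1 - omega) * (1 - p))  # posterior structural-zero probability
  z_hat[y == 1] <- 0                                # y_i = 1 cannot be a structural zero
  as.vector(z_hat)
}

ziber_em <- function(y, X, W, tol = 1e-6, max_iter = 500) {
  beta_hat  <- rep(0, ncol(X))
  gamma_hat <- rep(0, ncol(W))
  for (h in seq_len(max_iter)) {
    z_hat <- e_step(beta_hat, gamma_hat, y, X, W)                  # E-step
    fit_gamma <- optim(gamma_hat, function(g) -ell1(g, z_hat, W),
                       method = "BFGS")                            # M-step for gamma
    fit_beta  <- optim(beta_hat, function(b) -ell2(b, z_hat, y, X),
                       method = "BFGS")                            # M-step for beta
    change    <- max(abs(c(fit_gamma$par - gamma_hat, fit_beta$par - beta_hat)))
    gamma_hat <- fit_gamma$par
    beta_hat  <- fit_beta$par
    if (change < tol) break                                        # stop when estimates stabilize
  }
  list(beta = beta_hat, gamma = gamma_hat, z = z_hat, iterations = h)
}
```
Given a binary response y and design matrices X and W, a call such as `ziber_em(y, X, W)` returns the estimated coefficient vectors and the final posterior structural-zero probabilities.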
3. Examples
3.1. Example 1
A diabetes data set is used for illustration. The data set can be obtained from the R package "mlbench" (version 2.1-3.1). There are 768 records with eight independent variables, including:
The number of pregnancies.
The glucose concentration in the 2-h oral glucose tolerance test.
The diastolic blood pressure in mm Hg.
The triceps skin-fold thickness in millimeters.
Two-hour serum insulin in mu U/mL.
Body mass.
Diabetes pedigree function.
Age in years.
The response variable is Diabetes, which can be negative or positive. After removing incomplete records, 392 records remain in the final sample for illustration. We label the independent variables by $x_{i1}, x_{i2}, \ldots, x_{i8}$, $i = 1, 2, \ldots, 392$. Moreover, the variables "The diastolic blood pressure in mm Hg", "Body mass", and "Age" are selected as the independent variables, labeled by $w_{i1}$, $w_{i2}$, and $w_{i3}$, to develop the second logistic regression model. The diabetes rate is 0.332 in this example.
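For readers who wish to reproduce the data preparation, a hedged R sketch is given below. We assume the data set is PimaIndiansDiabetes2 in the mlbench package (the version with missing values coded as NA); the paper does not name the object explicitly, and the added intercept column is our choice.
```r
# Assumed data preparation for Example 1 (mlbench::PimaIndiansDiabetes2)
library(mlbench)
data("PimaIndiansDiabetes2")
dat <- na.omit(PimaIndiansDiabetes2)        # 392 complete records remain
y   <- as.numeric(dat$diabetes == "pos")    # 1 = positive, 0 = negative
Xs  <- scale(dat[, 1:8])                    # scaled independent variables
X   <- cbind(1, Xs)                                  # design matrix with intercept
W   <- cbind(1, Xs[, c("pressure", "mass", "age")])  # second logistic model covariates
mean(y)                                     # diabetes rate, about 0.332
```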
All independent variables are scaled before model fitting. Using the proposed EM algorithm, we obtain the fitted ZIBer regression models for $p_i$ and $\omega_i$. When the estimated probability $\hat{P}(Y_i = 1)$ exceeds the cut-off value, the corresponding person is identified as having diabetes. Based on the obtained model, the accuracy is 0.806. To study how much efficiency is lost when the zero-inflation structure is ignored, we also fit the typical logistic regression model. When the estimated probability from the typical logistic regression model exceeds the cut-off value, the corresponding person is identified as having diabetes; the resulting accuracy is 0.798. The ZIBer regression model therefore improves on the typical logistic regression model, but the improvement is modest, with only a 0.008 increase in accuracy. The diabetes rate is 0.332. If the success rate is high, replacing the typical logistic regression model with the ZIBer regression model may not yield a substantial improvement. This finding makes sense because the imbalance in this example is mild. The R codes using the proposed EM algorithm to obtain the MLEs of the ZIBer regression model parameters are given in Appendix A. We use the best cut-off of 0.53 to predict the value of Diabetes, identifying a record as positive if $\hat{P}(Y_i = 1) \geq 0.53$ and as negative (0) otherwise. The resulting Accuracy, Sensitivity, and Specificity are 0.806, 0.608, and 0.905, respectively.
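The three measures can be computed from the fitted probabilities with the short R sketch below, where prob_hat denotes the estimated $\hat{P}(Y_i = 1)$ from the fitted ZIBer model (an assumed object), y is the observed response from the earlier sketch, and positive diabetes status is treated as the event of interest.
```r
# Confusion-matrix measures at the cut-off 0.53
y_pred <- as.numeric(prob_hat >= 0.53)
tp <- sum(y_pred == 1 & y == 1); tn <- sum(y_pred == 0 & y == 0)
fp <- sum(y_pred == 1 & y == 0); fn <- sum(y_pred == 0 & y == 1)
accuracy    <- (tp + tn) / length(y)
sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
round(c(accuracy, sensitivity, specificity), 3)
```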
3.2. Example 2
A Taiwan credit data set is used for illustration. This data set can be obtained from the UC Irvine Machine Learning Repository via the hyperlink https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients; the data set was donated to the UC Irvine Machine Learning Repository on 25 January 2016. The response variable is the default payment status, coded as 1 for default and 0 for non-default. There are 30,000 records with the following 23 explanatory variables:
The given credit amount in NT dollars, including both the individual consumer's credit and his or her family (supplementary) credit.
Gender, 1 for male and 2 for female.
Education, 1 for graduate school, 2 for university, 3 for high school, and 4 for other.
Marital Status, 1 for married, 2 for single, and 3 for other.
Age in years.
The history of past payments in 2005 from September to April in descending order of the columns. The measurement scale is −1 for paying duly, 1 for a one-month payment delay, 2 for a two-month payment delay, …, 8 for an eight-month payment delay, and 9 for a nine-month payment delay or more.
The amount of the bill statements in 2005 in NT dollars from September to April in descending order of the columns.
The amount of previous payments in 2005 in NT dollars from September to April in descending order of the columns.
The response variable of Default Payment is labeled by $Y$. We use the columns of the given credit amount in NT dollars, Gender, Education, Marital Status, Age, and the payment statuses of April, May, June, July, and August as the independent variables, labeled by $x_{i1}$, $x_{i2}$, …, and $x_{i,10}$, respectively, to develop the ZIBer regression model. In many instances, we need to subjectively determine the independent variables of $\mathbf{w}_i$ for establishing the second logistic regression model. In this example, we select the columns of Gender, Education, Marital Status, and Age as the independent variables of $\mathbf{w}_i$ for the second logistic regression model.
No missing records were found in this data set. First, we searched for rows whose labels were not well defined in the independent variables and removed all such records from the data set. Finally, 4030 records are used in the illustrative data set for constructing the ZIBer regression model.
The summary of the credit card users in the categories of Default Payment, Gender, Education, and Marriage is given below.
There are 1436 default and 2594 non-default credit card users in the illustrative data set. The default proportion is 35.63%.
2385 female and 1645 male credit card users are in the illustrative data set. The proportions are 59.18% and 40.82%, respectively.
The number of credit card users in the education categories of Senior High School, College, Graduate, and Others are 624, 1713, 1683, and 10, respectively. The proportions are 15.48%, 42.51%, 41.76%, and 0.25%.
2076 credit card users are married, 1920 credit card users are single, and 34 credit card users are other. The proportions are 51.51%, 47.64%, and 0.85%.
Before model construction, we transform each categorical variable with three or more categories into dummy variables. Hence, the Marital Status needs to be transformed into two dummy variables, M1 and M2; the Marital Status values 1, 2, and 3 are coded by distinct pairs of (M1, M2) values, with (1, 1) assigned to Marital Status = 3. Let $w_{i1}$, $w_{i2}$, $w_{i3}$, $w_{i4}$, and $w_{i5}$ denote the covariates Gender, Education, M1, M2, and Age of the second logistic regression model. All independent variables are scaled before model fitting. Using the proposed EM algorithm, we obtain the fitted ZIBer regression models for $p_i$ and $\omega_i$.
When the estimated probability $\hat{P}(Y_i = 1)$ exceeds the cut-off value, the corresponding customer is identified as a default. Based on the obtained model, the Accuracy is 0.911.
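Prediction for this example follows the same pattern as in Example 1. The sketch below assumes X2 and W2 are the design matrices built from the selected covariates, beta_hat and gamma_hat are the EM estimates, and y2 is the observed default indicator (all assumed objects); the 0.5 cut-off is only an illustrative choice, since the cut-off used for this example is not restated above.
```r
# Hedged sketch: default classification from the fitted ZIBer model
p_hat     <- plogis(X2 %*% beta_hat)       # conditional default probability, Equation (1)
omega_hat <- plogis(W2 %*% gamma_hat)      # structural-zero probability, Equation (4)
prob_hat  <- (1 - omega_hat) * p_hat       # estimated P(Y_i = 1), Equation (3)
default_pred <- as.numeric(prob_hat >= 0.5)
mean(default_pred == y2)                   # accuracy against the observed defaults
```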