Rough-Fuzzy Based Synthetic Data Generation Exploring Boundary Region of Rough Sets to Handle Class Imbalance Problem
Abstract
1. Introduction
1.1. Literature Survey
1.2. Objective
1.3. Contribution
- The dataset is collected and preprocessed to remove irregularities from the data. The processed dataset is discretized by an efficient discretization algorithm [44], and the discretized dataset is fed into the RST-based feature selection algorithm to retain only the relevant features of the dataset.
- The negative, positive, and boundary regions of the target sets are identified, and the negative region is discarded as an outlier. The positive region is categorized into different groups based on the class labels of the objects. The fuzzy-membership values of each object in the boundary region are computed.
- A rough-fuzzy based oversampling and undersampling method is proposed to generate minority class objects and remove majority class objects with the help of the positive and boundary regions. In this method, the membership values computed using fuzzy theory play an important role in deciding which objects to remove and which new objects to generate.
- Finally, the method is validated by evaluating different performance measure metrics. Also, the method is compared with some related state-of-the-art methods with the help of the same metrics.
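The region computation described in the second bullet can be sketched as follows. The function `rough_regions`, the `key` mapping, and the toy data are illustrative assumptions rather than the paper's implementation: an equivalence class (under the selected, discretized attributes) whose members all carry one label contributes to the positive region of that label, while equivalence classes with mixed labels form the boundary region.

```python
from collections import defaultdict

def rough_regions(objects, labels, key):
    """Split objects into per-class positive regions and the boundary
    region. `key(obj)` maps an object to its equivalence class under
    the selected, discretized attributes."""
    blocks = defaultdict(list)
    for obj, lab in zip(objects, labels):
        blocks[key(obj)].append((obj, lab))

    positive = defaultdict(list)  # class label -> objects certainly in it
    boundary = []                 # objects in equivalence classes with mixed labels
    for members in blocks.values():
        labs = {lab for _, lab in members}
        if len(labs) == 1:
            positive[labs.pop()].extend(obj for obj, _ in members)
        else:
            boundary.extend(obj for obj, _ in members)
    return positive, boundary

# Toy data: the attribute value itself defines the equivalence classes.
objs = [0, 0, 1, 1, 2]
labs = ['a', 'a', 'a', 'b', 'b']
pos, bnd = rough_regions(objs, labs, key=lambda o: o)
# value 0 is certainly class 'a', value 2 certainly class 'b';
# the two objects with value 1 fall in the boundary region
```

Objects in no positive region of any target class of interest (the negative region) would be discarded as outliers, and the boundary objects are the ones whose fuzzy membership values are subsequently computed.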
1.4. Summary
2. Preprocessing and Feature Selection
2.1. Rough Set Theory
2.2. Feature Selection
Algorithm 1: Rough Set Theory based Feature Selection (RSTFS).
3. Rough-Fuzzy Based Oversampling Technique
Algorithm 2: Rough-Fuzzy based Class Balancing Method ().
- x belongs to a majority class: Here, object x is of a majority class, say c_i. If its fuzzy membership value for its own class is less than a threshold θ_1, i.e., if μ_i(x) < θ_1, then we remove the object from the dataset, i.e., we perform undersampling. We are not losing valuable information, because the object was in a region of uncertainty. On the other hand, if its fuzzy membership value for another class c_j (with j ≠ i) is greater than a threshold θ_2, i.e., if μ_j(x) > θ_2, then we allow x to generate a synthetic object of class c_j, provided c_j is of a minority class. In this case, we create a new object of class c_j from x and the representative object of cluster C_j. Thus, we create one object from x for each class c_j where μ_j(x) > θ_2 and c_j is of a minority class.
- x belongs to a minority class: Here, object x is of a minority class. Let μ_j(x) > θ_2 for k of the clusters. Then, for each of these k clusters, say C_j, a new object x_j of the class of x is created from x and the representative of C_j. So, if there are u objects of minority classes in the boundary region, the total number of new objects created is the sum of the corresponding k values over these u objects.
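The two cases above can be condensed into a small decision rule. The following is a sketch under assumed names (`boundary_action`, thresholds `theta_1` and `theta_2`), not the paper's code: majority objects with weak membership in their own class are dropped, high membership in a minority class triggers generation of a synthetic object of that class, and minority objects spawn one synthetic object of their own class per high-membership cluster.

```python
def boundary_action(x_label, memberships, minority, theta_1, theta_2):
    """Decide the fate of a boundary-region object.

    `memberships` maps each class label to the object's fuzzy
    membership value in that class; `minority` is the set of
    minority-class labels."""
    if x_label in minority:
        # one synthetic object of x's own class per high-membership cluster
        k = sum(1 for mu in memberships.values() if mu > theta_2)
        return ('oversample_own_class', k)
    if memberships[x_label] < theta_1:
        # uncertain majority object: safe to drop (undersampling)
        return ('undersample', None)
    targets = [c for c, mu in memberships.items()
               if c != x_label and c in minority and mu > theta_2]
    if targets:
        # generate one synthetic object per qualifying minority class
        return ('oversample', targets)
    return ('keep', None)

# A majority object uncertain about its own class is removed:
action = boundary_action('maj', {'maj': 0.3, 'min1': 0.6}, {'min1'}, 0.4, 0.5)
# action == ('undersample', None)
```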
Algorithm 3: Oversampling().
Input: x, c, and r_c /* x is a d-dimensional object, and r_c is the representative of the set of objects of class c */ Output: y, the generated object. Each attribute of y is computed from the corresponding attributes of x and r_c, and y is returned.
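A synthetic object of this kind is commonly produced by attribute-wise linear interpolation between the boundary object and the class representative. The sketch below shows that generic technique under assumed names (`generate_object`, factor `alpha`); it is not a reconstruction of the paper's exact Algorithm 3.

```python
import random

def generate_object(x, rep, alpha=None):
    """Create a synthetic object by attribute-wise linear interpolation
    between boundary object x and the class representative rep."""
    if alpha is None:
        alpha = random.random()   # interpolation factor in [0, 1)
    return [xi + alpha * (ri - xi) for xi, ri in zip(x, rep)]

y = generate_object([1.0, 4.0], [3.0, 0.0], alpha=0.5)
# y == [2.0, 2.0], the midpoint of x and rep
```

With alpha drawn at random, the new object lands somewhere on the segment joining x and rep, keeping it inside the region already supported by the data.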
4. Result and Discussion
Comparison with Other Methods
- SMOTE (Chawla et al. [11]): A well-known oversampling method for imbalanced datasets that generates new minority instances by linear interpolation between neighbouring minority points to even out the classes. A Random Forest classifier was used on the balanced dataset.
- OSM (Lee and Kim [59]): An overlap-sensitive margin (OSM) classifier that modifies the support vector machine with fuzzy logic and the KNN algorithm to deal with imbalanced and overlapping datasets.
- OC-SVM (Schölkopf et al. [60]): A one-class learning method that trains only on minority samples, without considering the majority samples; suitable for severely imbalanced datasets.
- NB-Tomek (Vuttipittayamongkol and Elyan [61]): Removes majority-class elements from the overlapping area while preventing excessive data removal, which could otherwise lead to greater information loss.
- Hybrid (AE+ANN) (Li et al. [20]): First identifies the overlapping subset; since this subset has a low imbalance ratio, a non-linear classifier is used to separate the classes within it.
- Kokkotis et al. [62] developed reliable machine learning (ML) prediction models for stroke disease that cope with a typically severe class imbalance problem. The effectiveness of the proposed ML approach was investigated with the well-known classifiers Random Forest (RF), Logistic Regression (LR), Multilayer Perceptron (MLP), XGBoost, Support Vector Machine (SVM), and K-nearest Neighbours (KNN). We have taken the LR model's performance for comparison, as it provides the best results.
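For reference, the interpolation step of the SMOTE baseline described in the first bullet can be sketched in a few lines; the function `smote_sample` and its parameters are illustrative, not taken from [11].

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Generate one SMOTE-style synthetic point: pick a random minority
    point, choose one of its k nearest minority neighbours, and
    interpolate at a random position on the segment joining them."""
    rng = rng or random.Random(0)
    x = rng.choice(minority)
    by_dist = sorted(minority, key=lambda p: math.dist(p, x))
    neigh = rng.choice(by_dist[1:k + 1])   # index 0 is x itself
    gap = rng.random()                     # interpolation factor in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neigh)]

X_min = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_point = smote_sample(X_min)
# new_point lies on the segment between a minority point and one of
# its nearest minority neighbours
```

The key difference from the proposed method is that SMOTE interpolates only between nearby minority points, whereas the rough-fuzzy method interpolates between a boundary object and a cluster representative, guided by fuzzy membership values.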
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Priscilla, C.V.; Prabha, D.P. Influence of optimizing XGBoost to handle class imbalance in credit card fraud detection. In Proceedings of the 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 August 2020; pp. 1309–1315.
- Rousso, R.; Katz, N.; Sharon, G.; Glizerin, Y.; Kosman, E.; Shuster, A. Automatic recognition of oil spills using neural networks and classic image processing. Water 2022, 14, 1127.
- Rodda, S.; Erothi, U.S.R. Class imbalance problem in the network intrusion detection systems. In Proceedings of the 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), Chennai, India, 3–5 March 2016; pp. 2685–2688.
- Song, Y.; Peng, Y. A MCDM-based evaluation approach for imbalanced classification methods in financial risk prediction. IEEE Access 2019, 7, 84897–84906.
- Liu, Y.; Loh, H.T.; Sun, A. Imbalanced text classification: A term weighting approach. Expert Syst. Appl. 2009, 36, 690–701.
- Tao, X.; Li, Q.; Guo, W.; Ren, C.; He, Q.; Liu, R.; Zou, J. Adaptive weighted over-sampling for imbalanced datasets based on density peaks clustering with heuristic filtering. Inf. Sci. 2020, 519, 43–73.
- Tasci, E.; Zhuge, Y.; Camphausen, K.; Krauze, A.V. Bias and Class Imbalance in Oncologic Data—Towards Inclusive and Transferrable AI in Large Scale Oncology Data Sets. Cancers 2022, 14, 2897.
- Vo, M.T.; Vo, A.H.; Nguyen, T.; Sharma, R.; Le, T. Dealing with the class imbalance problem in the detection of fake job descriptions. Comput. Mater. Contin. 2021, 68, 521–535.
- Jang, J.; Kim, Y.; Choi, K.; Suh, S. Sequential targeting: A continual learning approach for data imbalance in text classification. Expert Syst. Appl. 2021, 179, 115067.
- Liu, Z.; Tang, D.; Cai, Y.; Wang, R.; Chen, F. A hybrid method based on ensemble WELM for handling multi class imbalance in cancer microarray data. Neurocomputing 2017, 266, 641–650.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Ramentol, E.; Caballero, Y.; Bello, R.; Herrera, F. SMOTE-RSB*: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl. Inf. Syst. 2012, 33, 245–265.
- Xu, Z.; Shen, D.; Nie, T.; Kou, Y.; Yin, N.; Han, X. A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Inf. Sci. 2021, 572, 574–589.
- Srinilta, C.; Kanharattanachai, S. Application of natural neighbor-based algorithm on oversampling SMOTE algorithms. In Proceedings of the 2021 7th International Conference on Engineering, Applied Sciences and Technology (ICEAST), Bangkok, Thailand, 1–3 April 2021; pp. 217–220.
- Mishra, P.; Biancolillo, A.; Roger, J.M.; Marini, F.; Rutledge, D.N. New data preprocessing trends based on ensemble of multiple preprocessing techniques. TrAC Trends Anal. Chem. 2020, 132, 116045.
- Hasib, K.M.; Iqbal, M.; Shah, F.M.; Mahmud, J.A.; Popel, M.H.; Showrov, M.; Hossain, I.; Ahmed, S.; Rahman, O. A survey of methods for managing the classification and solution of data imbalance problem. arXiv 2020, arXiv:2012.11870.
- Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232.
- Sharma, A.; Singh, P.K.; Chandra, R. SMOTified-GAN for class imbalanced pattern classification problems. IEEE Access 2022, 10, 30655–30665.
- Srinivasan, R. Sentimental analysis from imbalanced code-mixed data using machine learning approaches. Distrib. Parallel Databases 2021, 41, 1573–7578.
- Li, Z.; Huang, M.; Liu, G.; Jiang, C. A hybrid method with dynamic weighted entropy for handling the problem of class imbalance with overlap in credit card fraud detection. Expert Syst. Appl. 2021, 175, 114750.
- Lee, J.; Park, K. GAN-based imbalanced data intrusion detection system. Pers. Ubiquitous Comput. 2021, 25, 121–128.
- Banerjee, A.; Bhattacharjee, M.; Ghosh, K.; Chatterjee, S. Synthetic minority oversampling in addressing imbalanced sarcasm detection in social media. Multimed. Tools Appl. 2020, 79, 35995–36031.
- Shafqat, W.; Byun, Y.C. A Hybrid GAN-Based Approach to Solve Imbalanced Data Problem in Recommendation Systems. IEEE Access 2022, 10, 11036–11047.
- Yafooz, W.M.; Alsaeedi, A. Sentimental Analysis on Health-Related Information with Improving Model Performance using Machine Learning. J. Comput. Sci. 2021, 17, 112–122.
- Suh, S.; Lee, H.; Lukowicz, P.; Lee, Y.O. CEGAN: Classification Enhancement Generative Adversarial Networks for unraveling data imbalance problems. Neural Netw. 2021, 133, 69–86.
- Imran, A.S.; Yang, R.; Kastrati, Z.; Daudpota, S.M.; Shaikh, S. The impact of synthetic text generation for sentiment analysis using GAN based models. Egypt. Inform. J. 2022, 23, 547–557.
- Mollas, I.; Chrysopoulou, Z.; Karlos, S.; Tsoumakas, G. ETHOS: A multi-label hate speech detection dataset. Complex Intell. Syst. 2022, 8, 4663–4678.
- Chen, H.; Li, T.; Fan, X.; Luo, C. Feature selection for imbalanced data based on neighborhood rough sets. Inf. Sci. 2019, 483, 1–20.
- Zhang, C.; Bi, J.; Xu, S.; Ramentol, E.; Fan, G.; Qiao, B.; Fujita, H. Multi-imbalance: An open-source software for multi-class imbalance learning. Knowl.-Based Syst. 2019, 174, 137–143.
- Behmanesh, M.; Adibi, P.; Karshenas, H. Weighted least squares twin support vector machine with fuzzy rough set theory for imbalanced data classification. arXiv 2021, arXiv:2105.01198.
- Saha, A.; Reddy, J.; Kumar, R. A fuzzy similarity based classification with Archimedean-Dombi aggregation operator. J. Intell. Manag. Decis. 2022, 1, 118–127.
- Elreedy, D.; Atiya, A.F. A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance. Inf. Sci. 2019, 505, 32–64.
- Wei, J.; Huang, H.; Yao, L.; Hu, Y.; Fan, Q.; Huang, D. NI-MWMOTE: An improving noise-immunity majority weighted minority oversampling technique for imbalanced classification problems. Expert Syst. Appl. 2020, 158, 113504.
- Das, B.; Krishnan, N.C.; Cook, D.J. RACOG and wRACOG: Two probabilistic oversampling techniques. IEEE Trans. Knowl. Data Eng. 2014, 27, 222–234.
- Shelke, M.S.; Deshmukh, P.R.; Shandilya, V.K. A review on imbalanced data handling using undersampling and oversampling technique. Int. J. Recent Trends Eng. Res. 2017, 3, 444–449.
- Muntasir Nishat, M.; Faisal, F.; Jahan Ratul, I.; Al-Monsur, A.; Ar-Rafi, A.M.; Nasrullah, S.M.; Reza, M.T.; Khan, M.R.H. A comprehensive investigation of the performances of different machine learning classifiers with SMOTE-ENN oversampling technique and hyperparameter optimization for imbalanced heart failure dataset. Sci. Program. 2022, 2022, 3649406.
- Swana, E.F.; Doorsamy, W.; Bokoro, P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors 2022, 22, 3246.
- Liu, G.; Yang, Y.; Li, B. Fuzzy rule-based oversampling technique for imbalanced and incomplete data learning. Knowl.-Based Syst. 2018, 158, 154–174.
- Ren, R.; Yang, Y.; Sun, L. Oversampling technique based on fuzzy representativeness difference for classifying imbalanced data. Appl. Intell. 2020, 50, 2465–2487.
- He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
- Yu, H.; Yang, X.; Zheng, S.; Sun, C. Active learning from imbalanced data: A solution of online weighted extreme learning machine. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 1088–1103.
- Seiffert, C.; Khoshgoftaar, T.M.; Van Hulse, J.; Napolitano, A. RUSBoost: Improving classification performance when training data is skewed. In Proceedings of the 2008 19th International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4.
- Kazerouni, A.; Zhao, Q.; Xie, J.; Tata, S.; Najork, M. Active learning for skewed data sets. arXiv 2020, arXiv:2005.11442.
- Qu, W.; Yan, D.; Sang, Y.; Liang, H.; Kitsuregawa, M.; Li, K. A novel Chi2 algorithm for discretization of continuous attributes. In Proceedings of the Progress in WWW Research and Development: 10th Asia-Pacific Web Conference, APWeb 2008, Shenyang, China, 26–28 April 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 560–571.
- Lavangnananda, K.; Chattanachot, S. Study of discretization methods in classification. In Proceedings of the 2017 9th International Conference on Knowledge and Smart Technology (KST), Chonburi, Thailand, 1–4 February 2017; pp. 50–55.
- Das, A.K.; Chakrabarty, S.; Pati, S.K.; Sahaji, A.H. Applying restrained genetic algorithm for attribute reduction using attribute dependency and discernibility matrix. In Proceedings of the Wireless Networks and Computational Intelligence: 6th International Conference on Information Processing, ICIP, Bangalore, India, 10–12 August 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 299–308.
- Kumar, V.; Minz, S. Feature selection: A literature review. SmartCR 2014, 4, 211–229.
- Basu, S.; Das, S.; Ghatak, S.; Das, A.K. Strength pareto evolutionary algorithm based gene subset selection. In Proceedings of the 2017 International Conference on Big Data Analytics and Computational Intelligence (ICBDAC), Andhra Pradesh, India, 23–25 March 2017; pp. 79–85.
- Renigier-Biłozor, M.; Janowski, A.; d’Amato, M. Automated valuation model based on fuzzy and rough set theory for real estate market with insufficient source data. Land Use Policy 2019, 87, 104021.
- Yang, X.; Chen, H.; Li, T.; Luo, C. A noise-aware fuzzy rough set approach for feature selection. Knowl.-Based Syst. 2022, 250, 109092.
- Qiu, Z.; Zhao, H. A fuzzy rough set approach to hierarchical feature selection based on Hausdorff distance. Appl. Intell. 2022, 52, 11089–11102.
- Sengupta, S.; Das, A.K. A study on rough set theory based dynamic reduct for classification system optimization. Int. J. Artif. Intell. Appl. 2014, 5, 35.
- Liu, Y.; Jiang, Y.; Yang, J. Feature reduction with inconsistency. Int. J. Cogn. Informatics Nat. Intell. (IJCINI) 2010, 4, 77–87.
- Ruspini, E.H.; Bezdek, J.C.; Keller, J.M. Fuzzy clustering: A historical perspective. IEEE Comput. Intell. Mag. 2019, 14, 45–55.
- Ding, W.; Chakraborty, S.; Mali, K.; Chatterjee, S.; Nayak, J.; Das, A.K.; Banerjee, S. An unsupervised fuzzy clustering approach for early screening of COVID-19 from radiological images. IEEE Trans. Fuzzy Syst. 2021, 30, 2902–2914.
- Marcot, B.G.; Hanea, A.M. What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis? Comput. Stat. 2021, 36, 2009–2031.
- Yadav, S.; Shukla, S. Analysis of k-fold cross-validation over hold-out validation on colossal datasets for quality classification. In Proceedings of the 2016 IEEE 6th International Conference on Advanced Computing (IACC), Bhimavaram, India, 27–28 February 2016; pp. 78–83.
- Caelen, O. A Bayesian interpretation of the confusion matrix. Ann. Math. Artif. Intell. 2017, 81, 429–450.
- Lee, H.K.; Kim, S.B. An overlap-sensitive margin classifier for imbalanced and overlapping data. Expert Syst. Appl. 2018, 98, 72–83.
- Schölkopf, B.; Platt, J.C.; Shawe-Taylor, J.; Smola, A.J.; Williamson, R.C. Estimating the support of a high-dimensional distribution. Neural Comput. 2001, 13, 1443–1471.
- Vuttipittayamongkol, P.; Elyan, E. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf. Sci. 2020, 509, 47–70.
- Kokkotis, C.; Giarmatzis, G.; Giannakou, E.; Moustakidis, S.; Tsatalas, T.; Tsiptsios, D.; Vadikolias, K.; Aggelousis, N. An Explainable Machine Learning Pipeline for Stroke Prediction on Imbalanced Data. Diagnostics 2022, 12, 2392.
- Hoo, Z.H.; Candlish, J.; Teare, D. What is an ROC curve? Emerg. Med. J. 2017, 34, 357–359.
- Al-shami, T.M. An improvement of rough sets’ accuracy measure using containment neighborhoods with a medical application. Inf. Sci. 2021, 569, 110–124.
- Al-Shami, T.M.; Alshammari, I. Rough sets models inspired by supra-topology structures. Artif. Intell. Rev. 2022, 1–29.
- Szlobodnyik, G.; Farkas, L. Data augmentation by guided deep interpolation. Appl. Soft Comput. 2021, 111, 107680.
- Bayer, M.; Kaufhold, M.A.; Reuter, C. A survey on data augmentation for text classification. ACM Comput. Surv. 2022, 55, 1–39.
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | True Positive | False Negative |
| Negative | False Positive | True Negative |
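The performance measures reported below are the standard ones derived from these four confusion counts; a minimal computation (the helper name `metrics` is illustrative):

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall, and F-measure from the four
    confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure

a, p, r, f = metrics(tp=90, fn=10, fp=10, tn=90)
# on this balanced example all four measures equal 0.9 (up to rounding)
```

Note that when a classifier predicts no positives at all, precision is undefined (tp + fp = 0) and recall is 0, which is exactly the pattern marked with "-" in the results table.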
All values are percentages; A = Accuracy, P = Precision, R = Recall, F = F-measure, reported for the ID, BO, and BS settings.

| Classifier | ID A | ID P | ID R | ID F | BO A | BO P | BO R | BO F | BS A | BS P | BS R | BS F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
Naive Bayes | 94 | 21 | 19 | 20 | 82 | 99 | 63 | 77 | 81 | 99 | 59 | 74 |
Logistic | 95 | - | 0 | - | 82 | 83 | 79 | 81 | 93 | 91 | 95 | 93 |
MLP | 94 | 16 | 04 | 07 | 87 | 92 | 80 | 86 | 98 | 99 | 95 | 97 |
SGD | 95 | - | 0 | - | 82 | 81 | 80 | 81 | 96 | 96 | 95 | 95 |
SimpleLogistic | 95 | - | 0 | - | 82 | 83 | 79 | 81 | 94 | 91 | 95 | 93 |
SMO | 95 | - | 0 | - | 81 | 81 | 79 | 80 | 95 | 94 | 95 | 95 |
Voted Perceptron | 95 | - | 0 | - | 79 | 76 | 80 | 78 | 95 | 97 | 93 | 95 |
IBk | 92 | 14 | 13 | 14 | 92 | 88 | 97 | 92 | 96 | 97 | 95 | 96 |
KStar | 95 | 15 | 01 | 03 | 90 | 89 | 91 | 90 | 97 | 99 | 95 | 97 |
AdaBoost | 95 | - | 0 | - | 84 | 84 | 83 | 83 | 95 | 94 | 96 | 95 |
ASC | 95 | - | 0 | - | 83 | 98 | 64 | 78 | 93 | 90 | 95 | 92 |
Bagging | 95 | - | 0 | - | 92 | 92 | 91 | 92 | 98 | 99 | 95 | 97 |
CVR | 95 | - | 0 | - | 90 | 90 | 88 | 89 | 98 | 99 | 95 | 97 |
FilteredClassifier | 95 | - | 0 | - | 89 | 91 | 86 | 88 | 98 | 99 | 95 | 97 |
ICO | 95 | - | 0 | - | 87 | 89 | 82 | 85 | 97 | 98 | 95 | 97 |
LogitBoost | 95 | - | 0 | - | 87 | 89 | 82 | 85 | 97 | 98 | 95 | 97 |
MCC | 95 | - | 0 | - | 82 | 83 | 79 | 81 | 93 | 91 | 95 | 93 |
MCC Updateable | 95 | - | 0 | - | 82 | 81 | 80 | 81 | 96 | 96 | 95 | 95 |
Random Committee | 93 | 11 | 07 | 09 | 93 | 89 | 96 | 92 | 97 | 97 | 95 | 96 |
RFC | 93 | 07 | 05 | 06 | 92 | 88 | 96 | 92 | 96 | 96 | 96 | 96 |
RandomSubSpace | 95 | - | 0 | - | 88 | 96 | 77 | 85 | 98 | 99 | 95 | 97 |
Decision Table | 95 | - | 0 | - | 88 | 91 | 81 | 86 | 97 | 99 | 95 | 97 |
JRip | 93 | 23 | 02 | 03 | 91 | 91 | 90 | 90 | 98 | 99 | 95 | 97 |
PART | 95 | 18 | 03 | 06 | 91 | 88 | 93 | 91 | 97 | 99 | 95 | 97 |
Decision Stump | 95 | - | 0 | - | 80 | 80 | 80 | 80 | 77 | 79 | 78 | 77 |
HoeffdingTree | 95 | - | 0 | - | 80 | 80 | 80 | 80 | 96 | 97 | 95 | 96 |
J48 | 95 | - | 0 | - | 92 | 91 | 93 | 92 | 98 | 99 | 95 | 97 |
LMT | 95 | - | 0 | - | 92 | 90 | 93 | 91 | 98 | 99 | 95 | 97 |
Random Forest | 94 | 10 | 03 | 05 | 92 | 89 | 95 | 92 | 97 | 99 | 95 | 97 |
Random Tree | 93 | 15 | 13 | 14 | 92 | 88 | 96 | 92 | 96 | 96 | 96 | 96 |
REPTree | 95 | 14 | 01 | 01 | 91 | 90 | 91 | 90 | 98 | 99 | 95 | 97 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Naushin, M.; Das, A.K.; Nayak, J.; Pelusi, D. Rough-Fuzzy Based Synthetic Data Generation Exploring Boundary Region of Rough Sets to Handle Class Imbalance Problem. Axioms 2023, 12, 345. https://doi.org/10.3390/axioms12040345