1. Introduction
Technological advances have enabled the private and public sectors to generate and collect vast amounts of data from various sources, including social media interactions, transactional records, geospatial information, biometric identifiers, and governmental databases. According to the 11th edition of ‘Data Never Sleeps’, 241 million emails were sent every minute in 2023 [1]. Consequently, the volume of raw data may reach zettabytes (ZB) in a variety of formats, such as text, audio, image, and video.
The velocity at which data are generated, processed, and analyzed adds another dimension to this data explosion. The International Data Corporation (IDC) and Seagate estimate that by 2025, approximately 163 ZB of data will have been generated, indicating that the volume may have grown tenfold in the last eight years [2]. Data can therefore be characterized by their volume, variety, and velocity, known as the three Vs [3], which define the phenomenon of Big Data [4].
Analyzing Big Data represents a significant opportunity for companies, governments, and society to extract meaningful insights. This process can help organizations gain a competitive edge, enhance decision-making, and drive innovation [3,5,6]. Big Data can therefore also be described by its value, i.e., the benefits and insights obtained from analyzing the raw data, adding a fourth V-characteristic. In this sense, the machine learning and artificial intelligence communities have identified a niche opportunity to develop techniques that can discover hidden patterns and predict future values or trends from large amounts of complex data in sectors such as healthcare, cybersecurity, finance, marketing, manufacturing, and smart grids [7,8,9,10,11].
In 2021, Andrew Ng argued that much of the effort in deploying machine learning models has been focused on optimizing or creating new algorithms [12]. This model-centric approach ignores the importance of data quality in building effective machine learning systems [13]. The veracity of data has been recognized as a critical property for obtaining reliable and actionable insights from large datasets. This fifth V-characteristic concerns the trustworthiness of the captured data: low veracity, caused by issues such as data redundancy, inconsistency, and noise, can lead machine learning models to yield inaccurate predictions, biased results, and reduced performance. In response, rather than concentrating solely on the algorithms themselves, it has been proposed that machine learning models be developed and applied under a data-centric approach, which aims to improve model performance through high-quality data obtained with preprocessing techniques that address these issues.
One of the data inconsistencies leading to inaccurate behavior in classification models is a significant difference in sample sizes per class in the training dataset. This type of bias, where the classes are not equally distributed, is well known as the class imbalance problem, which causes trained algorithms to favor the predominant class in their predictions [14,15]. The issue becomes severe when the minority classes are of major interest, i.e., when the cost of incorrectly labeling an example of an underrepresented class is very high [16]. Therefore, the data preprocessing step is paramount for transforming raw Big Data through data cleaning, extraction, and transformation, resulting in reduced and cleaned data, also known as smart data [7,9,17].
Solutions to the class imbalance problem fall into two large groups: algorithm-level and data-level approaches. Algorithm-level methods modify existing algorithms to learn from imbalanced datasets. In contrast, data-level solutions balance the training dataset by either reducing (under-sampling) or increasing (oversampling) the sample sizes of the different classes. Although data imbalance has traditionally been seen as the main cause of poor classifier performance, several studies have shown that other data irregularities, such as class overlap [18] and high dimensionality [19], can also exacerbate the problem, introducing a unique set of challenges. Consequently, a third research line has emerged that combines the strengths of data-level and algorithm-level techniques.
The class imbalance problem has attracted much attention for many years, leading to numerous solutions addressing biased learning. Data-level solutions have been the most exploited of the three research lines because they can be applied to various problems across different domains. For example, in 2002, Chawla et al. [20] introduced the Synthetic Minority Oversampling Technique (SMOTE), an oversampling technique that balances datasets by generating synthetic examples of the minority class through interpolation between selected minority instances and their nearest neighbors. By 2019, Kovács [21] had documented approximately 85 variants of SMOTE. In the case of under-sampling techniques, which select instances of the majority class to be removed, the community has adopted the strategy of filtering out examples that can be considered noisy, redundant, or borderline. The simplest way to identify such examples is to analyze the local distribution of the data by computing the k-nearest neighbors [22].
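As an illustration, the core interpolation step of SMOTE can be sketched as follows. This is a minimal, non-distributed sketch using numpy; the function and parameter names are ours, not those of the original implementation:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between randomly selected minority instances and one of their
    k nearest neighbors (within the minority class)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]     # indices of the k nearest neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)               # pick a minority instance
        j = nn[i, rng.integers(k)]        # pick one of its neighbors
        gap = rng.random()                # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Since each synthetic point is a convex combination of two minority instances, the generated samples always lie on the line segments connecting minority-class neighbors.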
Although the class imbalance problem has been effectively addressed in standard scenarios, Big Data introduces additional challenges where some conventional assumptions no longer hold, making traditional data preprocessing methods infeasible [7]. Therefore, data-level solutions must consider the specialized infrastructure needed to process large datasets on parallel or distributed systems, such as Apache Spark, and programming languages like Scala [23]. Nevertheless, some strategies in the literature have been adapted into scalable solutions. For example, Basgall et al. [23] proposed SMOTE-Big Data (SMOTE-BD), an implementation in Apache Spark. This oversampling technique is based on kNN-IS, a distributed k-nearest neighbor model that enhances the runtime efficiency of the nearest neighbor search [24].
In Big Data environments, as in standard problems, data-level solutions are based on the nearest neighbor rule, typically employing metrics such as the Euclidean distance to assess the similarity between examples. However, in high-dimensional datasets, computed distances can make samples appear nearly identical, complicating the differentiation between them [25,26]. A dataset with a large number of features relative to the number of samples gives rise to the phenomenon known as the curse of dimensionality: the volume of the space increases exponentially with the number of dimensions, leading to data sparsity. The Euclidean distance thus loses its discriminatory power, impacting the effectiveness of learning algorithms because distances between instances become less informative [27]. In this paper, we hypothesize that fractional norms, i.e., Minkowski p-norms with p < 1, can alleviate the effect of high dimensionality in a well-known data-level solution. In particular, we employ a dissimilarity approach that maps the dataset into a lower-dimensional space whose dimensions are defined by vectors of pairwise dissimilarities between examples, commonly given by the Euclidean distance [28,29].
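The two ingredients just described, the fractional Minkowski distance and the dissimilarity-space mapping, can be sketched in a few lines. This is a local, non-distributed illustration with our own function names; the paper's actual implementation runs on Apache Spark:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance between two vectors. For 0 < p < 1 this is a
    fractional 'norm' (not a true metric, since the triangle inequality
    fails), but it remains usable as a dissimilarity measure."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def dissimilarity_space(X, R, p):
    """Map each row of X to the vector of its dissimilarities to the
    prototype (representation) set R. The result has |R| dimensions,
    regardless of the original feature count."""
    return np.array([[minkowski(x, r, p) for r in R] for x in X])
```

Because the transformed dimensionality equals the size of the prototype set R, choosing |R| much smaller than the original feature count yields the dimensionality reduction exploited in the paper.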
We note that data-centric solutions for class imbalance in Big Data scenarios have been less explored, and some approaches address data issues in isolation. Therefore, we propose a scalable hybrid approach that combines well-known data-level solutions based on distance metrics. First, the dissimilarity approach maps the original feature space into a low-dimensional dissimilarity space using a fractional norm. Then, in this transformed space, represented by dissimilarity vectors, we apply SMOTE to achieve class balance.
We conducted a comprehensive experimental study on nine imbalanced, high-dimensional Big Data datasets to evaluate the proposal’s performance. The datasets were adapted to two-class problems, comprising 24,832 features and 21,025 instances. The decision tree employed was taken from Spark’s MLlib toolkit. The proposal was compared against the application of SMOTE in the original space and against scenarios without any balancing technique. A nonparametric statistical test was used to determine whether our proposal statistically outperformed the other methods. In summary, the contributions of this paper are as follows:
We address the problem of class imbalance in the presence of high dimensionality and class overlap.
We explore the suitability of fractional norms as an alternative to the problem of Euclidean distance in high-dimensional datasets.
We employ a dissimilarity-based representation to mitigate high dimensionality and class overlap issues.
We implement a scalable hybrid approach in Apache Spark.
The rest of the paper is organized as follows. Section 2 describes the methods and techniques used in this work. Section 3 presents the experimental setup. Section 4 discusses the results. Finally, Section 5 remarks on the main findings and outlines further research.
3. Experimental Setup
For the experiments presented in this paper, we created an artificial dataset from an original dataset that contains 21,025 instances, 24,832 features, and 17 numerically labeled classes from 0 to 16.
The original dataset, known as the Indian pine scene, was obtained from the Computational Intelligence Group site [40]. It was collected by NASA’s AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) sensor, which provides 224 bands with a spectral resolution of 10.0 nm [41]. The scene consists of approximately one-third forest, while the rest is cultivated fields or other residual natural vegetation.
Table 1 depicts the ground truth classes and their corresponding instances for the Indian pine scene.
Although datasets with high dimensionality are available in several repositories, some have characteristics that fall outside this article’s scope, such as a small number of instances or sparse data. Inspired by the works of Rendón et al. [42] and Charte et al. [43], which demonstrated how a neural network can transform data to derive new features, we employed a Multi-layer Perceptron (MLP) model trained on the full dataset. Following the parameter specification in [42], we built the MLP and extracted, from the output of a hidden layer, the newly transformed dataset with a high dimensionality of 24,832 features. Due to its size, the dataset will be available upon request.
Since we are interested in two-class imbalanced datasets, we joined several classes to form the majority class, leaving a single class as the minority class. During this process, we omitted classes that constitute less than
of the dataset (i.e., Alfalfa, Grass-pasture-mowed, Oats, and Stone-Steel-Towers). This resulted in a total of nine imbalanced datasets with several imbalance ratios (IRs).
Table 2 shows all the characteristics of the datasets, where the number that appears in the dataset name denotes the class chosen as the minority class.
All the datasets were partitioned using 70% of the instances for training and the remaining 30% for testing. We used the Decision Tree (DT) classifier from the Spark MLlib library, configured with the Gini impurity measure, a maximum depth of 5, and 32 bins.
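For reference, the Gini impurity that drives the DT split criterion is one minus the sum of squared class proportions. A minimal pure-Python sketch (ours, not the MLlib implementation):

```python
def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2, where p_k is the
    proportion of class k among the node's labels. 0 means pure."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((m / n) ** 2 for m in counts.values())
```

A pure node gives 0, and a perfectly balanced two-class node gives 0.5, the maximum for the binary case.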
The fractional values of p used were 0.25, 0.33, 0.50, 0.66, and 0.75, a choice based on considerations from published works. For the dissimilarity representation, the selected sizes for R were 10, 50, 100, 200, 500, 1000, and 2000 instances. In each case, R was composed to ensure an equitable representation of examples from both the majority and minority classes.
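The class-balanced composition of R described above can be sketched as follows; this is a local illustration with our own function names, assuming a binary problem and an even r_size:

```python
import numpy as np

def build_prototype_set(X, y, r_size, rng=None):
    """Select r_size prototypes, drawing an equal share uniformly from
    each of the two classes (sampling with replacement only if a class
    is smaller than its share). Assumes r_size is even."""
    rng = np.random.default_rng(rng)
    protos = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        take = r_size // 2
        chosen = rng.choice(idx, size=take, replace=len(idx) < take)
        protos.append(X[chosen])
    return np.vstack(protos)
```

Equal shares per class keep the dissimilarity dimensions from being dominated by majority-class prototypes, which matters when the imbalance ratio is high.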
The Big Data architecture was deployed on Google Cloud using a credit grant obtained from the “Google for Education” program [44]. The Spark 3.1.1 cluster was configured with one master node equipped with 32 GB of memory and 8 vCPUs, and three worker nodes, each with 64 GB of memory and 16 vCPUs.
4. Results and Discussion
The experiments were conducted to evaluate three strategies for dealing with imbalanced, high-dimensional Big Data datasets: (1) SMOTE, (2) dissimilarity with fractional norms (Ddp), and (3) the joint use of dissimilarity and SMOTE (Ddp+SMOTE). For clarity and simplicity, for strategies (2) and (3) we report only the two best results in terms of AUC and G-mean.
Table 4 presents the experimental results for each strategy on the nine datasets. It includes the values of p in the Minkowski distance metric used to construct the dissimilarity matrix and the resulting dimensionality (number of features) obtained when applying this mapping. Additionally, the classification results on the original imbalanced dataset and with SMOTE are provided for comparison.
The experimental results using SMOTE align well with the general conclusions of numerous studies on standard problems, confirming its effectiveness in dealing with the class imbalance problem. Despite the high dimensionality of the data, SMOTE consistently improves both the AUC and G-mean in all cases when compared with the baseline strategy (no preprocessing). However, this enhancement comes at the cost of increasing the size of the dataset.
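Since the comparison rests on the G-mean, it is worth recalling that this metric is the geometric mean of the per-class recalls, so it collapses to zero whenever either class is entirely misclassified. A minimal binary-case sketch (ours, for illustration):

```python
import math

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity (recall on the minority class,
    labeled 1) and specificity (recall on the majority class, labeled 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    sens = tp / pos if pos else 0.0
    spec = tn / neg if neg else 0.0
    return math.sqrt(sens * spec)
```

Unlike plain accuracy, the G-mean cannot be inflated by predicting only the majority class, which is why it is favored in imbalanced settings.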
When the imbalanced dataset is transformed into a dissimilarity representation with fractional norms, the results generally exhibit lower performance than those obtained with the original dataset. However, it is important to note that the dataset still presents the class imbalance problem in this new representation. Despite this, the decrease in performance is not always drastic, even though the dimensionality is reduced by between 92% and 98% in some cases. Similarly, although it is challenging to recommend a specific value of p, some fractional values consistently yield better results.
The proposed approach of combining dissimilarity with fractional norms followed by SMOTE (Ddp+SMOTE) demonstrated the most promising performance in terms of both AUC and G-mean. Notably, this improvement was achieved while maintaining a lower dimensionality than the original datasets, demonstrating that integrating these techniques can synergistically enhance the model’s performance on the minority class.
To determine whether there are significant differences among the results, we used a nonparametric statistical test, the Friedman test. Additionally, a post hoc test was conducted using the Bonferroni–Dunn method to determine which algorithms perform better, equally, or worse.
Table 5 summarizes the average ranks of each strategy based on the Friedman test, where the best-performing strategy obtains the lowest rank and the worst-performing one receives the highest rank [48].
According to the results, Ddp+SMOTE ranked best overall, with an average rank of 1.1, indicating that it consistently outperformed the other strategies across the different datasets. SMOTE ranked second with an average rank of 2. The imbalanced strategy ranked third, and finally, Ddp obtained the highest average rank, indicating that it was the worst-performing strategy. Based on the Iman and Davenport statistic, distributed according to the F distribution for four algorithms and nine datasets, and the computed p-value, we reject the null hypothesis, indicating that there are significant differences among the algorithms tested.
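The average ranks underlying the Friedman test are obtained by ranking the methods within each dataset (1 = best) and averaging over datasets. A minimal sketch with our own names; ties are broken arbitrarily here, whereas the Friedman test proper assigns tied methods their average rank:

```python
import numpy as np

def average_ranks(scores):
    """scores: (n_datasets, n_methods) matrix where higher is better.
    Rank methods within each dataset (1 = best, ties broken arbitrarily)
    and average the ranks over datasets."""
    n_datasets, n_methods = scores.shape
    ranks = np.zeros_like(scores, dtype=float)
    for i in range(n_datasets):
        order = np.argsort(-scores[i])            # descending: best first
        ranks[i, order] = np.arange(1, n_methods + 1)
    return ranks.mean(axis=0)
```

An average rank near 1 (such as the 1.1 reported for Ddp+SMOTE) means the method was best on nearly every dataset.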
Further analysis using the Bonferroni–Dunn procedure as a post hoc method is presented in Table 6, which shows the p-values obtained over the results of the Friedman test, using Ddp+SMOTE as the control algorithm. We reject the null hypothesis for the following pairs, indicating that the control algorithm is superior: Ddp+SMOTE vs. Ddp and Ddp+SMOTE vs. Imbalanced. However, there is no significant difference between Ddp+SMOTE and SMOTE, suggesting that they perform similarly well.