Article

Imbalanced Data Parameter Optimization of Convolutional Neural Networks Based on Analysis of Variance

School of Mathematical Science, Heilongjiang University, Harbin 150080, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 9071; https://doi.org/10.3390/app14199071
Submission received: 27 August 2024 / Revised: 25 September 2024 / Accepted: 4 October 2024 / Published: 8 October 2024
(This article belongs to the Special Issue Motion Control for Robots and Automation)

Abstract
Classifying imbalanced data is important because accurately categorizing minority class samples has significant practical value, and the problem has garnered considerable interest across many scientific domains. This study primarily uses analysis of variance (ANOVA) to investigate the main and interaction effects of different parameters on imbalanced data, aiming to optimize convolutional neural network (CNN) parameters to improve minority class sample recognition. The CIFAR-10 and Fashion-MNIST datasets are used to extract samples with imbalance ratios of 25:1, 15:1, and 1:1. To thoroughly assess model performance on imbalanced data, we employ various evaluation metrics, such as accuracy, recall, F1 score, P-mean, and G-mean. In highly imbalanced datasets, optimizing the learning rate significantly affects all performance metrics. The interaction between the learning rate and kernel size significantly impacts minority class samples in moderately imbalanced datasets. Through parameter optimization, the accuracy of the CNN model on the 25:1 highly imbalanced CIFAR-10 and Fashion-MNIST datasets improves by 14.20% and 5.19% compared to the default model and by 8.21% and 3.87% compared to the undersampling model, respectively, while also enhancing other evaluation metrics for minority classes.

1. Introduction

In the current era, which is dominated by data, deep learning techniques, especially convolutional neural networks (CNNs), have achieved significant success in various fields, including but not limited to facial and image recognition [1] and medical imaging [2]. Despite the widespread application and success of CNNs in various domains, research on CNN-based models for imbalanced data classification remains underdeveloped. However, the issue of data imbalance is very common in real-life scenarios, such as in medical diagnosis [3], fraud detection [4], and fault detection [5].
For imbalanced data, many traditional methods are available [6], including undersampling [7], oversampling [8], cost-sensitive learning [9], and ensemble learning [10]. These approaches enhance the classification performance of minority class samples by rebalancing the data, using ensemble models, or modifying the classifier’s cost function. First, the Synthetic Minority Oversampling Technique (SMOTE) employs hybrid sampling to balance the sample distribution in the dataset by generating new minority class samples, achieving favorable results [11]. This method enhances the model’s ability to recognize minority classes by synthesizing new minority class samples, allowing the model to focus more on these classes during training. Secondly, cost-sensitive methods improve the classification performance on imbalanced samples by introducing a weighted cost function in the CNN [12]. This approach increases the loss cost for small samples, thereby enhancing the model’s learning of minority class features and reducing the bias towards majority class samples. Finally, ensemble learning methods, which combine multiple classifiers, have shown significant effectiveness in addressing imbalanced sample classification problems [10]. Techniques such as bagging [13] and boosting [14] aggregate multiple weak classifiers into a strong classifier, effectively improving the model’s recognition ability for minority class samples.
The performance and accuracy of CNNs are primarily influenced by network architecture and parameter selection. The LeNet-5 network architecture, developed by LeCun et al., marked the beginning of deep learning research [15]. Subsequently, the well-known AlexNet architecture, introduced by Krizhevsky et al., further advanced the field of deep learning [1]. The VGG network, created by Simonyan and Zisserman, and the ResNet network developed by He et al. are among the most widely used and best-performing networks to date [16,17]. In terms of parameter selection, researchers have made significant efforts. However, parameter selection for imbalanced data has not been extensively studied. This is primarily because there are many parameters that affect the classification performance of imbalanced data, and many of these parameters interact with each other. For example, learning rate, batch size, and regularization parameters can all have a significant impact on model performance [18]. Moreover, there is diversity in the evaluation metrics for classification performance on imbalanced data. Unlike balanced data, metrics such as accuracy, ROC, and AUC do not comprehensively assess the performance of imbalanced classification problems [8,19]. In practical scenarios, metrics like recall for the minority class, G-mean, and F1 score might be of greater concern. For example, in disease diagnosis, where the number of patients with the disease can be small, whether 100% of the diseased individuals can be identified is often of utmost importance to doctors. In the face of the aforementioned complex situations, simple parameter-tuning methods often cannot meet the needs of parameter selection for imbalanced samples. Therefore, this paper proposes an experimental design and parameter selection method based on factorial design, which systematically analyzes the impact of parameters on model performance.
Fractional factorial design [20] is a widely used statistical method in experimental design that is primarily employed to evaluate the impact of multiple factors and their interactions on experimental outcomes. By systematically varying the levels of multiple factors to study their effects on the response variable, fractional factorial design can effectively identify the main effects and interactions of each factor. The references provide a comprehensive and systematic approach for model parameter optimization [21,22].
This study proposes a systematic experimental method based on factorial design to improve the performance of the CNN model on imbalanced data. Our research uses analysis of variance (ANOVA) to explore the main effects and interaction effects of parameters such as the learning rate, dropout rate, and kernel size in handling imbalanced data, with the aim of identifying the optimal configuration for imbalanced datasets. The experimental results demonstrate that the optimized CNN model significantly improves the classification performance of minority class samples when handling 25:1 imbalanced datasets extracted from CIFAR-10 and Fashion-MNIST.
In this context, we summarize our contributions as follows:
  • We combine domain expertise in a CNN with parameter optimization, first using fractional factorial design and ANOVA to evaluate the significant relationships between model parameters and evaluation metrics for minority class samples, such as accuracy, recall, P-mean, and G-mean. We attempt to use ANOVA to analyze whether these parameters exhibit main effects and interaction effects.
  • Based on the ANOVA results, we identify and optimize the parameters that significantly impact the imbalanced data. The optimized model is compared with the undersampling method for various evaluation metrics for minority class classification in imbalanced data, thereby confirming the importance of parameter selection in handling imbalanced data.

2. Related Work

Methods for imbalanced data classification are typically categorized into two main groups: data-processing techniques and algorithm- or model-oriented methods. The studies in both groups have laid a solid theoretical foundation for extending deep learning methods to imbalanced data classification.
In terms of data-processing techniques, the main approaches are oversampling and undersampling [23,24,25]. These methods are designed to equalize the number of samples in different classes or to minimize noise, thereby improving the model’s performance when dealing with imbalanced data. Undersampling eliminates data, thereby decreasing the total amount of information available for the model to learn. Conversely, oversampling boosts the number of minority class samples, which enlarges the training dataset and can lead to overfitting [26]. To address this issue, Mani and Zhang [27] proposed using K-nearest neighbors (K-NN) to calculate the distances between the sample to be classified and the samples in the training set, selecting majority class samples for removal. Chawla et al. [8] introduced the SMOTE, which creates synthetic samples for the minority class by interpolating between existing minority class samples. Unlike traditional methods that replicate minority class samples, the SMOTE reduces the risk of overfitting and significantly enhances classifier performance on imbalanced datasets.
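To make the interpolation idea concrete, the following minimal NumPy sketch generates SMOTE-style synthetic samples; `X_min` is assumed to be a 2D array of flattened minority-class samples, and the neighbour search and seeding are illustrative choices rather than the exact procedure of [8].

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create synthetic minority samples by interpolating between a minority sample
    and one of its k nearest minority-class neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)             # shape: (n_minority_samples, n_features)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to every minority sample
        neighbours = np.argsort(d)[1:k + 1]            # k nearest, excluding the sample itself
        j = rng.choice(neighbours)
        lam = rng.random()                             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

The synthetic rows would simply be appended to the training set before model fitting.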
In contrast to data-sampling methods, algorithmic approaches for addressing class imbalance do not modify the distribution of the training data. Instead, they adjust the learning or decision-making process, including cost-sensitive learning and ensemble learning. Cost-sensitive learning primarily involves adjusting learning algorithms based on the classification costs of different classes to effectively handle minority class samples [28]. In 2013, He et al. proposed a cost-sensitive learning method based on Gradient-Boosting Decision Trees, effectively improving the accuracy and robustness of models when dealing with imbalanced data [29]. Additionally, research has explored ways to enhance cost-sensitive learning capabilities by integrating deep learning techniques, such as designing loss functions suitable for deep neural networks or hierarchical cost-sensitive mechanisms [30]. Sun et al. [31] presented three cost-sensitive boosting techniques: AdaC1, AdaC2, and AdaC3. These methods introduce cost terms into the weight updates of AdaBoost, iteratively increasing the influence of the minority classes. Sun demonstrated that, in most cases, cost-sensitive boosting ensemble methods outperformed ordinary boosting methods. Wu et al. [32] combined cost-sensitive learning with Support Vector Machines (SVMs), employing a cost-sensitive SVM algorithm to adjust the margin and induce boundary drift, thereby improving the recall rate of minority classes.
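As a small illustration of the cost-sensitive idea, the sketch below weights the cross-entropy loss by inverse class frequency in PyTorch. The class counts are made-up numbers for a 25:1 setting, and this is only one common weighting scheme, not the specific cost functions proposed in [12,29,31,32].

```python
import torch
import torch.nn as nn

# Illustrative class counts for a ten-class set with one 25:1 minority class.
class_counts = torch.tensor([5000., 200., 5000., 5000., 5000.,
                             5000., 5000., 5000., 5000., 5000.])

# Inverse-frequency weights: rare classes contribute more to the loss.
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 10)              # model outputs for a batch of 8 samples
targets = torch.randint(0, 10, (8,))     # ground-truth labels
loss = criterion(logits, targets)        # minority-class errors are penalized more heavily
```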
Ensemble learning is a method that improves predictive performance by combining multiple base classifiers. Early ensemble learning methods, such as Random Forest, which combines bagging and decision trees, emphasize the importance of feature randomness [33]. Gradient-Boosting Machines (GBMs) and their improved versions, such as XGBoost and LightGBM, significantly enhance prediction accuracy through optimized loss functions and enhanced collaboration among learners [34]. Zhong et al. [35] integrated data-level methods with ensemble learning, combining predictions from multiple CNN sub-models to enhance classification performance.
Additionally, there are other methods at the algorithmic research level. For example, Zhou et al. [36] proposed a cumulative learning approach that decomposes class distributions into balanced subsets to address long-tail recognition issues. This method has demonstrated promising outcomes for managing highly imbalanced data. Hensman and Masko [37] studied the impact of training a CNN on multiclass imbalanced data and found that imbalanced data distribution significantly affects CNN performance, particularly with substantial impacts from distributions involving single minority classes and majority classes. These methods have limitations in improving the accuracy of imbalanced data, particularly the accuracy of minority class samples.
Research on parameter optimization specifically for imbalanced data is scarce. Existing studies on parameters have predominantly focused on balanced datasets, such as methods for selecting the number of convolutional kernels [38] and for optimizing CNN hyperparameters using genetic algorithms [39]. The former aims to find the optimal number of kernels to enhance model performance. The latter automates hyperparameter tuning: genetic algorithms simulate natural selection, evolving better hyperparameter combinations through operations such as crossover, mutation, and selection. However, these methods primarily automate the generation of hyperparameters without fully leveraging domain expertise in the CNN field. Incorporating such domain expertise can also be an effective way to improve the classification of minority class samples.
Therefore, despite some attempts, challenges, and unresolved issues, selecting CNN parameters for imbalanced data remains essential. Future research can aim to comprehensively integrate domain expertise in the CNN field with existing optimization techniques to develop more effective parameter selection methods for imbalanced data. This approach seeks to enhance the performance and generalization ability of CNN models for minority class samples. Thus, this paper discusses CNN parameter selection specifically for minority class samples in imbalanced data, with the aim of optimizing existing network structures to achieve ideal classification results for these samples.

3. Background Models

3.1. Convolutional Neural Network

In convolutional neural networks, convolutional layers, pooling layers, and fully connected layers create an efficient framework for feature extraction and classification, as shown in Figure 1. Convolutional layers extract local features from images, pooling layers reduce the size of feature maps and the number of parameters, lowering model complexity, and fully connected layers convert feature maps into one-dimensional vectors for classification. Although stacking multiple convolutional layers and using numerous convolutional kernels can build more complex network structures, it also increases the computational resource requirements and risk of overfitting. Therefore, when designing a CNN, a balance must be struck between network complexity and performance. A deep understanding of CNN structure and parameter selection is crucial for enhancing the performance of image-processing technologies.

3.2. Parameters in CNN Model

Learning rate. The learning rate determines the step size of the parameter updates during training. An overly small learning rate can result in slow convergence and trapping in local minima, whereas an overly large learning rate can cause oscillations and divergence during training. An appropriate learning rate strikes a balance between training speed and model performance.
Dropout rate. The dropout rate parameter of a dropout layer controls the fraction of neurons and their connections that are randomly dropped during training, as shown in Figure 2. Omitting dropout can lead to overfitting on the training set, whereas applying dropout can reduce overfitting and improve the model’s generalization on unseen data. Therefore, the decision to use dropout should be based on the specific circumstances and dataset characteristics to achieve optimal model performance. In our setup, dropout was applied separately after the convolutional layers and after the fully connected layers [39], denoted as dropout rate 1 and dropout rate 2, respectively.
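The placement of the two dropout parameters can be sketched as follows; this is a minimal illustrative network (layer counts and channel widths do not match the five-convolutional-layer architecture in Table 4), assuming 32 × 32 RGB inputs.

```python
import torch.nn as nn

def build_cnn(dropout_rate_1=0.2, dropout_rate_2=0.2, kernel_size=3, n_classes=10):
    """Illustrative two-conv-layer network showing where the two dropout parameters act."""
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size, padding=kernel_size // 2), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Dropout(dropout_rate_1),     # dropout rate 1: after the convolution/pooling block
        nn.Conv2d(32, 64, kernel_size, padding=kernel_size // 2), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Dropout(dropout_rate_1),
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
        nn.Dropout(dropout_rate_2),     # dropout rate 2: after the fully connected layer
        nn.Linear(256, n_classes),
    )
```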
Convolutional kernel size. Convolutional kernel size refers to the dimensions of the small matrix used for feature extraction in the CNN. Adjusting the kernel size can influence the scale and complexity of the features learned by the network, thereby impacting its performance across different tasks.

3.3. Evaluation Metrics

The confusion matrix is essential for evaluating the model performance in classification problems. It is a two-dimensional matrix with rows representing the true classes and columns representing the predicted classes, as shown in Table 1. Each element of the confusion matrix represents the prediction of the model for samples belonging to a certain true class.
True positive (TP) refers to the number of actual positive instances that the model correctly predicts as positive; false positive (FP) refers to the number of negative instances that the model incorrectly predicts as positive; false negative (FN) refers to the number of positive instances that the model incorrectly predicts as negative; and true negative (TN) refers to the number of actual negative instances that the model correctly predicts as negative. Based on these quantities, we can calculate many evaluation metrics for assessing the performance of classification models, including accuracy, recall, F1 score, and specificity.
Accuracy: Accuracy refers to the proportion of samples correctly predicted by the model out of the total number of samples.
$\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
Precision: Precision refers to the proportion of samples that the model correctly predicts as positive out of all samples predicted as positive.
$\text{Precision} = \dfrac{TP}{TP + FP}$
Recall: Recall refers to the proportion of samples that the model correctly predicts as positive among all positive samples.
$\text{Recall} = \dfrac{TP}{TP + FN}$
Specificity: This measures a model’s ability to correctly identify negative instances in a binary classification problem. Specifically, it refers to the proportion of actual negative instances that the model correctly predicts as negative.
$\text{Specificity} = \dfrac{TN}{TN + FP}$
F1 score: F1 score is the harmonic mean of precision and recall and is used for the comprehensive evaluation of model performance.
$F1 = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
G-mean: The G-mean is a composite evaluation metric commonly used to assess the performance of binary classification models on imbalanced datasets.
$\text{G-mean} = \sqrt{\text{Recall} \times \text{Precision}}$
P-mean: The P-mean evaluates the performance of a classifier by computing the geometric mean of recall and specificity. Unlike the F1 score, the P-mean does not combine precision and recall, making it particularly valuable in scenarios where the costs of false positives and false negatives differ.
$\text{P-mean} = \sqrt{\text{Recall} \times \text{Specificity}}$
These evaluation metrics are all based on a binary classification performance evaluation. In this study, to address class imbalance in multiclass problems, we adopted the one-vs.-rest (OvR) [40] approach. This method treats the focus class (minority class samples) as positive. It combines the remaining nine classes into a negative class, transforming the multiclass problem into a series of binary classification problems. By utilizing these traditional binary classification metrics, we can better assess the performance of the model in handling imbalanced multiclass datasets, particularly its ability to recognize minority class samples, and overall classification performance. These metrics provide a comprehensive classifier evaluation across different classes, aiding in optimizing the model parameters to improve the classification accuracy and recall.
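As a small sketch of how these one-vs.-rest metrics can be computed from a multiclass confusion matrix (following the formulas above; the function and variable names are ours, not the paper's):

```python
import numpy as np

def ovr_metrics(conf, cls):
    """One-vs.-rest metrics for class `cls` from a multiclass confusion matrix
    (rows = true classes, columns = predicted classes)."""
    conf = np.asarray(conf, dtype=float)
    tp = conf[cls, cls]
    fn = conf[cls, :].sum() - tp
    fp = conf[:, cls].sum() - tp
    tn = conf.sum() - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "g_mean": np.sqrt(recall * precision),       # geometric means, following the formulas above
        "p_mean": np.sqrt(recall * specificity),
    }

# Usage: for a 10-class problem, ovr_metrics(confusion_matrix, cls=1) treats the class
# with index 1 (illustrative) as positive and the remaining nine classes as negative.
```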

4. Statistical Model

4.1. Fractional Factorial Design

Fractional factorial design is a comprehensive experimental method that involves the cross-grouping of multiple factors at multiple levels. This allows the investigation of the main effects and interaction effects of two or more factors. When studying how to enhance the performance of convolutional neural networks on imbalanced datasets, fractional factorial design can be employed for comparative experiments [41]. This approach can explore the main effects and interaction effects of factors such as learning rate (LR), dropout rate 1 (DR 1), dropout rate 2 (DR 2), and kernel size (KS) on the accuracy, recall, G-mean, and P-mean of the CNN model. Dropout rate 1 and dropout rate 2 refer to the parameters in the dropout layers following the pooling layers and after the fully connected layers, respectively.

4.2. Analysis of Variance

ANOVA is a method used to compare the means of multiple groups by decomposing the total variance into factor and error components in order to test whether significant differences exist among the groups. To validate the results of the fractional factorial design experiment, we used an ANOVA model to determine the impact of the explanatory variables LR, DR 1, DR 2, and KS on the dependent variables, which, in our experiment, were the evaluation metrics.

4.2.1. Fixed-Effects Model

In the fixed-effects model, the levels of the factors are specific levels of interest to the researcher. We assume that these levels are fixed and are only interested in the effects at these particular levels. This model is applicable when assessing the impact of specific treatments or experimental conditions on the response variable.
In this article, we establish the following model:
Let $Y_{ijkl}$ be the observation at the $i$-th level of LR, the $j$-th level of DR 1, the $k$-th level of DR 2, and the $l$-th level of the convolution KS. All observations $Y_{ijkl}$ are mutually independent and follow a normal distribution with a constant variance $\sigma^2$.
We assume that the numbers of levels for the four factors are $r = 5$, $c = 5$, $t = 5$, and $u = 2$, and that each combination has the same number of replicate observations $s > 1$. Therefore, the total number of observations is $n = r \times c \times t \times u \times s$.
The formula for the fixed effects model is as follows:
$Y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_l + (\alpha\beta)_{ij} + \cdots + (\gamma\delta)_{kl} + \epsilon_{ijkl}$
where $\mu$ is the overall mean, $\alpha_i$ is the main effect of the learning rate, $\beta_j$ is the main effect of DR 1, $\gamma_k$ is the main effect of DR 2, $\delta_l$ is the main effect of the convolution kernel size, and $\epsilon_{ijkl}$ is the random error, which follows $N(0, \sigma^2)$. The remaining terms are the two-way interaction effects.
Through permutation tests, we test, for each pair $(i, v)$:
$H_{vi}: e_v^{T} Z_i = 0 \quad \text{vs.} \quad K_{vi}: e_v^{T} Z_i \neq 0$
where $Z_i$ is the aligned data vector at level $i$, and $e_v$ is the orthogonal vector associated with degree $v$, so that $e_v^{T} Z_i$ is the projection of the aligned data vector at level $i$ onto degree $v$. The null hypothesis $H_{vi}$ indicates that there is no significant interaction effect between level $i$ and degree $v$.
When conducting hypothesis tests, we typically use the F-test to evaluate whether the model’s main and interaction effects are significant. The sums of squares for the main effects ($SS_A$, $SS_B$) and the sum of squares for the interaction effect ($SS_{AB}$) measure the influence of the different factors on the response variable.
For example, for LR (A) and DR 1 (B), the formulas for calculating the sum of squares for the main effects and interaction effects are as follows:
$SS_A = s \sum_{i=1}^{r} (\bar{Y}_{i\cdot\cdot} - \bar{Y})^2$
$SS_B = s \sum_{j=1}^{c} (\bar{Y}_{\cdot j \cdot} - \bar{Y})^2$
$SS_{AB} = s \sum_{i=1}^{r} \sum_{j=1}^{c} (\bar{Y}_{ij\cdot} - \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot j \cdot} + \bar{Y})^2$
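A minimal NumPy sketch of these sums of squares for the two-factor example (LR as factor A, DR 1 as factor B), using toy accuracy values in place of the real experimental results:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy layout: r = 5 levels of factor A (LR), c = 5 levels of factor B (DR 1),
# s = 4 replicate runs per cell; values stand in for the measured accuracies.
Y = rng.normal(0.6, 0.02, size=(5, 5, 4))
s = Y.shape[2]

Y_bar = Y.mean()                        # grand mean
Y_i = Y.mean(axis=(1, 2))               # level means of A
Y_j = Y.mean(axis=(0, 2))               # level means of B
Y_ij = Y.mean(axis=2)                   # cell means

SS_A = s * np.sum((Y_i - Y_bar) ** 2)
SS_B = s * np.sum((Y_j - Y_bar) ** 2)
SS_AB = s * np.sum((Y_ij - Y_i[:, None] - Y_j[None, :] + Y_bar) ** 2)
print(SS_A, SS_B, SS_AB)
```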

4.2.2. Estimation of Marginal Means and 95% Confidence Intervals

To provide a more comprehensive interpretation of the effects, we employed the method of estimating marginal means and 95% confidence intervals. This method involves calculating the marginal means for each factor level and providing confidence intervals for these estimates to represent the precision and variability of the effects visually.
The estimates of the marginal means are given by the following:
$\hat{Y}_{A_i} = \hat{\mu} + \hat{\alpha}_i$
$\hat{Y}_{B_j} = \hat{\mu} + \hat{\beta}_j$
$\hat{Y}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \widehat{(\alpha\beta)}_{ij}$
The 95% confidence interval (CI) is given by
$CI = \hat{Y} \pm t_{\alpha/2,\,df} \cdot SE(\hat{Y})$
where $t_{\alpha/2,\,df}$ is the critical value from the t-distribution, and $SE(\hat{Y})$ is the standard error of the estimate, calculated from the within-cell mean square ($MS_W$) of the ANOVA model as follows:
$SE(\hat{Y}_{ij}) = \sqrt{\dfrac{MS_W}{n_{ij}}}$
Through these steps, we can comprehensively analyze the impact of various factors on the performance of the model and further interpret these effects.
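The sketch below computes these cell-mean estimates, their standard errors, and 95% confidence intervals for a toy two-factor layout (the numbers are synthetic; SciPy supplies the t critical value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Toy two-factor layout: 5 x 5 cells with 4 replicate runs per cell.
Y = rng.normal(0.6, 0.02, size=(5, 5, 4))
Y_ij = Y.mean(axis=2)                                   # estimated cell means

n_ij = Y.shape[2]                                       # replicates per cell
df_error = Y.size - Y_ij.size                           # within-cell degrees of freedom
MS_W = np.sum((Y - Y_ij[:, :, None]) ** 2) / df_error   # within-cell mean square
se = np.sqrt(MS_W / n_ij)                               # SE of each cell mean
t_crit = stats.t.ppf(0.975, df_error)                   # t_{alpha/2, df}
ci_low, ci_high = Y_ij - t_crit * se, Y_ij + t_crit * se
```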

5. Experimental Design

This paper aims to provide a detailed explanation of the impact of convolutional neural network parameters on imbalanced data and their optimization, as shown in Figure 3. Next, we describe the functionalities of the proposed framework.

5.1. Dataset Description

5.1.1. CIFAR-10 Dataset

CIFAR-10 is a commonly used dataset for image classification, primarily used for research and algorithm evaluation in the realm of computer vision. The dataset consists of color images categorized into 10 classes, with each class containing 6000 images of size 32 × 32 pixels, totaling 60,000 images. Among these, 50,000 images were used as training samples and 10,000 were used as test samples. These images cover various common object categories in daily life, including airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks, as shown in Figure 4.

5.1.2. Fashion-MNIST Dataset

The Fashion-MNIST dataset, released by Zalando Research, is an image classification benchmark dataset containing 70,000 grayscale images. These images belong to ten different categories of clothing and footwear, including T-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot, as shown in Figure 5. Compared to the traditional MNIST handwritten digit dataset, the Fashion-MNIST dataset is more diverse and challenging, featuring rich textures and shapes that closely resemble actual clothing products. The Fashion-MNIST dataset is widely used in image classification, computer vision, machine learning, and deep learning, and provides a more practical benchmark for evaluating and developing advanced image-processing algorithms.
Because this study explores the impact of neural network parameters on imbalanced data, we extracted varying numbers of samples from ten classes of the CIFAR-10 and Fashion-MNIST datasets to create three datasets with different levels of balance: highly imbalanced, moderately imbalanced, and balanced, with imbalance ratios of 25:1, 15:1, and 1:1, respectively. The number of samples extracted for the training set from both datasets is listed in Table 2, where each class in the test set of both datasets contains 1000 images.

5.2. Fractional Factorial Design

Each dataset may contain hundreds to thousands of eligible convolutional neural network models. The previous analysis identified four parameters that significantly affect imbalanced data: LR, DR 1, DR 2, and KS. LR, DR 1, and DR 2 each have five levels, whereas KS has two levels, labeled A1-A5, B1-B5, C1-C5, and D1-D2, respectively. Table 3 lists each factor and its levels. We conducted a systematic experiment using ANOVA on these factors and levels to explore each factor’s main effects and interactions with respect to the evaluation metrics.
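Enumerating the 5 × 5 × 5 × 2 = 250 parameter combinations (each trained four times) is straightforward; the level values below are placeholders standing in for those in Table 3, which is not reproduced here.

```python
from itertools import product

# Placeholder level values standing in for Table 3 (not reproduced here).
learning_rates = [0.001, 0.0031, 0.013, 0.031, 0.1]
dropout_1 = [0.02, 0.05, 0.1, 0.2, 0.3]
dropout_2 = [0.02, 0.05, 0.1, 0.2, 0.3]
kernel_sizes = [3, 5]

# 5 x 5 x 5 x 2 = 250 parameter combinations, each trained four times.
design = list(product(learning_rates, dropout_1, dropout_2, kernel_sizes))
assert len(design) == 250
runs = [(lr, d1, d2, ks, rep) for (lr, d1, d2, ks) in design for rep in range(4)]
```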
In this paper, we adopt a method based on Lagrange multipliers to calculate the number of convolution kernels in the five convolutional layers [38]. First, we define the number of kernels in each layer as $n_i$, where $i$ denotes the index of the convolution layer. According to the structure of the convolutional neural network, we can compute the total number of learned parameters using the following formula:
$P_{\text{all}} = k \times k \times n_0 \times n_1 + \sum_{i=2}^{N} (k \times k \times n_{i-1} \times n_i) = k^2 \times \sum_{i=1}^{N} (n_{i-1} \times n_i)$
where $P_{\text{all}}$ represents the total number of learned parameters, $k$ is the kernel size, $N$ is the total number of convolution layers, and $n_0$ is the number of input channels.
Next, we construct the optimization problem using the Lagrange function:
$L(n, \lambda_0, \dots, \lambda_i, \dots, \lambda_N) = \sum_{i=1}^{N} \log(n_i) + \lambda_0 \left( \sum_{i=1}^{N} n_{i-1} \times n_i - C \right) + \sum_{i=1}^{N} \lambda_i (n_{i-1} - n_i)$
where C represents the total number of parameters in the network. By calculating the partial derivatives of the Lagrange function, we can determine the optimal configuration of convolution kernels. In this process, we focus on the relationship between the model’s discriminative performance and parameter distribution, ensuring that the selected parameter configuration effectively enhances the ability to recognize minority class samples. This way, we can not only obtain the number of kernels for each layer but also optimize model performance to better address the challenges posed by imbalanced datasets.
We calculated the number of convolutional kernels for the five layers to be 103, 105, 107, 109, and 179. Subsequently, we fixed the number of kernels and employed a fractional factorial design to explore the relationship between the parameters and the evaluation metrics for the minority class samples. The framework structure of the CNN model used in this study is shown in Table 4, and we adjusted the internal parameters as needed based on different requirements.
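The full Lagrange optimization is not reproduced here, but the parameter-budget constraint behind it is easy to check: under the formula above (assuming $k = 3$ and $n_0 = 3$ input channels), the selected kernel counts keep roughly the same total number of convolutional weights as the default configuration used later as the baseline (Model 1), while spreading them more evenly across the layers.

```python
def conv_param_count(kernels, k=3, n0=3):
    """P_all = k^2 * sum(n_{i-1} * n_i), following the formula above (biases ignored)."""
    ns = [n0] + list(kernels)
    return k * k * sum(a * b for a, b in zip(ns[:-1], ns[1:]))

default_cfg = [32, 64, 96, 128, 256]      # baseline kernel counts (Model 1 below)
balanced_cfg = [103, 105, 107, 109, 179]  # counts from the Lagrange-based method

print(conv_param_count(default_cfg))      # parameter budget C of the baseline
print(conv_param_count(balanced_cfg))     # roughly the same budget, more evenly spread
```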
We conducted fractional factorial design experiments and training on three imbalanced datasets extracted from CIFAR-10 and Fashion-MNIST using the same convolutional neural network layers and batch sizes. Each dataset underwent four repeated experiments to minimize randomness, with a total of 250 data groups per experiment. Based on the ANOVA results, we evaluated the effects of different factors and their interactions with various performance metrics, such as accuracy, recall, F1 score, G-mean, and P-mean, focusing on minority class samples.

5.3. Analysis of Variance Results

5.3.1. ANOVA Results under Different Imbalance Ratios

We conducted comprehensive ANOVA experiments to gain a deeper understanding and quantify the impact of different parameters on the CNN model performance across various levels of imbalance. ANOVA was conducted on datasets with imbalance ratios of 25:1, 15:1, and 1:1, showing the main and interaction effects of LR, DR 1, DR 2, and KS on CNN model accuracy.
Table 5 presents partial results from the fractional factorial design experiments conducted on the CIFAR-10 dataset, encompassing a total of 250 data points. For clarity and conciseness, Table 5 displays only rows 1–4, 124–126, and 247–250, representing the overall dataset. Each parameter configuration was tested four times, and the accuracy, F1 score, G-mean, P-mean, and recall values were obtained for the minority class in each experiment. The same experiments were conducted using the Fashion-MNIST dataset. These data were used for variance analysis to assess the main and interaction effects of different factors on model performance. Through repeated experiments, we can better understand the impact of each parameter on the model performance and identify the optimal parameter combination to enhance the model’s classification performance on imbalanced datasets. The data in the table demonstrate the consistency and variability of the results under the same parameter configurations, providing reliable support data for the subsequent variance analysis.
Comparing the results from Table 6, Table 7 and Table 8, it is evident that LR, KS, and their interaction significantly impact the model accuracy across all imbalance ratios. The results indicate that these two parameters are crucial tuning factors across different levels of imbalanced datasets.
In the CIFAR-10 dataset with an imbalance ratio of 25:1, both LR and KS significantly affected model accuracy, with p-values of 6.508 × 10⁻⁸ and 0.036, respectively. In the same dataset, the interaction effect between LR and KS (LR × KS) also showed a significant impact, with an F-value of 27.600 and a p-value of 7.358 × 10⁻¹⁹. The combination of these two parameters significantly influenced the model performance.
LR and KS exhibited significant effects on the CIFAR-10 dataset with an imbalance ratio of 15:1, with p-values of 1.622 × 10⁻¹¹ and 4.925 × 10⁻⁷, respectively. The interaction effect between LR and KS (LR × KS) was particularly noteworthy, with an F-value of 38.768 and a p-value of 5.079 × 10⁻²⁵, indicating a highly significant impact.
In the balanced CIFAR-10 dataset, LR and KS still significantly affected accuracy, with p-values of 1.125 × 10⁻⁷ and 4.982 × 10⁻¹⁰, respectively. Furthermore, the interaction effect between LR and KS (LR × KS) also demonstrated a significant impact, with an F-value of 209.146 and a p-value of 5.673 × 10⁻¹⁷. However, DR 1 and DR 2 did not exhibit significant main or interaction effects on any dataset, with p-values greater than 0.1.
Just as we conducted experiments on the CIFAR-10 dataset, we carried out the same experiments on the Fashion-MNIST dataset and achieved similar results. Across different imbalance ratios, we found significant main effects and interaction effects of LR and KS on the CNN model accuracy. In the Fashion-MNIST dataset with an imbalance ratio of 25:1, LR and KS exhibited p-values of 4.341 × 10⁻¹⁰ and 9.776 × 10⁻⁴, respectively. The interaction effect (LR × KS) showed an F-value of 27.322 and a p-value of 1.069 × 10⁻¹⁸. Under the 15:1 imbalance ratio, LR and KS had p-values of 1.338 × 10⁻⁸ and 8.591 × 10⁻⁴, respectively, with an interaction effect (LR × KS) p-value of 3.810 × 10⁻⁸ and F-value of 13.800. In the balanced dataset, LR and KS significantly influenced the CNN model accuracy, with p-values of 3.601 × 10⁻⁹ and 1.184 × 10⁻³, respectively. The interaction effect (LR × KS) had an F-value of 12.222, although its impact was slightly diminished compared with that in the highly imbalanced dataset. These results underscore that the observed effects are not incidental but robust validations of their impact on model performance.
The previous section demonstrates the significant impact of various parameters on accuracy. To further explore the effects of these parameters on recall under different imbalance ratios, we conducted an analysis of variance to obtain the p-values for the main effects and interaction effects of learning rate, dropout rate 1, dropout rate 2, and kernel size on the accuracy of minority class samples. Due to the substantial differences in the p-values, we standardized each p-value using the following formula:
$\text{Normalized Value} = \dfrac{x - \text{Minimum}}{\text{Maximum} - \text{Minimum}}$
This standardization process ensures that the effects of different parameters at various imbalance ratios can be compared on the same scale, thereby guaranteeing fairness and accuracy in the data analysis. By employing this method, we were able to illustrate and compare the substantial impact of various parameters on recall.
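In code, this min-max scaling of the p-values is a one-liner (a NumPy sketch; the example values are placeholders):

```python
import numpy as np

def min_max_normalize(values):
    """Min-max scaling onto [0, 1], as in the formula above."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

# e.g. min_max_normalize([6.5e-8, 9.8e-4, 0.036, 0.42]) for a set of p-values
```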
As can be seen from Figure 6, the impact of different parameters on the model performance varied across different imbalance ratios. In highly imbalanced datasets, LR, KS, and their interaction effects significantly influenced the model. In moderately imbalanced datasets, the effects of LR and KS were more pronounced. In balanced datasets, the interaction effects of LR and KS (LR × KS) and the interaction effects of DR 1 and KS (DR 1 × KS) significantly impacted model performance. Based on these findings, it is possible to more precisely select and adjust the parameters of CNN models to improve their performance on imbalanced data.
In summary, LR and KS significantly impacted the overall model performance across all imbalanced datasets, especially the highly imbalanced ones, and the interaction between these two parameters also significantly influenced the model’s performance. In balanced datasets, LR and KS still significantly affected the model performance, although their effects were less pronounced than in the highly imbalanced datasets, whereas DR 1 and DR 2 had only a minor impact on model performance. In subsequent optimizations, attention should be focused on LR and KS, and their optimal combinations should be explored to enhance the performance of CNN models on imbalanced datasets.

5.3.2. Detailed Analysis of 25:1 Imbalanced Data on the CIFAR-10 Dataset

To gain a deeper understanding of the impact of the parameters on extremely imbalanced datasets, we conducted a detailed analysis of data with an imbalance ratio of 25:1. At this ratio, the model performance is susceptible to parameter settings, making the identification of critical parameters crucial for optimizing the model performance.
Because the independent variables that significantly impact the evaluation metrics are LR and KS, we do not discuss the main effect results of the dropout rate in detail. Figure 7 shows the main effects of LR on various metrics, highlighting the significant impact of LR on each metric. The best performance for each metric occurred when LR ranged from 0.001 to 0.031, peaking at approximately 0.031. Therefore, selecting an appropriate LR is crucial for imbalanced data classification and can significantly enhance various metrics.
Figure 8 shows the overall impact of KS on various evaluation metrics for imbalanced and balanced data. The results indicate that smaller kernel sizes perform better in terms of accuracy, F1 score, G-mean, and P-mean, reflecting better overall balanced performance. On the other hand, larger kernel sizes significantly improve recall, but cause a decline in other performance metrics; thus, there is a significant trade-off when choosing the kernel size for imbalanced data classification. The selection of the most suitable kernel configuration should consider the specific performance requirements of the application and the importance of the different evaluation metrics.
Figure 9 illustrates the interaction effects of different parameter combinations on the accuracy of imbalanced data classification. In the first subplot, the x-axis represents the LR, the y-axis represents the mean accuracy, and the colors denote different levels of DR 1. It can be observed that accuracy increases as the learning rate increases from 0.001 to 0.031 but declines sharply at a learning rate of 0.1. The impact of different DR 1 levels on the accuracy is relatively minor. As in the previous subplot, the effect of different DR 2 levels on the accuracy is also relatively small. Additionally, the combined effects of DR 1 and DR 2 on the accuracy are complex.
In summary, LR and KS are the primary factors that significantly affect the various performance metrics, while DR 1, DR 2, and their interactions have a minor impact on these metrics. To optimize the performance, the focus should be on the learning rate, the kernel size, and their interaction. LR = 0.01, a moderate learning rate, provides an adequate gradient update step size during model training, allowing the model to effectively learn the features of the training data while avoiding the oscillations and instability caused by overly large gradient steps. DR 1 = 0.02 and DR 2 = 0.02, relatively low dropout rates, help prevent overfitting while ensuring the model’s training efficiency and generalization ability; these two parameters enhance the robustness of the model while maintaining its simplicity. KS = 3 × 3, a smaller kernel size, allows the model to capture detailed image features better, thereby improving its ability to recognize minority class samples; the smaller kernel also reduces the number of parameters, thereby lowering the computational complexity. In the dataset with an imbalance ratio of 25:1, using fractional factorial design and ANOVA, we identified the optimal parameter combination as A3B2C2D1.

5.3.3. Detailed Analysis of 25:1 Imbalanced Data on the Fashion-MNIST Dataset

To validate the model’s reliability, enhance the credibility of the research, and demonstrate its performance across various scenarios, we conducted experiments using 25:1 imbalanced data extracted from the Fashion-MNIST dataset. Based on the setup in Table 3, we considered four factors: LR, DR 1, DR 2, and KS. The first three factors each had five levels, labeled A1-A5, B1-B5, and C1-C5, whereas KS had two levels (D1 and D2). We performed an ANOVA to evaluate the main effects of these parameters and their interaction effects on the minority class samples. The experimental design included 250 experiments across the four factors, each repeated four times to ensure the robustness and accuracy of the results.
Based on the ANOVA results for the Fashion-MNIST dataset, as shown in Figure 10, the optimal parameter combination A2B3C3D1 was determined through an analysis of the interaction effects of LR, DR 1, DR 2, and KS. Specifically, with a learning rate of 0.031 and both dropout rate 1 and dropout rate 2 set to 0.05, the model exhibited higher and more stable accuracy, and its average accuracy under the different combinations was relatively good and stable. When the kernel size was 3, the model performed most stably across the various parameter combinations. In summary, the optimal scheme is a learning rate of 0.031, dropout rate 1 of 0.05, dropout rate 2 of 0.05, and kernel size of 3. This combination ensures the model performance while reducing the risk of overfitting and instability.

5.4. Parameter Optimization

The basic framework of the models we used included five convolutional layers, each with a 3 × 3 kernel size, three max-pooling layers, and three dropout layers, as shown in Table 4. In our experiments, we designed five different CNN models to explore the effect of parameter tuning on model performance on an imbalanced dataset. We compared these with a traditional undersampling model for handling imbalanced data.
Model 1: This is the baseline model using the default number of convolution kernels (32, 64, 96, 128, 256), a default learning rate of 0.001, kernel size of 3 × 3, and dropout rate of 0.2.
Model 2: Based on Model 1, the number of convolution kernels is adjusted to 103, 105, 107, 109, and 179 while the other settings are unchanged.
Model 3: Similar to Model 2 but with a kernel size of 5 × 5, keeping the other parameters the same.
Model 4: Based on the baseline model, this model adjusts both the number of convolution kernels and learning rate while the other parameters remain at their default values.
Model 5: The adjustments made in Models 3 and 4 are combined by changing the convolution kernel number, size, and learning rate.
The models were trained using the cross-entropy loss function, and evaluated based on five metrics: accuracy, recall, F1 score, G-mean, and P-mean. These metrics were employed to thoroughly assess the performance across each data category.
During the training process, the number of epochs was not fixed. Instead, we set a stopping condition; if the difference in training accuracy between the current epoch and the previous epoch was less than 1 × 10−2 (i.e., 0.01), the training stopped. This approach helped avoid overtraining the model once a certain accuracy level was reached.
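A sketch of this stopping rule is shown below; `train_one_epoch` is a placeholder for the usual training loop and is assumed to return the epoch's training accuracy, so the names here are illustrative rather than the authors' code.

```python
def train_one_epoch(model, loader, optimizer, criterion):
    """Placeholder for the usual one-epoch training loop; assumed to return
    the training accuracy achieved in that epoch."""
    raise NotImplementedError

def train_until_plateau(model, loader, optimizer, criterion, tol=1e-2):
    """Stop once the epoch-to-epoch gain in training accuracy drops below tol (0.01)."""
    prev_acc, epoch = 0.0, 0
    while True:
        epoch += 1
        acc = train_one_epoch(model, loader, optimizer, criterion)
        if abs(acc - prev_acc) < tol:
            return epoch, acc
        prev_acc = acc
```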
To optimize the CNN’s performance on the minority class samples, we employed the two significant parameters mentioned above: LR and KS. By iteratively adjusting these parameters, we identified the optimal network configuration that performs best in terms of primary and interaction effects, enhancing the recognition capability of minority class samples.

5.4.1. Results of CIFAR-10 Dataset

Table 9 presents the final experimental results, with accuracy representing the overall accuracy and the remaining metrics evaluating the performance of minority class samples. Based on the analysis of the experimental results, we observed that modifying the kernel size and learning rate had a significant impact on the CNN model’s performance, especially with imbalanced datasets. Model 1, serving as the baseline model, showed lower metrics, particularly recall and F1 score, at 0.22 and 0.36, respectively, indicating poor performance on minority class samples. Although Model 2’s overall accuracy decreased to 0.49, its recall significantly improved to 0.16, demonstrating an enhanced ability to recognize minority class samples. Model 3, with the kernel size adjusted to 5 × 5, significantly improved the overall accuracy, recall, F1 score, and G-mean, reaching 0.65, 0.33, 0.72, and 0.78, respectively, indicating a substantial improvement in recognition and accuracy for the minority class samples. Model 4 adjusted only the learning rate, achieving a 0.67 overall accuracy and improved performance on minority class samples, particularly in F1 score and G-mean, with results of 0.71 and 0.76, respectively. Model 5, which combined adjustments to the number of kernels, kernel size, and learning rate, achieved performance metrics close to those of Model 4 and slightly better than those of Model 3, with a recall of 30.96% and an F1 score of 0.71. The results demonstrate that simultaneously adjusting multiple parameters can yield better performance, as shown in Figure 11.
Based on the analysis of variance and tuning process described above, the optimal configuration for a dataset with an imbalance ratio of 25:1 was confirmed to be (0.013, 0.2, 0.2, 3 × 3).
Table 10 presents the performance metrics of the parameter-optimized CNN model, Model 1 with undersampling, and Model 1 with SMOTE for the CIFAR-10 dataset for each class. Compared to the standard data-processing methods, the parameter-optimized model demonstrated superior performance in most categories in terms of recall, F1 score, G-mean, and P-mean, especially for the minority class. In this study, the minority class corresponds to the automobile category. For the minority class, the recall, F1 score, G-mean, and P-mean of Model 5 reached 0.31, 0.71, 0.76, and 0.94, respectively, whereas Model 1 with undersampling only achieved 0.23, 0.37, 0.48, and 0.52, respectively. The results indicate that, under a 25:1 imbalance ratio, optimizing the parameter selection improves the classification performance of the minority class, highlighting the importance of optimizing parameter selection in handling imbalanced data scenarios.
Compared to undersampling, Model 1 with SMOTE showed mixed results for the minority class. The SMOTE resulted in a lower recall than undersampling, only reaching 0.12, though it slightly improved the F1 score to 0.22. Despite the slight improvement in F1 score, the G-mean and P-mean values for the SMOTE were still significantly lower than those of the parameter-optimized Model 5. This indicates that, while the SMOTE’s synthetic samples provide some improvements in certain metrics, its effectiveness in enhancing recall for the minority class is limited. This further underscores that, in cases of extreme imbalance, optimizing model parameters is more effective than solely relying on SMOTE oversampling.
In terms of the G-mean and P-mean metrics, Model 5 demonstrated a more balanced and consistent performance. For the other classes, Model 5 achieved a recall rate of 0.90 for the airplane category, compared to 0.87 for Model 1 with undersampling. In the frog category, Model 5 had an F1 score of 0.77, while Model 1 with undersampling scored 0.70.
Figure 12 illustrates this phenomenon. Compared with using undersampling data processing alone (red region), the model with parameter optimization (blue region) significantly improved the metrics for minority classes (such as the car category). Recall rate, F1 score, G-mean, and P-mean were significantly enhanced. The results indicate that parameter optimization notably improved the model’s ability to identify minority class samples. Furthermore, this method not only improved the performance of minority class samples, but also maintained high performance across other classes and, in some cases, showed improvement.
This further underscores the importance of parameter selection, particularly when dealing with minority class samples. Optimizing the parameter settings can significantly enhance the recognition capabilities and classification performance, resulting in better performance on imbalanced datasets.

5.4.2. Results of Fashion-MNIST Dataset

The same methodology was applied to the Fashion-MNIST dataset. Based on the ANOVA results, the optimal configuration for the 25:1 imbalanced dataset was determined to be (0.031, 0.05, 0.05, 3 × 3). The results obtained from parameter optimization, specifically, the results of Model 5, were compared with the performance of the five models defined above. Figure 13 and Table 11 present the final results of the five models. The experimental findings demonstrate that adjusting the number of convolution kernels, their size, and the learning rate affects the performance of CNN models, especially in handling imbalanced datasets. Model 1 showed poor performance on minority class samples with a recall of 0.28 and an F1 score of 0.42. Model 2 adjusted the number of convolution kernels and improved the recall to 0.31, with a slight increase in overall accuracy. Model 3 further changed the kernel size to 5 × 5, maintaining an accuracy of 0.84 but experiencing slight decreases in the recall and F1 score. Model 4, adjusting the learning rate, showed improved performance with a recall of 0.28 and an F1 score of 0.41. Model 5, which combined adjustments to kernel number, size, and learning rate, achieved the best performance with an overall accuracy of 0.87, recall of 0.38, and F1 score of 0.50. These results are consistent with those of the experiments on the CIFAR-10 dataset, highlighting the significant impact of parameter selection on the recognition performance of minority class samples.
Table 12 presents the performance metrics for each class for the Fashion-MNIST dataset for Model 5 and the undersampled model, with bold text representing the results of Model 5. Both T-shirt and shirt are minority samples. Model 5 performed exceptionally well for class 1, with a recall of 0.97, an F1 score of 0.78, a G-mean of 0.95, and a P-mean of 0.77. In contrast, for class 6, the undersampled model achieved a recall and F1 score of 0.46 and 0.53, respectively, while these metrics significantly improved to 0.88 and 0.61 through parameter adjustments in Model 5.
Specifically, the data in the table show that, although class 6 performed poorly after undersampling, its recall and F1 score significantly improved through parameter adjustments in Model 5. The results indicate that appropriate adjustments to the number and size of convolution kernels and the learning rate can effectively enhance the recognition performance of minority samples (especially class 6), thereby improving the model’s overall performance when handling imbalanced datasets.
In summary, Table 12 and Figure 14 indicate that Model 5, through adjustments to the number and size of convolution kernels and learning rate, significantly improved the performance of minority class samples in imbalanced datasets. This improvement was particularly notable for class 6, the shirt category. Model 5 demonstrated superior overall performance in terms of recall, F1 score, G-mean, and P-mean compared with the undersampling method, highlighting the effectiveness of parameter adjustments in enhancing the recognition performance of minority class samples.

6. Conclusions

This study aimed to investigate the impact of CNN parameter distributions on imbalanced data and to improve the classification performance of minority class samples by optimizing the CNN model parameters. Quantitative evaluation metrics included accuracy, recall, F1 score, G-mean, and P-mean. We found that parameter selection significantly influences the classification of imbalanced data, and we specifically optimized additional hyperparameters of the CNN model for the CIFAR-10 and Fashion-MNIST datasets with a maximum imbalance ratio of 25:1. The results demonstrate that the learning rate and kernel size notably affect the performance of the CNN model for imbalanced data. In CIFAR-10, adjusting the learning rate to 0.013 and setting the kernel size to 3 × 3, and, in Fashion-MNIST, adjusting the learning rate to 0.031 and setting the kernel size to 3 × 3, led to improvements in accuracy of 14.20% and 3.87%, respectively, compared to traditional undersampling methods. For CIFAR-10, the recall, F1 score, G-mean, and P-mean of the minority class samples improved by 8%, 34%, 28%, and 42%, respectively, compared to the undersampling methods. For Fashion-MNIST, these metrics improved by 12%, 9%, 10%, and 4%, respectively. These findings underscore the importance of parameter selection in handling imbalanced data; in particular, kernel size and learning rate adjustments play a crucial role in recognizing minority class samples.
Despite achieving significant results, this study requires further exploration of parameter optimization methods. It focused solely on a single CNN model and did not explore other architectures. Considering the relatively scarce research on deep learning methods in the context of imbalanced and large datasets, further exploration will significantly advance deep learning methods in big data analytics. Future research on deep learning will likely continue in this challenging direction and attract continued attention from academia and industry. For more details, please refer to our GitHub project: https://github.com/2221335/Ao (accessed on 27 September 2024).

Author Contributions

Conceptualization, N.W.; methodology, N.W.; software, R.Z.; validation, R.Z. and N.W.; formal analysis, R.Z.; investigation, R.Z.; writing—original draft preparation, R.Z.; writing—review and editing, N.W.; visualization, R.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Project of “Double First-Class” Disciplines in Heilongjiang Province, grant number LJGXCG2022-057.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used for this study are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2012; pp. 1097–1105. [Google Scholar]
  2. Seera, M.; Lim, C.P. A hybrid intelligent system for medical data classification. Expert Syst. Appl. 2014, 41, 2239–2249. [Google Scholar] [CrossRef]
  3. Alam, T.M.; Shaukat, K.; Khan, W.A.; Hameed, I.A.; Almuqren, L.A.; Raza, M.A.; Aslam, M.; Luo, S. An efficient deep learning-based skin cancer classifier for an imbalanced dataset. Diagnostics 2022, 12, 2115. [Google Scholar] [CrossRef]
  4. Awoyemi, J.O.; Adetunmbi, A.O.; Oluwadare, S.A. Credit card fraud detection using machine learning techniques: A comparative analysis. In Proceedings of the 2017 International Conference on Computing Networking and Informatics (ICCNI), Lagos, Nigeria, 29–31 October 2017; IEEE: New York, NY, USA, 2017; pp. 1–9. [Google Scholar]
  5. Sun, Y.; Zhao, T.; Zou, Z.; Chen, Y.; Zhang, H. Imbalanced data fault diagnosis of hydrogen sensors using deep convolutional generative adversarial network with convolutional neural network. Rev. Sci. Instrum. 2021, 92, 095007. [Google Scholar] [CrossRef]
  6. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27. [Google Scholar] [CrossRef]
  7. Drummond, C.; Holte, R.C. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Proceedings of the Workshop on Learning from Imbalanced Datasets II, Washington, DC, USA, 21 August 2003; Volume 11. [Google Scholar]
  8. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  9. Elkan, C. The foundations of cost-sensitive learning. In Proceedings of the International Joint Conference on Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; Lawrence Erlbaum Associates Ltd.: Mahwah, NJ, USA, 2001; Volume 17, pp. 973–978. [Google Scholar]
  10. Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man, Cybern. Part C (Appl. Rev.) 2011, 42, 463–484. [Google Scholar] [CrossRef]
  11. Fernández, A.; Garcia, S.; Herrera, F.; Chawla, N.V. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Intell. Res. 2018, 61, 863–905. [Google Scholar] [CrossRef]
  12. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  13. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  14. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  15. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  16. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  18. Wang, S.; Yu, Q.; Zhang, X.; Liu, B.; Zhang, C. Learning from imbalanced data: A review and experimental study. ACM Comput. Surv. (CSUR) 2017, 50, 1–35. [Google Scholar]
  19. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
  20. Das, M.N.; Giri, N.C. Design and Analysis of Experiments; New Age International: Delhi, India, 1979. [Google Scholar]
  21. Lujan-Moreno, G.A.; Howard, P.R.; Rojas, O.G.; Montgomery, D.C. Design of experiments and response surface methodology to tune machine learning hyperparameters, with a random forest case-study. Expert Syst. Appl. 2018, 109, 195–205. [Google Scholar] [CrossRef]
  22. Garofalo, S.; Giovagnoli, S.; Orsoni, M.; Starita, F.; Benassi, M. Interaction effect: Are you doing the right thing? PLoS ONE 2022, 17, e0271668. [Google Scholar] [CrossRef]
  23. Chawla, N.V.; Japkowicz, N.; Kotcz, A. Special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 2004, 6, 1–6. [Google Scholar] [CrossRef]
  24. Shin, K.; Han, J.; Kang, S. MI-MOTE: Multiple imputation-based minority oversampling technique for imbalanced and incomplete data classification. Inf. Sci. 2021, 575, 80–89. [Google Scholar] [CrossRef]
  25. Wang, X.; Xu, J.; Zeng, T.; Jing, L. Local distribution-based adaptive minority oversampling for imbalanced data classification. Neurocomputing 2021, 422, 200–213. [Google Scholar] [CrossRef]
  26. Seliya, N.; Khoshgoftaar, T.M.; Van Hulse, J. A study on the relationships of classifier performance metrics. In Proceedings of the 2009 21st IEEE International Conference on Tools with Artificial Intelligence, Newark, NJ, USA, 2–4 November 2009; IEEE: New York, NY, USA, 2009; pp. 59–66. [Google Scholar]
  27. Mani, I.; Zhang, I. kNN approach to unbalanced data distributions: A case study involving information extraction. In Proceedings of the Workshop on Learning from Imbalanced Datasets, Washington, DC, USA, 21 August 2003; ICML: Vienna, Austria, 2003; Volume 126, pp. 1–7. [Google Scholar]
  28. Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef]
  29. He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; IEEE: New York, NY, USA, 2008; pp. 1322–1328. [Google Scholar]
  30. Huang, C.; Li, Y.; Loy, C.C.; Tang, X. Learning deep representation for imbalanced classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5375–5384. [Google Scholar]
  31. Mease, D.; Wyner, A.J.; Buja, A. Boosted classification trees and class probability/quantile estimation. J. Mach. Learn. Res. 2007, 8, 409–439. [Google Scholar]
  32. Wu, C.; Wang, N.; Wang, Y. Increasing Minority Recall Support Vector Machine Model for Imbalanced Data Classification. Discret. Dyn. Nat. Soc. 2021, 2021, 6647557. [Google Scholar] [CrossRef]
  33. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  34. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  35. Zhong, X.; Wang, N. Ensemble learning method based on CNN for class imbalanced data. J. Supercomput. 2024, 80, 10090–10121. [Google Scholar] [CrossRef]
  36. Zhou, B.; Cui, Q.; Wei, X.S.; Chen, Z.M. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9719–9728. [Google Scholar]
  37. Ilham, A.; Silva, J.; Mercado-Caruso, N.; Tapias-Ruiz, D.; Lezama, O.B.P. Impact of class imbalance on convolutional neural network training in multi-class problems. In Proceedings of the Image Processing and Capsule Networks: ICIPCN 2020, Bangkok, Thailand, 6–7 May 2020; Springer: Berlin/Heidelberg, Germany, 2021; pp. 309–318. [Google Scholar]
  38. Liao, L.; Zhao, Y.; Wei, S.; Wei, Y.; Wang, J. Parameter distribution balanced CNNs. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 4600–4609. [Google Scholar] [CrossRef] [PubMed]
  39. Yoo, J.H.; Yoon, H.i.; Kim, H.G.; Yoon, H.S.; Han, S.S. Optimization of hyper-parameter for CNN model using genetic algorithm. In Proceedings of the 2019 1st International Conference on Electrical, Control and Instrumentation Engineering (ICECIE), Kuala Lumpur, Malaysia, 25 November 2019; IEEE: New York, NY, USA, 2019; pp. 1–6. [Google Scholar]
  40. Temraz, M.; Keane, M.T. Solving the class imbalance problem using a counterfactual method for data augmentation. Mach. Learn. Appl. 2022, 9, 100375. [Google Scholar] [CrossRef]
  41. Nasiri, H.; Alavi, S.A. A Novel Framework Based on Deep Learning and ANOVA Feature Selection Method for Diagnosis of COVID-19 Cases from Chest X-Ray Images. Comput. Intell. Neurosci. 2022, 2022, 4694567. [Google Scholar] [CrossRef]
Figure 1. The working principle of a CNN.
Figure 2. Dropout layer working principle.
Figure 3. The framework proposed in this article.
Figure 4. Images from the CIFAR-10 dataset. The dataset consists of images divided into 10 unique classes, each with 5000 training images and 1000 test images. These are small-resolution RGB color images with a size of 32 × 32 pixels.
Figure 5. Images from the Fashion-MNIST dataset. The dataset consists of images divided into 10 unique classes. These images are grayscale with a resolution of 28 × 28.
Figure 6. Effect of KS on recall in datasets with different imbalance ratios from the CIFAR-10 dataset.
Figure 7. Plot of main effect of learning rate on evaluation metrics for the CIFAR-10 dataset.
Figure 8. Plot of main effect of KS on evaluation metrics in highly imbalanced and balanced datasets from the CIFAR-10 dataset (IB stands for imbalanced data, and B stands for balanced data).
Figure 9. Plot of interaction effect of four variables on evaluation metrics for the CIFAR-10 dataset.
Figure 10. Plot of interaction effect of four variables on evaluation metrics for the Fashion-MNIST dataset.
Figure 11. Performance metrics of minority class samples across the five models for the CIFAR-10 dataset.
Figure 12. Evaluation metrics of Model 5 compared with the undersampling model on various classes in the CIFAR-10 dataset.
Figure 13. Performance metrics box plot of the Fashion-MNIST dataset across the five models, with model highlights (the red line represents the median).
Figure 14. Evaluation metrics of Model 5 compared with the undersampling model on various classes in the Fashion-MNIST dataset.
Table 1. Confusion matrix of actual vs. predicted classifications.

                      Predicted Positive    Predicted Negative
  Actual Positive     TP                    FN
  Actual Negative     FP                    TN
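All evaluation metrics used in this paper can be derived from this confusion matrix. The following is a minimal sketch (not the authors' released code) of how accuracy, recall, precision, F1 score, and G-mean follow from the four cell counts; the G-mean shown (the geometric mean of sensitivity and specificity) is one common convention and is an assumption rather than the paper's exact definition, and P-mean is omitted for the same reason.

# Minimal sketch: deriving evaluation metrics from the binary confusion matrix
# in Table 1. The G-mean here is the geometric mean of sensitivity and
# specificity, a common convention (assumed, not taken from the paper).
def metrics_from_confusion(tp: int, fn: int, fp: int, tn: int) -> dict:
    recall = tp / (tp + fn)                 # sensitivity, true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = (recall * specificity) ** 0.5
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "g_mean": g_mean}

# Example: a classifier that misses many minority (positive) samples.
print(metrics_from_confusion(tp=35, fn=65, fp=20, tn=2480))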
Table 2. The number of samples extracted from each class in both the CIFAR-10 dataset and the Fashion-MNIST dataset is identical.

  Class   0      1      2      3      4      5      6      7      8      9
  25:1    5000   200    500    1500   350    800    200    400    1000   3000
  15:1    5000   330    500    1500   450    850    350    400    1000   3000
  1:1     5000   5000   5000   5000   5000   5000   5000   5000   5000   5000
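To make the sampling scheme concrete, the following is a minimal, illustrative sketch (not the authors' released code) of how an imbalanced subset with the 25:1 per-class counts above could be drawn from the CIFAR-10 training set; the fixed random seed and the use of the tensorflow.keras dataset loader are assumptions made for the example.

# Minimal sketch: drawing an imbalanced CIFAR-10 subset with the 25:1 counts of Table 2.
import numpy as np
from tensorflow.keras.datasets import cifar10

counts_25_1 = [5000, 200, 500, 1500, 350, 800, 200, 400, 1000, 3000]

(x_train, y_train), _ = cifar10.load_data()
y_train = y_train.ravel()

rng = np.random.default_rng(0)            # fixed seed, an illustrative choice
keep = []
for cls, n in enumerate(counts_25_1):
    idx = np.where(y_train == cls)[0]     # all training indices of this class
    keep.append(rng.choice(idx, size=n, replace=False))
keep = np.concatenate(keep)

x_imb, y_imb = x_train[keep], y_train[keep]
print(x_imb.shape, np.bincount(y_imb))    # (12950, 32, 32, 3) and the per-class counts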
Table 3. Different levels of each factor.

  Parameter   Low     Lower   Medium   Higher   High    Note (Factor Levels)
  LR (A)      0.001   0.003   0.01     0.032    0.1     5 levels
  DR 1 (B)    0.01    0.02    0.05     0.11     0.25    5 levels
  DR 2 (C)    0.01    0.02    0.05     0.11     0.25    5 levels
  KS (D)      3 × 3   5 × 5   —        —        —       2 levels
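Table 3 defines a full factorial design: 5 learning rates × 5 Dropout_1 rates × 5 Dropout_2 rates × 2 kernel sizes, i.e., 250 parameter combinations, matching the 250 runs listed in Table 5. Below is a minimal sketch of enumerating that grid; the iteration order is chosen so that DR 2 varies fastest and KS slowest, as in Table 5, and the code is otherwise illustrative rather than the authors' implementation.

# Minimal sketch: the 5 x 5 x 5 x 2 full-factorial grid implied by Table 3.
from itertools import product

learning_rates = [0.001, 0.003, 0.01, 0.032, 0.1]    # LR (A)
dropout_1      = [0.01, 0.02, 0.05, 0.11, 0.25]      # DR 1 (B)
dropout_2      = [0.01, 0.02, 0.05, 0.11, 0.25]      # DR 2 (C)
kernel_sizes   = [3, 5]                               # KS (D): 3 x 3 or 5 x 5

# product() varies the last factor fastest, so DR 2 cycles first and KS last.
design = [(lr, dr1, dr2, ks)
          for ks, lr, dr1, dr2 in product(kernel_sizes, learning_rates, dropout_1, dropout_2)]
print(len(design))                     # 250 runs, as in Table 5
print(design[0], design[125])          # first runs for KS = 3 and KS = 5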
Table 4. CNN model structure.

  Layers      Configuration
  Conv_1      32 filters, kernel, ReLU
  MaxPool_1   2 × 2 kernel
  Dropout_1   20%
  Conv_2      64 filters, kernel, ReLU
  MaxPool_2   2 × 2 kernel
  Dropout_2   20%
  Conv_3      32 filters, kernel, ReLU
  Conv_4      32 filters, kernel, ReLU
  Conv_5      32 filters, kernel, ReLU
  MaxPool_3   2 × 2 kernel
  Dropout_3   20%
  Fully_1     256 neurons, ReLU
  Fully_2     10 neurons, SoftMax
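A minimal Keras sketch of the layer stack in Table 4 follows; it is illustrative, not the authors' released code. The convolution kernel size ks and the three dropout rates are exposed as arguments because they are the factors varied in Table 3; the "same" padding and the default values shown are assumptions.

# Minimal sketch of the CNN in Table 4 (illustrative; padding and defaults assumed).
from tensorflow.keras import layers, models

def build_cnn(ks=3, dr1=0.2, dr2=0.2, dr3=0.2, input_shape=(32, 32, 3), n_classes=10):
    # Layer names in the comments follow Table 4.
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (ks, ks), activation="relu", padding="same"),  # Conv_1
        layers.MaxPooling2D((2, 2)),                                     # MaxPool_1
        layers.Dropout(dr1),                                             # Dropout_1
        layers.Conv2D(64, (ks, ks), activation="relu", padding="same"),  # Conv_2
        layers.MaxPooling2D((2, 2)),                                     # MaxPool_2
        layers.Dropout(dr2),                                             # Dropout_2
        layers.Conv2D(32, (ks, ks), activation="relu", padding="same"),  # Conv_3
        layers.Conv2D(32, (ks, ks), activation="relu", padding="same"),  # Conv_4
        layers.Conv2D(32, (ks, ks), activation="relu", padding="same"),  # Conv_5
        layers.MaxPooling2D((2, 2)),                                     # MaxPool_3
        layers.Dropout(dr3),                                             # Dropout_3
        layers.Flatten(),
        layers.Dense(256, activation="relu"),                            # Fully_1
        layers.Dense(n_classes, activation="softmax"),                   # Fully_2
    ])

model = build_cnn(ks=3)
model.summary()

For the Fashion-MNIST experiments, the same stack would simply take input_shape=(28, 28, 1), matching the 28 × 28 grayscale images described in Figure 5.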
Table 5. Partial performance metrics and parameter values for each run of the CIFAR-10 dataset with an imbalance ratio of 25:1 (each run lists four replicate measurements; factor levels are shown only on the first row of a run).

  Count  LR  DR 1  DR 2  KS   Accuracy (Y_A)  F1 Score (Y_F)  G-Mean (Y_G)  P-Mean (Y_P)  Recall (Y_R)
  1      1   1     1     1    0.559           0.562           0.538         0.543         0.708
                              0.549           0.554           0.530         0.533         0.701
                              0.562           0.561           0.539         0.530         0.703
                              0.554           0.533           0.533         0.533         0.706
  2      1   1     2     1    0.592           0.555           0.543         0.538         0.711
                              0.550           0.552           0.530         0.527         0.702
                              0.555           0.554           0.543         0.543         0.710
                              0.552           0.552           0.543         0.538         0.707
  3      1   1     3     1    0.555           0.559           0.533         0.543         0.704
                              0.543           0.556           0.518         0.539         0.693
                              0.555           0.554           0.533         0.527         0.707
                              0.552           0.556           0.533         0.543         0.704
  4      1   1     4     1    0.552           0.537           0.535         0.507         0.705
                              0.542           0.536           0.518         0.520         0.694
                              0.545           0.537           0.535         0.520         0.698
                              0.552           0.536           0.520         0.507         0.699
  ⋯
  124    5   5     4     1    0.657           0.647           0.788         0.673         0.768
                              0.667           0.648           0.787         0.674         0.767
                              0.659           0.659           0.779         0.676         0.769
                              0.654           0.661           0.789         0.672         0.776
  125    5   5     5     1    0.647           0.642           0.780         0.674         0.768
                              0.650           0.648           0.787         0.674         0.767
                              0.645           0.643           0.781         0.675         0.768
                              0.646           0.640           0.783         0.672         0.770
  126    1   1     1     2    0.633           0.626           0.770         0.661         0.768
                              0.632           0.625           0.769         0.659         0.767
                              0.634           0.627           0.774         0.663         0.759
                              0.631           0.624           0.772         0.662         0.764
  ⋯
  247    5   5     2     2    0.578           0.571           0.561         0.559         0.726
                              0.586           0.576           0.573         0.573         0.736
                              0.578           0.571           0.561         0.559         0.726
                              0.586           0.573           0.561         0.559         0.732
  248    5   5     3     2    0.572           0.553           0.545         0.533         0.712
                              0.590           0.550           0.553         0.556         0.736
                              0.572           0.553           0.545         0.533         0.712
                              0.590           0.553           0.545         0.533         0.736
  249    5   5     4     2    0.563           0.568           0.525         0.525         0.718
                              0.569           0.569           0.560         0.561         0.725
                              0.563           0.568           0.525         0.525         0.718
                              0.569           0.569           0.560         0.561         0.725
  250    5   5     5     2    0.564           0.575           0.537         0.546         0.729
                              0.569           0.564           0.555         0.548         0.715
                              0.575           0.566           0.727         0.607         0.761
                              0.564           0.548           0.715         0.611         0.755
Table 6. ANOVA results for accuracy on the CIFAR-10 dataset with an imbalance ratio of 25:1.

  Source       Degrees of Freedom   Sums of Squares   Mean Squares   F-Values   p-Values
  LR           4                    4172.556          1043.139       10.565     6.508 × 10⁻⁸
  DR 1         4                    112.928           28.232         0.245      0.912
  DR 2         4                    171.201           42.800         0.372      0.828
  KS           1                    498.492           498.492        7.373      0.036
  LR*DR 1      16                   1390.666          86.917         0.862      6.137 × 10⁻¹
  LR*DR 2      16                   871.953           54.497         0.530      9.298 × 10⁻¹
  LR*KS        4                    7464.556          1866.139       27.600     7.358 × 10⁻¹⁹
  DR 1*DR 2    16                   1145.995          71.625         0.598      0.883
  DR 1*KS      4                    49.896            12.474         0.108      0.979
  DR 2*KS      4                    104.523           26.131         0.227      0.922
Table 7. ANOVA results for accuracy on the CIFAR-10 dataset with an imbalance ratio of 15:1.

  Source       Degrees of Freedom   Sums of Squares   Mean Squares   F-Values   p-Values
  LR           4                    4264.931          1066.232       15.958     1.622 × 10⁻¹¹
  DR 1         4                    217.170           54.292         0.812      5.182 × 10⁻¹
  DR 2         4                    58.650            14.662         0.213      9.309 × 10⁻¹
  KS           1                    155.630           155.630        7.929      4.925 × 10⁻²
  LR*DR 1      16                   646.592           40.412         0.604      8.784 × 10⁻¹
  LR*DR 2      16                   361.287           22.580         0.328      9.937 × 10⁻¹
  LR*KS        4                    6178.579          1544.644       38.768     5.079 × 10⁻²⁵
  DR 1*DR 2    16                   646.767           40.422         0.598      0.958
  DR 1*KS      4                    205.396           51.349         0.629      0.642
  DR 2*KS      4                    115.262           28.815         0.348      0.844
Table 8. ANOVA results for accuracy on the CIFAR-10 dataset with an imbalance ratio of 1:1.

  Source       Degrees of Freedom   Sums of Squares   Mean Squares   F-Values   p-Values
  LR           4                    377.506           94.376         27.176     1.125 × 10⁻⁷
  DR 1         4                    8.587             2.146          0.431      0.786
  DR 2         4                    4.653             1.163          0.232      0.919
  KS           1                    177.729           177.729        41.954     4.982 × 10⁻¹⁰
  LR*DR 1      16                   19.541            1.221          0.334      9.930 × 10⁻¹
  LR*DR 2      16                   12.175            0.760          0.205      9.996 × 10⁻¹
  LR*KS        4                    523.034           130.758        209.146    5.673 × 10⁻¹⁷
  DR 1*DR 2    16                   4.330             0.270          0.050      0.929
  DR 1*KS      4                    3.162             0.790          0.182      9.472 × 10⁻¹
  DR 2*KS      4                    2.519             0.629          0.144      9.651 × 10⁻¹
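The ANOVA decompositions in Tables 6–8 (main effects of LR, DR 1, DR 2, and KS plus all two-way interactions) can be reproduced with standard statistical software. Below is a minimal, illustrative sketch using statsmodels; runs.csv is a hypothetical file with one row per replicate holding the factor levels and the measured accuracy, as in Table 5, and is not part of the authors' released materials.

# Minimal sketch: four-factor ANOVA with two-way interactions, as in Tables 6-8.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

runs = pd.read_csv("runs.csv")   # hypothetical file: columns LR, DR1, DR2, KS, accuracy

# Main effects of the four factors plus all two-way interactions, treated as categorical.
model = smf.ols(
    "accuracy ~ C(LR) + C(DR1) + C(DR2) + C(KS)"
    " + C(LR):C(DR1) + C(LR):C(DR2) + C(LR):C(KS)"
    " + C(DR1):C(DR2) + C(DR1):C(KS) + C(DR2):C(KS)",
    data=runs,
).fit()

anova_table = sm.stats.anova_lm(model, typ=2)   # sum_sq, df, F, PR(>F) for each term
print(anova_table)

Applying the same call to the other response columns (recall, F1 score, G-mean, P-mean) yields the corresponding ANOVA tables for those metrics.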
Table 9. Performance of minority class samples in the CIFAR-10 dataset across five models.

  Model   Accuracy   Recall   F1     G-Mean   P-Mean
  1       0.51       0.22     0.36   0.49     0.53
  2       0.49       0.16     0.27   0.40     0.86
  3       0.65       0.33     0.72   0.78     0.89
  4       0.67       0.27     0.71   0.76     0.94
  5       0.67       0.34     0.71   0.76     0.94
Table 10. Performance metrics for each class in the CIFAR-10 dataset (bold text indicates Model 5).

             Airplane  Automobile  Bird   Cat    Deer   Dog    Frog   Horse  Ship   Truck
  Recall     0.90      0.31        0.38   0.59   0.46   0.43   0.85   0.53   0.73   0.91
             0.87      0.23        0.35   0.58   0.22   0.32   0.81   0.38   0.79   0.84
             0.90      0.12        0.32   0.58   0.40   0.42   0.19   0.28   0.74   0.79
  F1 score   0.68      0.71        0.49   0.57   0.54   0.60   0.77   0.69   0.82   0.77
             0.60      0.37        0.41   0.44   0.33   0.41   0.70   0.52   0.73   0.65
             0.54      0.22        0.38   0.41   0.43   0.43   0.32   0.42   0.70   0.60
  G-mean     0.91      0.76        0.61   0.78   0.65   0.73   0.91   0.79   0.88   0.93
             0.88      0.48        0.58   0.72   0.47   0.56   0.88   0.61   0.87   0.88
             0.87      0.35        0.55   0.71   0.62   0.63   0.44   0.52   0.84   0.85
  P-mean     0.55      0.94        0.71   0.50   0.72   0.64   0.70   0.75   0.86   0.66
             0.45      0.52        0.49   0.47   0.52   0.50   0.45   0.49   0.46   0.45
             0.37      0.46        0.41   0.39   0.40   0.40   0.44   0.42   0.38   0.37
Table 11. Performance of minority class samples on the Fashion-MNIST dataset across five models.

  Model   Accuracy   Recall   F1     G-Mean   P-Mean
  1       0.84       0.28     0.42   0.52     0.90
  2       0.85       0.31     0.44   0.55     0.91
  3       0.84       0.24     0.38   0.48     0.90
  4       0.83       0.28     0.41   0.52     0.88
  5       0.87       0.38     0.50   0.59     0.94
Table 12. Performance metrics for each class in the Fashion-MNIST dataset (bold text indicates Model 5).

             T-Shirt  Trouser  Pullover  Dress  Coat   Sandal  Shirt  Sneaker  Bag    Ankle Boot
  Recall     0.97     0.96     0.90      0.90   0.75   0.98    0.46   0.84     0.97   0.99
             0.97     0.96     0.87      0.90   0.62   0.98    0.38   0.88     0.97   0.98
  F1 score   0.78     0.97     0.75      0.89   0.73   0.98    0.53   0.95     0.97   0.98
             0.77     0.78     0.75      0.87   0.70   0.97    0.44   0.92     0.97   0.95
  G-mean     0.95     0.97     0.91      0.94   0.84   0.98    0.61   0.94     0.98   0.98
             0.95     0.97     0.91      0.94   0.78   0.98    0.51   0.88     0.98   0.98
  P-mean     0.77     0.77     0.78      0.77   0.81   0.77    0.95   0.78     0.78   0.77
             0.77     0.77     0.78      0.77   0.79   0.77    0.91   0.78     0.77   0.77
