This section describes the experimental setups, benchmarks, and procedures used throughout the experiments, along with the results obtained and their analysis.
7.2. Effects of Fitness Function
We extended the MHOANN to incorporate the costs of misclassified instances during model training by implementing a cost sensitivity fitness function based on the confusion matrix described in Section 5. For the problem at hand, we aimed to avoid FN predictions, i.e., cases in which the model predicts that a company is financially stable while it is actually in financial distress. Hence, we assigned a weighted cost to FN predictions. The proper weight for the FN predictions depends on the dataset and the algorithm being used, so we determined it by experimenting with different weights while monitoring the evaluation metrics, as sketched below. Since the datasets in this work were relatively small, we were able to experiment using the whole datasets; however, in real applications with large datasets, we recommend using a sample of the dataset to find the best weight in order to reduce the computational costs. We adopted the weight that yielded the highest g-mean score for the subsequent experiments.
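As an illustration of this tuning procedure, the following minimal Python sketch sweeps candidate FN weights and keeps the one that yields the best g-mean score. Here, `train_mhoann` and its `fn_weight` parameter are hypothetical placeholders for the actual training routine, not the authors' code, and the candidate weights are arbitrary.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (positive = distressed)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return np.sqrt(sensitivity * specificity)

def find_best_fn_weight(X, y, train_mhoann,
                        candidate_weights=(25, 50, 75, 100, 125, 150, 175)):
    """Train one model per candidate FN weight; keep the weight with the best g-mean."""
    best_weight, best_score = None, -1.0
    for w in candidate_weights:
        model = train_mhoann(X, y, fn_weight=w)  # assumed training routine
        score = g_mean(y, model.predict(X))
        if score > best_score:
            best_weight, best_score = w, score
    return best_weight, best_score
```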
Tables 2 and 3 show the results for the dataset of Spanish companies with the PSO and the CSO, respectively; Tables 4 and 5 show the corresponding results for the dataset of Taiwanese companies; and Tables 6 and 7 show those for the dataset of Polish companies.
From these experiments, we observed that the best weight for FN predictions when using the PSO on the dataset of Spanish companies was 100 (Figure 6), while the best weight with the CSO on the same dataset was 75 (Figure 7). For the dataset of Taiwanese companies, the best weight was 50 with both the PSO (Figure 8) and the CSO (Figure 9), and for the dataset of Polish companies it was 175 with both optimizers (Figures 10 and 11). After determining the best FN weight for each optimization algorithm and dataset, we trained the MHOANN using the cost sensitivity fitness function with the corresponding FN weight and then used the trained model to classify the instances in the testing dataset. We observed that, to obtain reasonable g-mean scores, the weight of the FN predictions needed to be considerably high, between 50 and 175, which can be attributed to the extreme class imbalance in the considered datasets.
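A minimal sketch of such a cost sensitivity fitness function is shown below, assuming the optimizer minimizes the fitness value and the cost of an FP is kept at 1; `ann.set_weights` and `ann.predict` are illustrative stand-ins, and the actual cost matrix is the one described in Section 5.

```python
import numpy as np

def cost_sensitive_fitness(particle, ann, X, y, fn_weight):
    """Total misclassification cost, with FN errors scaled by fn_weight.
    Lower is better; the optimizer (PSO or CSO) minimizes this value."""
    ann.set_weights(particle)              # decode the particle into ANN weights
    y_pred = ann.predict(X)
    fn = np.sum((y == 1) & (y_pred == 0))  # distressed firms predicted as stable
    fp = np.sum((y == 0) & (y_pred == 1))  # stable firms predicted as distressed
    return fn_weight * fn + fp             # FP cost fixed at 1 in this sketch
```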
To assess the effects of the cost sensitivity fitness function, we compared our results against a benchmark. In the benchmark, we used each optimizer (PSO and CSO) with two conventional fitness functions, MSE and accuracy, and trained the ANN on each dataset to observe the evaluation metrics without cost-sensitive learning. For each dataset, we executed four experiments: the ANN with the PSO and MSE as the fitness function, the ANN with the PSO and accuracy as the fitness function, the ANN with the CSO and MSE as the fitness function, and the ANN with the CSO and accuracy as the fitness function. The averages and standard deviations were calculated, along with the best scores for each metric.
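For reference, the two baseline fitness functions could be sketched as follows (with the same hypothetical `ann` helpers); accuracy is expressed as an error rate so that both functions are minimized by the optimizer.

```python
import numpy as np

def mse_fitness(particle, ann, X, y):
    """Mean squared error between the network outputs and the labels."""
    ann.set_weights(particle)
    outputs = ann.forward(X)               # assumed raw outputs in [0, 1]
    return float(np.mean((outputs - y) ** 2))

def accuracy_fitness(particle, ann, X, y):
    """Error rate (1 - accuracy), so lower fitness means higher accuracy."""
    ann.set_weights(particle)
    return float(np.mean(ann.predict(X) != y))
```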
Table 8 shows the results for all of the fitness functions applied to the dataset of Spanish companies, Table 9 shows those for the dataset of Taiwanese companies, and Table 10 shows those for the dataset of Polish companies. The cost-sensitive MHOANN showed major improvements when predicting the minority class, which had a strong positive impact on the g-mean and F1 score metrics and a negative impact on the accuracy.
Using the dataset of Spanish companies, comparing the ANN with the PSO and the cost sensitivity fitness function to the same classifier with MSE or accuracy as the fitness function revealed a major increase in the g-mean, an improvement in the F1 score, and a drop in the accuracy; the comparisons for the ANN with the CSO followed the same pattern (the exact values are listed in Table 8). We observed analogous results for the dataset of Taiwanese companies: with both optimizers, replacing the MSE or accuracy fitness function with the cost sensitivity fitness function produced a major increase in the g-mean, an increase in the F1 score, and a drop in the accuracy (Table 9). The same held for the dataset of Polish companies, again for both the PSO and the CSO and against both baseline fitness functions (Table 10).
We could see that applying the weight to the FN predictions increased the number of TP instances, which explains the improvements in the g-mean and F1 score values. However, it also increased the number of FP instances, which explains the decrease in the accuracy score. Next, we used majority voting ensemble learning to decrease the number of FP instances while maintaining the number of TP instances, as sketched below.
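A minimal sketch of such a majority voting combiner, assuming an odd number of independently trained MHOANN members that each expose a `predict` method returning 0/1 labels, might look as follows.

```python
import numpy as np

def majority_vote(models, X):
    """Combine the hard predictions of several MHOANNs by majority vote."""
    votes = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_samples)
    return (votes.sum(axis=0) > len(models) / 2).astype(int)
```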
Additionally, an interesting observation was that a lightweight optimizer with a simple mechanism for updating the particles within the search space, such as the CSO, could achieve results comparable to those of the PSO when used as the optimizer for an MHOANN.
Another observation was that, whereas the PSO and CSO produced similar results when using the same fitness functions, the CSO was better in terms of execution time. With the same population size and the same number of iterations (100), the CSO was consistently faster than the PSO for the datasets of Spanish, Taiwanese, and Polish companies; Table 11 lists the actual execution times in seconds.
In this work, as discussed in Section 7.2, we noticed a direct relationship between the weight of the FN predictions and the set of monitored metrics. While we chose the weight that produced the best g-mean score, i.e., a balance between sensitivity and specificity, a lower weight would produce a better specificity score and a higher weight would produce a better sensitivity score, so the choice depends on which metric the user prioritizes.
7.4. Comparison to Other Approaches
In [34], the authors proposed a hybrid method that combined the synthetic minority oversampling technique (SMOTE) with ensemble methods. Additionally, the authors applied five different feature selection methods to determine the most dominant attributes for insolvency prediction using the same dataset of Spanish companies. First, the authors compared four oversampling methods by applying the C4.5 decision tree classifier to each; SMOTE was selected since it produced the best results. Second, the authors compared several standard basic and ensemble classification algorithms as the baseline for the study.
Table 15 shows the g-mean scores of the standard classifiers in [34] compared to those of the two methods proposed in this work. It can be seen that the proposed methods produced higher g-mean scores than all of the other classifiers in the related study. Third, the authors compared several basic and ensemble classification algorithms after oversampling with SMOTE in order to select the best-performing classifier; the AB-Rep tree was selected. Finally, the authors applied different attribute selectors for feature selection, followed by oversampling with SMOTE and classification with the AB-Rep tree algorithm, and compared the results.
Table 16 shows the best results, based on the g-mean scores, in [34] and those of the two methods proposed in this work. It is clear that the proposed methods significantly improved the g-mean scores. These results highlight the benefits of applying cost-sensitive learning to our MHOANN, as well as the additional gains from using ensemble learning to improve financial distress prediction. Although the same dataset was used in this work and in [34], it is worth mentioning some differences between the experimental setups: (1) in this work, we used a 66%/34% split for the training and testing datasets, while the authors of [34] used 10-fold cross-validation, meaning that 90% of their data were used to train the model, yet the approach proposed in this work still showed better results; and (2) ten separate runs were performed in [34] for each combination, while we performed five separate runs per combination in this work.
In another study that used the dataset of Taiwanese companies [35], the authors established that integrating financial ratios (FRs) and corporate governance indicators (CGIs) could enhance classifier performance when forecasting the financial health of Taiwanese firms. Following this combination, five feature selection methodologies were evaluated to determine whether they could lower the data dimensionality. The best results were achieved using an SVM with the stepwise discriminant analysis (SDA) feature selection method, along with the combination of FRs and CGIs (FC). The g-mean was not used as an evaluation metric in that study; instead, type I and type II errors were used.
A type I error [53] is also known as the False Positive Rate (FPR). In binary classification tasks, the FPR quantifies the proportion of actual negative samples that are incorrectly classified as positive. It is defined in Equation (15):

$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} \tag{15}$$
A type II error [53] is also known as the False Negative Rate (FNR). In binary classification tasks, the FNR quantifies the proportion of actual positive samples that are incorrectly classified as negative. It is defined in Equation (16):

$$\mathrm{FNR} = \frac{\mathrm{FN}}{\mathrm{FN} + \mathrm{TP}} \tag{16}$$
Hence, the g-mean score can be recovered using Equation (17):

$$\text{g-mean} = \sqrt{(1 - \mathrm{FPR}) \times (1 - \mathrm{FNR})} \tag{17}$$
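For completeness, the conversion in Equations (15)-(17) takes only a few lines of Python; the rates in the usage comment are illustrative, not values from [35].

```python
import math

def g_mean_from_error_rates(type1_error, type2_error):
    """g-mean from type I (FPR) and type II (FNR) error rates, per Equation (17)."""
    specificity = 1.0 - type1_error  # 1 - FPR
    sensitivity = 1.0 - type2_error  # 1 - FNR
    return math.sqrt(specificity * sensitivity)

# e.g., type I = 0.10 and type II = 0.20 give g-mean = sqrt(0.9 * 0.8) ~= 0.849
```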
Table 17 shows the best g-mean scores, calculated from the type I and type II errors, in [35] alongside those of the two methods proposed in this work. It can be seen that both of the proposed methods produced higher g-mean scores.
7.5. Analysis and Discussion
The results of our experiments indicated that, for highly imbalanced datasets, the proposed method had a significant positive impact on the g-mean score (which measures the balance between the classification performances on the majority and minority classes) while maintaining an acceptable accuracy score. We found that the cost sensitivity fitness function helped shift the bias away from the majority class and toward the minority class, and that ensemble learning helped reduce the side effects of that bias shift.
In line with our hypothesis, applying a weight to the misclassified positive instances increased the number of TP predictions and decreased the number of FN predictions. However, as a side effect, the number of FP predictions increased and the number of TN predictions decreased. Since we were dealing with highly imbalanced datasets, the number of instances belonging to the minority class was much lower than the number belonging to the majority class; consequently, the improvement in the sensitivity score was significant while the drop in the specificity score was not as drastic, which led to an overall improved g-mean score, as observed in the results of all experiments.
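A toy numeric example (with invented counts, not values from the studied datasets) illustrates this asymmetry:

```python
# 1000 firms: 950 stable (negative) and 50 distressed (positive).
before = dict(tp=10, fn=40, tn=940, fp=10)   # biased toward the majority class
after  = dict(tp=45, fn=5,  tn=850, fp=100)  # bias shifted by a large FN weight

def report(cm):
    sens = cm["tp"] / (cm["tp"] + cm["fn"])
    spec = cm["tn"] / (cm["tn"] + cm["fp"])
    acc = (cm["tp"] + cm["tn"]) / sum(cm.values())
    return sens, spec, (sens * spec) ** 0.5, acc

# before: sensitivity 0.20, specificity ~0.99, g-mean ~0.44, accuracy 0.950
# after:  sensitivity 0.90, specificity ~0.89, g-mean ~0.90, accuracy 0.895
```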
Moreover, when applying ensemble learning, we observed an overall improvement in all of the evaluation measurements used. This indicated that the MHOANN models were diverse enough to be combined in a homogeneous ensemble learning system. The ensemble created a stronger learner that approximately maintained the number of FN predictions but decreased the number of FP predictions, resulting in a slightly better g-mean score and a significant improvement in the accuracy score.
In terms of performance, as previously mentioned, the CSO outperformed the PSO regarding execution time. In contrast to the PSO, which updates every particle in each iteration, the CSO updates only half of the population, which explains the faster execution times.
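For context, one CSO iteration can be sketched as follows; this is a generic rendering of the pairwise-competition update rule, not the authors' implementation, and it assumes an even population size and a fitness function to be minimized.

```python
import numpy as np

def cso_step(positions, velocities, fitness_fn, phi=0.1):
    """One CSO iteration: particles compete in random pairs, and only the loser
    of each pair is updated (pulled toward the winner and the swarm mean), so
    roughly half the population moves per iteration."""
    n, d = positions.shape
    fitness = np.array([fitness_fn(x) for x in positions])
    x_mean = positions.mean(axis=0)
    pairs = np.random.permutation(n)
    for a, b in zip(pairs[0::2], pairs[1::2]):
        winner, loser = (a, b) if fitness[a] < fitness[b] else (b, a)
        r1, r2, r3 = np.random.rand(3, d)
        velocities[loser] = (r1 * velocities[loser]
                             + r2 * (positions[winner] - positions[loser])
                             + phi * r3 * (x_mean - positions[loser]))
        positions[loser] += velocities[loser]
    return positions, velocities
```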
In Appendix A, we show the convergence (learning) curves for sample runs with both optimizers (the PSO and CSO) for each fitness function and each dataset. We noticed that the fitness values were minimal in the cases of the MSE and accuracy fitness functions, which indicated that the model had a high accuracy (as confirmed by the previous results) but was biased toward the majority class and failed to predict the minority class (as previously discussed). On the other hand, the fitness value was higher when using the cost sensitivity fitness function, which was expected because the number of FN predictions was multiplied by the allocated weight. Additionally, in all of our experiments, the fitness scores stabilized as they approached 100 iterations, which indicated that additional training would not significantly improve the model.