#### *3.2. Improvement of Reproduction Process*

In the reproduction process of the original BFO algorithm, the bacteria in a population of size *S* are ranked from good to bad according to the cost *L* of their current positions. The better half of the bacteria (*S*/2) are replicated, and the sub-population generated by replication replaces the other, worse half of the original bacterial population.

Because each parent leaves an identical offspring in the population of size *S* after replication, the diversity of the population is reduced. In this paper, the bacteria are likewise ranked from good to bad by the cost of their current positions, and the better half (*S*/2) of the bacteria are reproduced; the reproduced sub-population replaces the worse *S*/2 bacteria in the original population. To increase the diversity of the population while preventing the loss of the best individual, a crossover operator is then introduced: each parent individual (excluding the best parent individual) is crossed with the best individual. The crossover equation is [46]

$$
\sigma = \sigma + rand \times (\sigma_{best} - \sigma) \tag{8}
$$

where σ is a parent individual (excluding the best parent individual), σ*best* is the best parent individual, and *rand* is a random number drawn uniformly from [0, 1].
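The crossover of Equation (8) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the use of NumPy arrays for positions, and the convention that lower cost is better are all assumptions made here for clarity.

```python
import numpy as np

def crossover_with_best(population, costs, rng=None):
    """Cross every parent (except the best) with the best parent, per Eq. (8):
    sigma = sigma + rand * (sigma_best - sigma).
    `population` is an (S, D) array of positions; lower cost is assumed better."""
    rng = np.random.default_rng() if rng is None else rng
    pop = population.copy()
    best = int(np.argmin(costs))          # index of the best parent
    for i in range(len(pop)):
        if i == best:
            continue                      # the best individual is preserved
        r = rng.random(pop.shape[1])      # rand in [0, 1), per dimension
        pop[i] = pop[i] + r * (pop[best] - pop[i])
    return pop
```

Because *rand* lies in [0, 1), each crossed individual moves some fraction of the way toward the best individual, which keeps the best solution intact while pulling the rest of the population toward it.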

#### *3.3. Improvement of Elimination and Dispersal Process*

The elimination–dispersal operation helps the BFO algorithm jump out of local optimal solutions and find the global optimal solution. In the elimination–dispersal process of the original BFO, elimination and dispersal are carried out according to a given fixed probability *Ped*, without considering the evolution of the population.

In this paper, the elimination–dispersal operation is improved by introducing a population evolution factor, so that elimination–dispersal is carried out according to how the population evolves. This improves the effectiveness of the algorithm and prevents the population from stagnating in a local optimum when evolution is slow. The formula of the population evolution factor *fevo* is

$$f_{evo} = \frac{L_{gen} - L_{gen-1}}{L_{gen-1} - L_{gen-2} + rand} \tag{9}$$

where *Lgen* represents the optimal cost at iteration *gen* and *rand* prevents the denominator from being zero. In this paper, (1 − *fevo*) replaces the fixed *Ped* of the original BFO algorithm. When *fevo* > 1, evolution is accelerating: the population is in a fast and effective optimization state, and the lower elimination–dispersal probability (1 − *fevo*) retains the currently favorable location information. When 0 ≤ *fevo* < 1, evolution is slowing down, which usually means the population has largely fallen into a local optimum; the higher elimination–dispersal probability (1 − *fevo*) is then needed to jump out of the local optimum and keep the population from stagnating.
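Equation (9) and the adaptive probability can be sketched as below. The function names are illustrative, and clamping (1 − *fevo*) to [0, 1] is an assumption added here, since a probability must lie in that range (the text itself does not state how out-of-range values are handled).

```python
import random

def evolution_factor(L_gen, L_gen_1, L_gen_2, rng=random):
    """Population evolution factor f_evo, Eq. (9); the random term keeps
    the denominator from being zero."""
    return (L_gen - L_gen_1) / (L_gen_1 - L_gen_2 + rng.random())

def dispersal_probability(f_evo):
    """Adaptive elimination-dispersal probability (1 - f_evo), replacing the
    fixed Ped; clamping to [0, 1] is our assumption."""
    return min(1.0, max(0.0, 1.0 - f_evo))
```

With this sketch, fast evolution (*fevo* > 1) drives the dispersal probability toward 0, while slow evolution (*fevo* near 0) drives it toward 1, matching the behavior described above.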

To overcome the tendency of the BFO algorithm to fall into local optima and its uncertain search direction during the chemotaxis process, PSO is incorporated into the BFO algorithm in this paper. That is, a PSO update is added to the chemotaxis step of each individual bacterium, using the cost of each bacterium as its PSO fitness. In the improved chemotaxis process, PSO is performed to obtain the updated location θ*<sup>i</sup>* of each bacterium. The procedure of the proposed algorithm is detailed as follows.
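A single PSO update inside chemotaxis can be sketched as follows. This uses the standard PSO velocity and position equations with the parameter values reported in Section 4 (*w* = 0.6, *c*1 = *c*2 = 1.5, *vmax* = 2); the function name and array layout are assumptions, not the paper's code.

```python
import numpy as np

def pso_chemotaxis_step(theta, velocity, pbest, gbest,
                        w=0.6, c1=1.5, c2=1.5, vmax=2.0, rng=None):
    """One standard PSO update applied to a bacterium's position theta during
    chemotaxis. pbest is the bacterium's personal best position and gbest the
    swarm's global best; parameter defaults follow Section 4."""
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(np.shape(theta))
    r2 = rng.random(np.shape(theta))
    velocity = (w * velocity
                + c1 * r1 * (pbest - theta)
                + c2 * r2 * (gbest - theta))
    velocity = np.clip(velocity, -vmax, vmax)   # cap each component at vmax
    return theta + velocity, velocity
```

Each bacterium is thus treated as a particle: its new position blends inertia, attraction to its own best location, and attraction to the swarm's best location, giving chemotaxis a directed global search component.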


(8) If the maximum number of BFO iterations is reached, the algorithm terminates. Finally, the classification accuracy results of this implementation are output.

The proposed algorithm is then run with the cost *L* defined as the classification accuracy. This experiment used a classification accuracy based on the confusion matrix, which tests the performance of the classification method. The confusion matrix is shown in Table 3.


**Table 3.** The confusion matrix.

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |

TP and FP represent the true positive class and the false positive class, respectively; FN and TN represent the false negative class and the true negative class, respectively. When the predicted value is a positive example, it is recorded as P (positive); when the predicted value is a negative example, it is recorded as N (negative). When the predicted value matches the actual value, it is recorded as T (true); when the predicted value is opposite to the actual value, it is recorded as F (false). The four possible outcomes for an example in the data set after model classification are therefore TP: predicted positive, actual positive; FP: predicted positive, actual negative; TN: predicted negative, actual negative; and FN: predicted negative, actual positive. The classification accuracy calculation formula is

$$\text{Classification accuracy} = (\text{TP} + \text{TN})/(\text{TP} + \text{FN} + \text{FP} + \text{TN}) \times 100\% \tag{10}$$

The receiver operating characteristic curve (ROC curve) and the area under the curve (AUC) can test the performance of the classification results. The ROC curve has a favorable characteristic: when the distribution of positive and negative instances in the test dataset changes, the ROC curve can remain unchanged. Class imbalance often occurs in real data sets, i.e., there are many more negative instances than positive instances (or vice versa), and the distribution of positive and negative instances in the test data may change over time. The area under the ROC curve is therefore used as the evaluation metric for imbalanced data; it comprehensively describes the performance of classifiers under different decision thresholds. The AUC calculation formula is

$$\text{Area Under the Curve (AUC)} = \frac{1 + \left(\frac{\text{TP}}{\text{TP} + \text{FN}}\right) - \left(\frac{\text{FP}}{\text{TN} + \text{FP}}\right)}{2} \tag{11}$$
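Both metrics follow directly from the confusion matrix counts. The sketch below reads the first ratio in the AUC formula as the true-positive rate TP/(TP + FN) and the second as the false-positive rate FP/(FP + TN), i.e., the standard single-point estimate (1 + TPR − FPR)/2; the function names are illustrative.

```python
def classification_accuracy(tp, tn, fp, fn):
    """Eq. (10): accuracy as a percentage of all classified examples."""
    return (tp + tn) / (tp + fn + fp + tn) * 100.0

def auc_estimate(tp, tn, fp, fn):
    """Eq. (11): (1 + TPR - FPR) / 2, with TPR = TP/(TP + FN) the
    true-positive rate and FPR = FP/(FP + TN) the false-positive rate."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return (1.0 + tpr - fpr) / 2.0
```

For example, a classifier that finds all positives (TPR = 1) while mislabeling 20% of negatives (FPR = 0.2) scores an AUC of 0.9 under this estimate.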

#### **4. Simulation Results and Discussion**

In this study, our purpose was to obtain an effective algorithm with which to improve the classification accuracy on imbalanced data. To verify the performance of the proposed algorithm, ovarian cancer microarray data, a spam email dataset, and a zoo dataset were used for simulation experiments. The Borderline-SMOTE and Tomek link approaches were used to preprocess the data, oversampling the minority classes until they matched the majority class in size. In the simulation experiment, several algorithm parameters had to be determined. The BFO parameters were set as *S* = 50, *Nc* = 100, *Ns* = 4, *Nre* = 4, *Ned* = 2, *Ped* = 0.25, *xattract* = 0.05, *xrepellent* = 0.05, *yattract* = 0.05, *yrepellent* = 0.05, α*(i)* = 0.1, and *i* = 1, 2, ... *S*. The number of BFO iterations was *Nc* × *Nre* × *Ned* = 100 × 4 × 2 = 800. The results were evaluated using 10-fold cross validation with random partitions. The maximum number of PSO iterations was set to 5000, and the other PSO parameters were set as inertia weight *w* = 0.6, learning factors *c*<sup>1</sup> = *c*<sup>2</sup> = 1.5, and maximum particle velocity *vmax* = 2 [47].
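The experimental settings above can be collected in a plain configuration, which also makes the iteration count explicit. This is only a sketch; the dictionary keys are naming assumptions, not identifiers from the paper.

```python
# Parameter settings reported in Section 4, gathered as plain dicts.
bfo_params = dict(S=50, Nc=100, Ns=4, Nre=4, Ned=2, Ped=0.25,
                  x_attract=0.05, x_repellent=0.05,
                  y_attract=0.05, y_repellent=0.05, alpha=0.1)
pso_params = dict(max_iter=5000, w=0.6, c1=1.5, c2=1.5, vmax=2.0)

# Total BFO iterations: Nc * Nre * Ned = 100 * 4 * 2 = 800
total_bfo_iterations = bfo_params["Nc"] * bfo_params["Nre"] * bfo_params["Ned"]
```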

The parameter values of the algorithm are key to its performance and efficiency. In evolutionary algorithms there is no general method for determining the optimal parameters; most parameters are selected by experience. BFO and PSO have many parameters, and determining the combination that optimizes the performance of the algorithm is itself a complex optimization problem. In setting the PSO and BFO parameters, empirical values were used so that the algorithm can escape local solutions and find the global solution without excessive computation time.

#### *4.1. Comparing and Analyzing the Classification Accuracy of the Proposed Algorithm and Other Methods*



**Table 4.** The classification accuracy for microarray data of ovarian cancer. Legend: RF, random forest; SVM, support vector machine; DT, decision tree; KNN, k nearest neighbor.


**Table 6.** The classification accuracy for the zoo dataset.


#### *4.2. Analysis of ROC and AUC*

In this experiment, the area below the ROC curve, the AUC, is used to evaluate the performance of the proposed approach. The AUC ranges from 0 to 1.0, and the closer it is to 1.0, the better the model classifier. The AUC is 0.979 for the ovarian cancer microarray data, as shown in Figure 4; 0.987 for the spam email dataset, as shown in Figure 5; and 0.995 for the zoo dataset, as shown in Figure 6. Hence, the experimental results show that the proposed algorithm has good classification performance.

**Figure 4.** The receiver operating characteristic (ROC) and the area under the curve (AUC) for the microarray data of ovarian cancer.

**Figure 6.** The ROC and AUC for the zoo dataset.

#### **5. Conclusions**

This paper has proposed the incorporation of particle swarm optimization into an improved bacterial foraging optimization algorithm applied to the classification of imbalanced data. The Borderline-SMOTE and Tomek link approaches were used to pre-process the data. Thereafter, the improved BFO was applied to the classification of imbalanced data to address the tendency of the original BFO algorithm to fall into local optima. Three datasets were used for testing the performance of the proposed algorithm. The proposed algorithm includes an improved chemotaxis process, an improved reproduction process, and an improved elimination and dispersal process. In this paper, the global search ability of the BFO was improved by using particles to search and then treating particles as bacteria in the improved chemotaxis process. After the improved chemotaxis, the swarming operations, improved reproduction operations, and improved elimination and dispersal operations were performed. The average classification accuracy of the proposed algorithm for the ovarian cancer microarray data was 93.47%. The average classification accuracies of the proposed algorithm for the spam email dataset and the zoo dataset were 96.42% and 99.54%, respectively. The AUC was 0.979 for the ovarian cancer microarray data, 0.987 for the spam email dataset, and 0.995 for the zoo dataset. The experimental results showed that the algorithm proposed in this research achieves better accuracy in the classification of imbalanced data than the existing approaches compared.

In this paper, PSO was introduced into an improved bacterial foraging optimization algorithm and applied to the classification of imbalanced data. Based on the research results, we put forward the following suggestions:


**Author Contributions:** Methodology, F.-L.Y. and C.-Y.L.; software, F.-L.Y., J.-Q.H., and J.-F.T.; formal analysis, F.-L.Y., C.-Y.L., and Z.-J.L.; investigation, F.-L.Y., C.-Y.L., and J.-F.T.; resources, C.-Y.L. and Z.-J.L.; data curation, F.-L.Y., J.-Q.H., and J.-F.T.; original draft preparation, F.-L.Y., C.-Y.L., and J.-F.T.; review and editing, F.-L.Y., C.-Y.L., and J.-F.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** This research was supported by the Major Education and Teaching Reform Projects in Fujian Undergraduate Colleges and Universities in 2019 under grant FBJG20190284. This work was also supported by projects under 2019-G-083.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
