**1. Introduction**

In machine learning, an uneven distribution of categories is called the class imbalance problem. When conventional algorithms are directly applied to such data, the classification results tend to be biased toward the majority classes, so that minority classes are not correctly identified. Moreover, most traditional algorithms train classifiers by maximizing overall accuracy, meaning they ignore the misclassification of minority samples, which degrades the classification results of traditional classifiers [1–3]. However, in many practical applications, minority samples are often more valuable than majority samples, such as in bank fraud user identification, medical cancer diagnosis, and network hacker intrusion detection [4–9].

Imbalanced data mining is an important problem in data mining. Various algorithms, including k-nearest neighbor (KNN), decision tree (DT), artificial neural network (ANN), and the genetic algorithm (GA), have been recommended for data mining [10–17]. However, these algorithms usually assume that datasets are evenly distributed among the different classes, so minority classes may be ignored. In the literature, several methods for dealing with imbalanced data have been proposed, including adjusting the size of training datasets, cost-sensitive classifiers, and snowball methods [18–20]. These methods may result in the loss of information about general rules and the incorrect classification of additional classes. Ultimately, they can lead to over-fitting of the data and poor performance due to having too many specific rules. Traditional optimization methods can no longer solve the complex problems posed by many datasets. In recent years, hybrid intelligent systems have been proposed to improve the accuracy of data mining rather than using a single method. A hybrid method combines the best results of various systems to improve accuracy [21–23].

Particle swarm optimization (PSO) was first proposed by Eberhart and Kennedy [24,25]. It is a population-based heuristic algorithm that simulates social behavior, such as birds flocking to promising locations, in order to find accurate targets in a multi-dimensional space. Like evolutionary algorithms, PSO performs its search with a group of individuals (called particles), which are updated from one iteration to the next [26–30]. In order to find the optimal solution, each particle changes its search direction based on two factors: its best previous location (*pbest*) and the best location found by all members of the swarm (*gbest*) [31–34]. Shi et al. called *pbest* the cognitive part and *gbest* the social part [35].

The bacterial foraging optimization (BFO) algorithm is a bionic intelligent algorithm proposed by Passino in 2002, inspired by the foraging behavior of *Escherichia coli* in the human intestine [36,37]. The chemotaxis process gives bacterial foraging strong local search ability, but its global search can only be achieved through elimination and dispersal; since this is limited by the elimination and dispersal probability, the global search ability is not strong enough, and thus the algorithm easily falls into local optima. In this paper, the incorporation of particle swarm optimization into an improved bacterial foraging optimization algorithm, applied to the classification of imbalanced data, is proposed. The borderline synthetic minority oversampling technique (Borderline-SMOTE) and the Tomek link are used to pre-process the imbalanced data. Thereafter, the proposed algorithm is used to classify the imbalanced data.

Because PSO has a strong global search ability, an individual effect, and a group effect, it is incorporated into the chemotaxis process of the improved BFO algorithm. The proposed algorithm improves global search ability and efficiency through the strong global search ability of PSO. In addition to embedding PSO into the BFO chemotaxis process to remedy the BFO algorithm's vulnerability to local optima, the improved reproduction operation introduces a crossover operator into the reproducing parents to increase the diversity of the population while retaining the best individual. In the improved elimination and dispersal operation, the population evolution factor *fevo* is proposed, and (1 − *fevo*) is introduced to replace *Ped* in the original BFO algorithm so as to prevent the population from falling into a local optimum and stagnating. The purpose of this study was to improve the classification accuracy of ovarian cancer microarray data and to improve the practicability and accuracy of doctors' judgments based on such data.

This paper is organized as follows: Section 2 reviews PSO and BFO. Section 3 shows the proposed algorithms. Section 4 presents the experimental results and discussion. This section also describes an in-depth comparison of the proposed algorithm with other methods. Finally, a conclusion is given.

#### **2. A Brief Description of Bacterial Foraging Optimization and Particle Swarm Optimization**

In this paper, the bacterial foraging optimization algorithm is improved. Firstly, PSO is incorporated into the BFO chemotaxis process to improve the chemotaxis process. For this reason, this section introduces the basic concepts of bacterial foraging optimization and particle swarm optimization.

#### *2.1. Bacterial Foraging Optimization*

Passino introduced bacterial foraging optimization as a solution to distributed optimization and control problems. It is an evolutionary algorithm and a global random search algorithm. The BFO algorithm solves the optimization problem using four iteratively applied processes: chemotaxis, swarming, reproduction, and elimination–dispersal [38]. In the chemotaxis process, *E. coli* has two basic movements while foraging, namely swimming and tumbling. Usually, in areas with poor environmental conditions (for example, toxic areas), bacteria tumble more frequently, and in areas with a good environment, they swim more often. Let *P*(*j*, *k*, *l*) = {θ<sup>*i*</sup>(*j*, *k*, *l*) | *i* = 1, 2, ... , *S*} denote the positions of the *S* bacteria in the population at the *j*th chemotaxis step, *k*th reproduction step, and *l*th elimination–dispersal step, and let *L*(*i*, *j*, *k*, *l*) be the cost at the location θ<sup>*i*</sup>(*j*, *k*, *l*) of the *i*th bacterium. With a population size of *S*, and *Nc* denoting the number of chemotaxis steps in a bacterium's lifetime, each chemotaxis step of the *i*th bacterium is expressed as

$$\theta^{i}(j+1,\,k,\,l) = \theta^{i}(j,\,k,\,l) + \alpha(i)\frac{\delta(i)}{\sqrt{\delta^{T}(i)\delta(i)}}\tag{1}$$

where α(*i*) > 0 is the unit step length of forward swimming and δ(*i*) is a unit vector in the random direction chosen after a tumble. In the swarming process, in addition to searching for food in its own way, each bacterium receives an attraction signal from other individuals in the population; that is, the individual swims toward the center of the population. It also receives a repulsion signal from nearby individuals, so as to maintain a safe distance between itself and the others. Hence, the foraging decision of each bacterium is affected by two factors: the first is its own information, i.e., the goal of maximizing the energy acquired per unit time, and the other is information from other individuals, i.e., foraging information transmitted by other bacteria in the population. The mathematical expression is described as

$$\begin{aligned} L_{cc}(\theta, P(j,k,l)) &= \sum_{i=1}^{S} L_{cc}^{i}\left(\theta,\ \theta^{i}(j,k,l)\right) \\ &= \sum_{i=1}^{S}\left[-x_{attract}\exp\!\left(-y_{attract}\sum_{m=1}^{p}\left(\theta_{m}-\theta_{m}^{i}\right)^{2}\right)\right] \\ &\quad + \sum_{i=1}^{S}\left[x_{repellent}\exp\!\left(-y_{repellent}\sum_{m=1}^{p}\left(\theta_{m}-\theta_{m}^{i}\right)^{2}\right)\right] \end{aligned}\tag{2}$$

where *Lcc*(θ, *P*(*j*, *k*, *l*)) denotes the penalty added to the actual cost function, *S* is the number of bacteria, θ<sub>*m*</sub> is the *m*th component of a position θ in the *p*-dimensional search space, and *xattract*, *xrepellent*, *yattract*, and *yrepellent* are attraction and repulsion coefficients. The cost minimized during the swarming process is then

$$L_{sw}(i,\,j,\,k,\,l) = L(i,\,j,\,k,\,l) + L_{cc}(\theta,\ P(j,\,k,\,l))\tag{3}$$

In the swarming process, the maximum number of swim steps in one direction is *Ns*. In the reproduction process, the bacteria are ranked by the strength of their foraging ability, measured by the cost *L*; that is, the sum of the costs of all the locations experienced by the *i*th bacterium during the chemotaxis operation is ranked, and the worse 50% of the population is eliminated. Each of the remaining bacteria reproduces by splitting into two individuals with the same location and the same foraging ability as itself, so the replication operation keeps the population size constant. After *Nre* reproduction steps, the elimination and dispersal process occurs, where *Ned* is the number of elimination and dispersal steps. These operations occur with a certain probability *Ped*: when an individual bacterium meets the elimination and dispersal probability *Ped*, it dies and a new individual is randomly generated at any location in the solution space. These new bacteria may have foraging capabilities different from those of the original bacteria, which is conducive to jumping out of local optima. A flow diagram of bacterial foraging optimization is presented in Figure 1.
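As a minimal sketch of the chemotaxis operation described above (not the authors' implementation), one tumble-then-swim step of Equation (1) can be written as follows; the sphere cost function, step size, and swim limit are illustrative assumptions.

```python
import numpy as np

def chemotaxis_step(theta, cost, alpha, n_swim):
    """One chemotaxis step of Eq. (1): tumble, then swim while the cost improves."""
    # Tumble: unit vector in a random direction, delta / sqrt(delta^T delta)
    delta = np.random.uniform(-1.0, 1.0, size=theta.shape)
    direction = delta / np.sqrt(delta @ delta)
    best_cost = cost(theta)
    for _ in range(n_swim):                 # swim at most Ns steps in this direction
        candidate = theta + alpha * direction
        c = cost(candidate)
        if c < best_cost:                   # keep swimming only while the cost improves
            theta, best_cost = candidate, c
        else:
            break
    return theta, best_cost

# Toy usage: minimize a sphere cost (an illustrative stand-in for the real fitness)
theta0 = np.array([2.0, -1.5])
theta1, c1 = chemotaxis_step(theta0, lambda x: float(x @ x), alpha=0.1, n_swim=4)
```

Because a move is accepted only when it lowers the cost, the returned cost is never worse than the starting cost, mirroring the greedy swim behavior of the chemotaxis process.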

**Figure 1.** A flow diagram of bacterial foraging optimization (BFO).

#### *2.2. Particle Swarm Optimization*

PSO is a bionic algorithm inspired by the way birds search for food in nature. It regards each bird as a particle in space, and the flock as a swarm [39,40]. Each particle carries its own information, i.e., its velocity and location, and determines the distance and direction of its motion according to that information. PSO initializes a group of particles randomly distributed in the solution space to be searched and then iterates them according to given equations. The mature particle swarm optimization algorithm involves two optimum concepts: the local optimum *pbest* and the global optimum *gbest*. The local optimum is the best solution found by each particle during its search, and the global optimum is the best solution found by the whole swarm. The PSO algorithm has memory and uses positive feedback adjustment; the principle of the algorithm is simple, its parameters are few, and its applicability is good. The formulae of PSO are Equations (4) and (5), as described below.

$$v_{i}^{t+1} = w\,v_{i}^{t} + c_{1} \times rand_{1}^{t} \times \left(x_{i}^{pbest} - x_{i}^{t}\right) + c_{2} \times rand_{2}^{t} \times \left(x^{gbest} - x_{i}^{t}\right)\tag{4}$$

$$x_{i}^{t+1} = x_{i}^{t} + v_{i}^{t+1}\tag{5}$$

In Equation (4), *v<sub>i</sub><sup>t</sup>* and *v<sub>i</sub><sup>t+1</sup>* denote the velocity of the *i*th particle in iterations *t* and *t* + 1, *w* is the inertia weight, *c*<sub>1</sub> and *c*<sub>2</sub> are learning factors, *rand<sub>1</sub><sup>t</sup>* and *rand<sub>2</sub><sup>t</sup>* are random numbers in [0, 1] at iteration *t*, *x<sub>i</sub><sup>pbest</sup>* is the best location found by the *i*th particle, and *x<sup>gbest</sup>* is the best location found by all particles in the population. In Equation (5), *x<sub>i</sub><sup>t</sup>* and *x<sub>i</sub><sup>t+1</sup>* denote the location of the *i*th particle in iterations *t* and *t* + 1. A flow chart of PSO is shown in Figure 2.
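Equations (4) and (5) can be sketched in a few lines; this is a generic illustration of one PSO iteration, with the inertia weight and learning factors set to commonly used but assumed values.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """One PSO iteration: Eq. (4) velocity update, then Eq. (5) position update."""
    rng = rng or np.random.default_rng()
    r1 = rng.random(x.shape)   # rand_1^t, uniform in [0, 1]
    r2 = rng.random(x.shape)   # rand_2^t
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # Eq. (4)
    x_new = x + v_new                                              # Eq. (5)
    return x_new, v_new

# Toy usage with 3 particles in 2 dimensions (values are illustrative)
rng = np.random.default_rng(0)
x = rng.random((3, 2))
v = np.zeros((3, 2))
pbest = x.copy()          # each particle's best location so far
gbest = x[0].copy()       # swarm's best location so far
x, v = pso_step(x, v, pbest, gbest, rng=rng)
```

Note that a particle already sitting at both its *pbest* and the swarm's *gbest* with zero velocity does not move, since both attraction terms in Equation (4) vanish.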

**Figure 2.** A flow chart of the particle swarm optimization (PSO) algorithm.

#### **3. The Proposed Algorithm**

In this paper, the incorporation of particle swarm optimization into an improved bacterial foraging optimization algorithm, applied to the classification of imbalanced data, is proposed. Three datasets are used to test the performance of the proposed algorithm: one consists of ovarian cancer microarray data, and the other two, obtained from the UCI repository, are a spam email dataset and a zoo dataset. The ovarian cancer microarray data were obtained from a university in Taiwan; they contain 9600 features, were collected from China Medical University Hospital, and have an imbalance ratio of about 1:20 [41,42]. The instances of microarray data we used included ovarian tissue, vaginal tissue, cervical tissue, and myometrium, comprising six benign ovarian tumors (BOT), 10 ovarian tumors (OVT), and 25 ovarian cancers (OVCA). The spam email dataset and zoo dataset were obtained from the UCI repository [43]. The spam email dataset contains 4601 emails with 58 features, as shown in Table 1, with an imbalance ratio of about 1:1.54. The zoo dataset contains 101 instances with 17 features, as shown in Table 2, with an imbalance ratio of about 1:25.


**Table 1.** The 58 features of the spam email dataset.

**Table 2.** The 17 features of the zoo dataset.


Figure 3 shows a flow chart of the proposed algorithm. First, the parameters are set. The Borderline-SMOTE and Tomek link approaches are then used to pre-process the data. Thereafter, the improved BFO algorithm is applied to classify the imbalanced data so as to overcome the original BFO algorithm's tendency to fall into a local optimum.

In order to over-sample the minority instances, Borderline-SMOTE is used in the proposed algorithm; the main idea of SMOTE is to balance the classes by generating synthetic instances from the minority class [44]. For a minority instance *mi*, its k nearest neighbors, i.e., the instances with the smallest Euclidean distance to *mi*, are found by searching. Then n of these neighbors are randomly selected and denoted *Yj*, *j* = 1, 2, ... , n. A new minority instance is created as in Equation (6), where *rand* is a random number in [0, 1].

$$m_{new} = m_{i} + rand \times \left(Y_{j} - m_{i}\right)\tag{6}$$
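The SMOTE synthesis of Equation (6) can be sketched as follows; this is a generic illustration rather than the authors' code, and it assumes the k nearest minority neighbors have already been found.

```python
import numpy as np

def smote_synthesize(m_i, neighbors, n_new, rng=None):
    """Generate n_new synthetic minority instances via Eq. (6):
    m_new = m_i + rand * (Y_j - m_i), with Y_j a randomly chosen neighbor."""
    rng = rng or np.random.default_rng()
    synthetic = []
    for _ in range(n_new):
        y_j = neighbors[rng.integers(len(neighbors))]  # pick one of the k neighbors
        r = rng.random()                               # rand in [0, 1]
        synthetic.append(m_i + r * (y_j - m_i))        # point on the segment m_i -> Y_j
    return np.array(synthetic)

# Toy usage: one minority instance and its k = 3 nearest minority neighbors
m_i = np.array([0.0, 0.0])
nbrs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_pts = smote_synthesize(m_i, nbrs, n_new=5, rng=np.random.default_rng(1))
```

Each synthetic point lies on the line segment between *mi* and one of its neighbors, which keeps the new instances inside the minority region.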

In the proposed algorithm, the Tomek link is applied as a data cleaning technique to effectively eliminate the overlap introduced by the sampling method [45]. The Tomek link is used to remove unnecessary overlaps between classes until all nearest-neighbor pairs at minimal distance belong to the same class. Suppose that a pair of nearest neighbors (*mi*, *mj*) at minimal Euclidean distance belong to different classes, and let *d*(*mi*, *mj*) denote the Euclidean distance between *mi* and *mj*. If there is no instance *ml* satisfying Equation (7), we call (*mi*, *mj*) a Tomek link pair.

$$d(m_{i},\,m_{l}) < d(m_{i},\,m_{j}) \quad \text{or} \quad d(m_{j},\,m_{l}) < d(m_{i},\,m_{j})\tag{7}$$
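Since no third instance may be closer to either member of the pair than they are to each other, a Tomek link pair is equivalently a pair of mutual nearest neighbors from different classes. The following is a minimal sketch of this detection (an illustration, not the authors' implementation):

```python
import numpy as np

def tomek_links(X, y):
    """Find Tomek link pairs: mutual nearest neighbors from different classes
    (no third instance is closer to either member than they are to each other)."""
    # Pairwise Euclidean distances; an instance is never its own neighbor
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                  # index of each instance's nearest neighbor
    links = []
    for i in range(len(X)):
        j = nn[i]
        if nn[j] == i and y[i] != y[j] and i < j:   # mutual NN, different classes
            links.append((i, j))
    return links

# Toy usage: the minority point at [0.1, 0] overlaps the majority region
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 1, 0, 0])
pairs = tomek_links(X, y)   # -> [(0, 1)]
```

In the cleaning step, the majority-class member of each detected pair (or the whole pair) would then be removed to sharpen the class boundary.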

**Figure 3.** A flow diagram of the proposed algorithm.

In this paper, the parameter k used for SMOTE was set to k = 3. After pre-processing the data, the solution of location θ<sup>*i*</sup> was generated, and the improved BFO algorithm was then performed. To address the BFO algorithm's shortcoming of falling into a local optimum, we propose incorporating particle swarm optimization into an improved bacterial foraging optimization. The proposed improved BFO algorithm improves the chemotaxis process, the reproduction process, and the elimination and dispersal process.

### *3.1. Improvement of Chemotaxis Process*

The original BFO algorithm searches mainly within the chemotaxis process. When chemotaxis searches the target area, the swimming and tumbling operations directly affect the performance of the algorithm: a large swimming step strengthens the global search ability, while a small swimming step strengthens the local search ability. Because bacteria can change direction during chemotaxis, the BFO algorithm has good local search ability and accuracy. However, the global search of the bacteria can rely only on the elimination and dispersal process, so its global search ability is poor.

Because PSO has strong memory and global search ability, an individual effect, and a group effect, in this paper PSO is incorporated into the chemotaxis process of the original BFO so as to address how easily the original BFO algorithm falls into local optima. By letting particles search first and then treating the particles as bacteria, the global search ability of the original BFO algorithm is improved. The purpose of this study is to find an effective algorithm which combines the advantages of PSO, including fast convergence speed and strong search ability, with the good classification performance of the BFO algorithm, to improve the accuracy on imbalanced data.
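The idea of embedding PSO into chemotaxis can be sketched as follows. This is a hypothetical illustration of the general principle only, under the assumption that the PSO velocity of Equation (4) replaces the purely random tumble direction of Equation (1); the exact operators of the proposed algorithm are those described in this section.

```python
import numpy as np

def pso_guided_tumble(theta, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=None):
    """Hypothetical sketch: replace BFO's random tumble direction with the
    unit direction of a PSO velocity update (Eq. (4)), so each bacterium is
    pulled toward its own best and the swarm's best location."""
    rng = rng or np.random.default_rng()
    v_new = (w * v
             + c1 * rng.random(theta.shape) * (pbest - theta)
             + c2 * rng.random(theta.shape) * (gbest - theta))
    norm = np.sqrt(v_new @ v_new)
    direction = v_new / norm if norm > 0 else v_new   # unit tumble direction
    return direction, v_new

# Toy usage: a bacterium at [1, 1] pulled toward bests at the origin
d, v2 = pso_guided_tumble(np.array([1.0, 1.0]), np.zeros(2),
                          np.zeros(2), np.zeros(2),
                          rng=np.random.default_rng(0))
```

The resulting unit vector would then drive the swim step of Equation (1), so the bacterium moves toward promising regions instead of tumbling uniformly at random.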
