1. Introduction
Quantitative structure–activity relationship (QSAR) modeling plays a vital role in drug design and discovery [1]. It aims to build the relationship between the molecular structure properties of chemical compounds and their corresponding biological activities [2]. In QSAR modeling, the structural properties of chemical compounds are encoded by a variety of features (molecular descriptors), such as topological, constitutional, and thermodynamic parameters. QSAR models can be defined as regression or classification models built with different computational strategies [3]. The features are related to biological activities by statistical methods or artificial intelligence approaches, such as Multiple Linear Regression (MLR) [4], Support Vector Regression (SVR) [5], Boosted Trees [6], and Partial Least Squares (PLS) regression [7]. In particular, machine learning methods have become extensively used in this field during the last few years [8,9,10,11,12,13,14,15]. These methods improve the accuracy of QSAR modeling to a certain extent.
However, several computational issues must be addressed when QSAR models are inferred by machine learning methods. One of these issues is handling the complexity of the data sets, that is, selecting the features that actually matter for defining a particular QSAR model. Specifically, not all features are related to the activity, and redundant or irrelevant features may cause over-fitting or weak correlation [16]. An optimal feature subset containing only relevant and non-redundant features increases the prediction accuracy and the interpretability of the QSAR model. Thus, feature selection (FS), which selects an optimal subset of all features, is a vital pre-processing step in QSAR studies [17].
In principle, feature selection is an NP-hard combinatorial problem. For a search space with $D$ dimensions, the number of subsets to search is $2^D$. In other words, the search space grows exponentially with the dimension of the given problem, so exhaustive search is intractable with limited computational resources.
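A quick count makes the growth concrete: each of the $D$ descriptors is independently either included in or excluded from a subset, so

$$\underbrace{2 \times 2 \times \cdots \times 2}_{D\ \text{factors}} = 2^{D}, \qquad \text{e.g., } 2^{50} \approx 1.13 \times 10^{15}.$$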
Evolutionary Computation (EC) techniques are optimization methods inspired by scientific understanding of natural or social behavior, and they can be regarded as search procedures at some level of abstraction [18]. In general, these algorithms can be classified as either Evolutionary Algorithms (EAs) or Swarm Intelligence (SI) algorithms [19]. EAs start by randomly generating a set of candidate solutions, iteratively recombine these solutions, and apply survival of the fittest until an acceptable solution is achieved. Classic EAs include the Genetic Algorithm (GA) [20], Differential Evolution (DE) [21], Biogeography-Based Optimization (BBO) [22], and Genetic Programming (GP) [23]. SI algorithms start with a set of individuals, and in each iteration a new set of individuals is created based on historical and shared information. A considerable number of new SI algorithms have emerged, such as Ant Colony Optimization (ACO) [24], the Bat Algorithm (BA) [25], the Firefly Algorithm (FA) [26], Cuckoo Search (CS) [27], the Coyote Optimization Algorithm (COA) [28], and Social Network Optimization (SNO) [29].
SI is a relatively new category of evolutionary computation compared with EAs and other single-solution-based approaches, and it has received considerable attention for feature selection due to its strong global search ability. In particular, the interaction between features can be considered during the screening process, which overcomes a shortcoming of traditional feature selection algorithms. The surveys [30,31] document the proven use of SI algorithms for FS.
The Artificial Bee Colony (ABC) algorithm [32], which simulates the intelligent foraging behavior of a honeybee swarm, is one of the most well-known SI algorithms. Karaboga et al. concluded that, although ABC uses fewer control parameters, it performs better than, or at least comparably to, other typical SI algorithms [33]. Ozger et al. [34] carried out a comparative study of different binary ABC algorithms for feature selection. BitABC [35] employs bitwise operators such as AND, OR, and XOR to generate new candidate solutions, while other binary ABC algorithms use different functions to convert continuous vectors into binary vectors, such as the rounding function [36], the sigmoid function [37], and the tangent function [38]. The experimental results showed that BitABC generated better feature subsets in shorter computational time. Moreover, many studies combine ABC with other optimization algorithms, such as DE [39], ACO [40], and PSO [41], and they achieve promising results as well. However, the ABC algorithm has seldom been applied to regression and prediction problems.
To improve the accuracy and interpretability of QSAR, which is a regression and prediction problem, we apply the ABC algorithm to feature selection in QSAR. The major novelties and contributions of our study are as follows:
- (1)
To avoid converting the continuous space into a discrete space and to reduce the consumption of computing resources, a two-point crossover operator and a two-way mutation operator are employed to generate food sources in the employed bee phase and the onlooker bee phase (a sketch of such operators follows this list).
- (2)
To achieve fast convergence, a novel greedy selection strategy is employed to greatly reduce the possibility of food sources being abandoned.
- (3)
Furthermore, we investigate how different values of the threshold $limit$, which determines whether to implement the scout bee phase, influence the performance of QSAR, and we draw the interesting conclusion that the scout bee phase is redundant for feature selection in low-dimensional and medium-dimensional regression problems.
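To make contribution (1) concrete, the following is a minimal sketch of what such discrete operators can look like, assuming food sources are encoded as binary descriptor masks (1 = descriptor selected). The function names, the mutation rate p, and the cut-point rule are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_point_crossover(parent, partner):
    # pick two cut points and splice the partner's middle segment
    # into a copy of the parent
    i, j = sorted(rng.choice(len(parent), size=2, replace=False))
    child = parent.copy()
    child[i:j] = partner[i:j]
    return child

def two_way_mutation(mask, p=0.1):
    # flip bits in both directions: a selected descriptor may be
    # dropped and an unselected one may be added
    child = mask.copy()
    flip = rng.random(len(mask)) < p
    child[flip] = 1 - child[flip]
    return child

# toy 10-descriptor masks
a = rng.integers(0, 2, size=10)
b = rng.integers(0, 2, size=10)
print(two_way_mutation(two_point_crossover(a, b)))
```

Because the operators act directly on binary masks, no transfer function (sigmoid, tangent, or rounding) is needed to map a continuous position back to a subset, which is the computational saving contribution (1) refers to.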
The rest of this paper is organized as follows: Section 2 reviews related work on FS methods based on SI. Section 3 briefly describes QSAR modeling and the FS problem. Section 4 presents the basic ABC algorithm and proposes two improved ABC variants for FS in QSAR. Section 5 describes the experimental datasets and parameter settings. Section 6 presents the experimental results. Conclusions are given in Section 7.
2. Related Work
SI algorithms are well known for their global exploration capability and have recently gained increasing attention from the feature selection community. The well-known "No Free Lunch" (NFL) theorem [42] proves that no heuristic algorithm can solve all types of optimization problems. Specifically, since the exploration–exploitation balance is an unsolved issue within SI algorithms, each SI algorithm introduces an experimental solution through the combination of deterministic models and stochastic principles. Under such conditions, each SI algorithm holds distinctive characteristics that properly satisfy the requirements of particular problems [18]. Therefore, no particular SI algorithm can solve all problems adequately, which motivates many researchers to investigate the effectiveness of different algorithms in different fields. Between 2010 and 2020, a total of 85 papers used SI algorithms for feature selection in different fields [30].
For medical applications, Mehrdad et al. [43] integrated node centrality with the PSO algorithm to improve FS performance. Neggaz et al. [44] applied the sine–cosine algorithm and the disruption operator to the Salp Swarm Algorithm (SSA) to improve the accuracy of disease diagnosis. Mafarja and Mirjalili [45] proposed a novel Whale Optimization Algorithm (WOA) for FS, in which crossover and mutation operators are used to enhance the exploitation of WOA. An ABC-based FS method suppressed less relevant features in breast cancer datasets; to reduce the risk of ABC being trapped in a local optimum, the classification accuracy of GBDT is employed to evaluate the quality of the inputs [46]. To select a DNA microarray subset of relevant and non-redundant features for computational complexity reduction, Indu et al. [47] proposed a two-phase hybrid model based on improved binary PSO (iBPSO). A recursive PSO method was developed by Prasad et al. [48] for gene selection. Before that, Ant Colony Optimization-selection (ACO-S) was used to generate a gene subset of the smallest size with salient features while yielding high classification accuracy [49]. Furthermore, Yan et al. [50] hybridized V-WSP, proposed by Ballabio et al. [51], with PSO to improve the accuracy of laser-induced breakdown spectroscopy. Moreover, to solve the feature selection problem for acoustic defect detection, a single-objective feature selection algorithm hybridizing the Shuffled Frog Leaping Algorithm (SFLA) with an improved minimum-redundancy maximum-relevancy (ImRMR) criterion was proposed by Zhang et al. [52]. To handle the challenge that detecting anomalies from high-dimensional network traffic features is time-consuming, an FA-based feature selection was attempted by Selvakumar and Muneeswaran [53] to obtain an optimized detection rate. In addition, FS methods based on the firefly algorithm have been investigated for Arabic text classification [54] and facial expression classification [55].
Additionally, various SI algorithms have been applied to FS in QSAR. Kumar et al. [56] first used a multi-layer variable selection strategy and then used GA to select meaningful descriptors from a large set of initial descriptors. PSO has been widely applied to descriptor selection in QSAR. For instance, Shen et al. [57] proposed a modified PSO, named PSO-PLS, for variable selection in MLR and PLS modeling. A hybridization of PSO with GA was used as an FS technique by Goodarzi et al. [58]. After that, Wang et al. [59] proposed a weighted sampling PSO-PLS (WS-PSO-PLS) to select the optimal descriptor subset in QSAR/QSPR models. Moreover, an improved binary Pigeon Optimization Algorithm (POA) was applied to selecting the most relevant descriptors (variables) in QSAR/QSPR classification models [60].
Compared with PSO and ACO, there are fewer studies applying the ABC algorithm to FS. Most of them address classification or clustering problems; ABC is rarely used to select features for regression problems. Therefore, in this paper, the ABC algorithm is used to select features for PLS modeling, the most straightforward linear regression-based modeling method in QSAR.
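As background for how such a wrapper couples ABC with PLS, the sketch below scores a candidate binary descriptor mask by the cross-validated $R^2$ of a PLS model fitted on the selected columns. The function name, component count, and CV protocol are illustrative assumptions; a real fitness function may differ, e.g., by also penalizing subset size.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y, n_components=2, cv=5):
    # wrapper fitness: mean cross-validated R^2 of a PLS model
    # restricted to the descriptors selected by the binary mask
    cols = np.flatnonzero(mask)
    if cols.size == 0:  # an empty subset cannot be modeled
        return -np.inf
    pls = PLSRegression(n_components=min(n_components, cols.size))
    return cross_val_score(pls, X[:, cols], y, cv=cv, scoring="r2").mean()

# toy data: 40 compounds x 20 descriptors, 3 of them informative
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 20))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=40)
print(round(fitness(rng.integers(0, 2, size=20), X, y), 3))
```

In a wrapper method like this, every candidate food source triggers a full model fit, which is why the computational cost discussed in Section 6 is dominated by fitness evaluations.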
6. Experimental Results and Analysis
Table 3 shows the experimental results of the six QSAR methods. The best results are identified in boldface. Without an intelligent algorithm, the PLS model selects all features in each dataset, and its mean $R^2$ and root mean square error are, respectively, 0.6 and 0.99 on the Artemisinin dataset, 0.4 and 0.85 on the BZR dataset, and 0.24 and 0.65 on the Selwood dataset. However, the performance of PLS improves when an intelligent algorithm is introduced into the model. The experimental results show that the intelligent algorithms can eliminate irrelevant features in the datasets by global or local search.
The following comparisons can be drawn from Table 3. On the Artemisinin dataset, the mean $R^2$ of ABC-PLS-1 is 3.64% larger than that of PSO-PLS and 1.48% larger than that of WS-PSO-PLS, while the root mean square error of ABC-PLS-1 is 5.73% smaller than that of PSO-PLS and 2.39% smaller than that of WS-PSO-PLS. On the BZR dataset, the mean $R^2$ of ABC-PLS-1 is 1.8% larger than that of PSO-PLS and 1.29% larger than that of WS-PSO-PLS; the root mean square error of ABC-PLS-1 is 1.49% smaller than that of PSO-PLS and 1.07% smaller than that of WS-PSO-PLS; and ABC-PLS-1 selects 5.88 fewer features than PSO-PLS and 4.14 fewer than WS-PSO-PLS. On the Selwood dataset, the mean $R^2$ of ABC-PLS-1 is 6.73% larger than that of PSO-PLS and 1.74% larger than that of WS-PSO-PLS; the root mean square error of ABC-PLS-1 is 7.67% smaller than that of PSO-PLS and 2.28% smaller than that of WS-PSO-PLS; and ABC-PLS-1 selects 5.8 fewer features than PSO-PLS and 3.16 fewer than WS-PSO-PLS. The mean $R^2$ of ABC-PLS-1 is larger than that of BFDE-PLS, and its root mean square error is smaller than that of BFDE-PLS, on all three datasets. However, ABC-PLS-1 selects more features than BFDE-PLS on the Artemisinin and BZR datasets. ABC-PLS-1 selects more features than ABC-PLS on the Artemisinin dataset, but is superior to ABC-PLS on the other two datasets.
In conclusion, although the number of features selected by ABC-PLS-1 is not smaller than that of BFDE-PLS on Artemisinin and BZR, the prediction accuracy and root mean square error of ABC-PLS-1 are clearly better than those of ABC-PLS, PSO-PLS, WS-PSO-PLS, and BFDE-PLS on all three datasets.
A rank sum test at a significance level of 0.05 is used to compare the mean $R^2$ on the three datasets and determine whether ABC-PLS-1 is significantly different from PSO-PLS, WS-PSO-PLS, BFDE-PLS, and ABC-PLS. As shown in Table 4, ABC-PLS-1 is significantly better than the others in mean $R^2$ on all datasets.
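For readers reproducing this kind of comparison, a rank sum test over per-run scores can be carried out with SciPy as sketched below; the generated numbers are placeholders standing in for the 100 per-run $R^2$ values of two methods, not the paper's data.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(2)
# placeholders for the 100 per-run R^2 values of two methods
r2_abc_pls_1 = rng.normal(0.77, 0.01, size=100)
r2_pso_pls = rng.normal(0.74, 0.02, size=100)

stat, p = ranksums(r2_abc_pls_1, r2_pso_pls)
print(f"p = {p:.3g}, significant at 0.05: {p < 0.05}")
```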
Figure 5 shows the $R^2$ obtained by each algorithm over 100 runs on the three datasets. It is obvious that the $R^2$ of ABC-PLS-1 is generally higher than that of the other four algorithms on the Artemisinin and Selwood datasets. In the last subfigure, the $R^2$ of ABC-PLS-1 is higher than that of PSO-PLS, WS-PSO-PLS, and ABC-PLS, and it is stable.
Convergence curves of the algorithms on the three datasets are shown in Figure 6. Each curve is the average of 100 runs at each iteration. ABC-PLS-1 converges faster to a good-quality solution than the other state-of-the-art methods on the Artemisinin and BZR datasets. Although BFDE-PLS finally converges to a higher-quality solution than ABC-PLS-1 on the Selwood dataset, it is greatly inferior to the others on the Artemisinin and BZR datasets, and its convergence speed is slow. Overall, ABC-PLS-1 achieves better performance than the others with respect to both convergence speed and solution quality.
Furthermore, in order to verify the validity of ABC-PLS-1, the Root Mean Square Error (RMSE) of ABC-PLS-1 is compared with that of the other four algorithms. Figure 7 presents box plots of the RMSE of the five algorithms on the three datasets; the "+" marks are outliers. As can be seen from the figure, the central line of the ABC-PLS-1 box is lower than those of PSO-PLS, WS-PSO-PLS, BFDE-PLS, and ABC-PLS on all three datasets. Therefore, the performance of ABC-PLS-1 is better and more stable than that of the others.
For a better evaluation of our proposed FS methods, not only the accuracy and the size of the feature subsets but also the computational time is investigated. The computational time, reported as mean values over 100 runs, is presented in Table 5. Like other wrapper methods, the proposed algorithm requires a high computational cost to evaluate the fitness of individuals; among the compared methods, only BFDE-PLS has a longer CPU execution time than ABC-PLS-1. However, in many high-precision applications, such as biological genetic engineering, medical diagnosis, and drug design and discovery, the accuracy of a feature selection method is far more important than its computational complexity. In these applications, the FS method with the highest accuracy is preferred, even at the cost of higher computational complexity. Although the proposed ABC-PLS-1 has no edge in time consumption, it boosts the accuracy of FS in QSAR, which is exactly what QSAR modeling needs.
According to the above experimental results, we conclude that the proposed ABC-PLS-1 performs well in QSAR. In addition, to investigate whether the scout bee phase is redundant for feature selection in low-dimensional and medium-dimensional regression prediction problems, we run further experiments on ABC-PLS-1 with different values of $limit$.
Table 6 shows the results of the three performance metrics when the scout bee operator takes different $limit$ values on the three datasets. The best results are identified in boldface. In the case of no scout bee phase, i.e., $limit = \infty$, the $R^2$ on the Artemisinin, BZR, and Selwood datasets is, respectively, 0.7731, 0.5757, and 0.9338, which is, respectively, 0.15%, 0.33%, and 0.12% larger than when $limit$ is set to 100. The root mean square errors are, respectively, 0.7468, 0.7153, and 0.1906, which are, respectively, 0.25%, 0.34%, and 0.19% smaller than when $limit$ is set to 100. The number of selected features on the Artemisinin dataset is 0.3 smaller than when $limit$ is set to 100. Therefore, the scout bee operator is redundant for feature selection on low-dimensional and medium-dimensional regression datasets.
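For context on what the $limit$ threshold controls, the canonical ABC scout step can be sketched as follows, assuming binary food sources and per-source trial counters; the names and bookkeeping are illustrative assumptions rather than the study's exact code. Setting limit to infinity reproduces the "no scout bee phase" configuration tested above.

```python
import numpy as np

rng = np.random.default_rng(3)

def scout_phase(masks, trials, limit):
    # canonical ABC scout step: a food source whose trial counter
    # exceeds `limit` is abandoned and re-initialized at random;
    # with limit = float("inf") the phase never fires
    for i in range(len(masks)):
        if trials[i] > limit:
            masks[i] = rng.integers(0, 2, size=masks[i].size)
            trials[i] = 0
    return masks, trials

masks = [rng.integers(0, 2, size=8) for _ in range(3)]
trials = [5, 120, 0]
masks, trials = scout_phase(masks, trials, limit=100)  # only source 1 resets
```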
7. Conclusions
To improve the prediction accuracy and interpretability of QSAR modeling, two ABC variants are proposed in this paper for feature selection in QSAR, namely ABC-PLS and ABC-PLS-1. In the former variant, we convert the continuous space into a discrete space by a threshold and then apply it to feature selection in QSAR. In the latter variant, to avoid converting the continuous space into a discrete space and to reduce the consumption of computing resources, a two-point crossover operator and a two-way mutation operator are introduced in the employed bee phase and the onlooker bee phase. Furthermore, a novel greedy selection strategy is employed to help the algorithm converge quickly to the optimal solution by reducing the possibility of food sources being abandoned. The performance of our proposed algorithms on feature selection in QSAR is compared with that of three state-of-the-art FS methods on three QSAR datasets. The comparison shows that the proposed ABC-PLS-1 outperforms the other algorithms not only in prediction accuracy and feature subset size but also in stability. Moreover, we study whether the scout bee phase is necessary by setting different values of $limit$, and conclude that the scout bee phase is redundant when dealing with feature selection in low-dimensional and medium-dimensional regression problems.
In future research, we will propose a multi-objective ABC algorithm for QSAR to simultaneously maximize the prediction accuracy and minimize the number of selected features.