1. Introduction
Research on protein structure and folding has a long history, dating back to the seminal work by Pauling and Corey [
1]; see also [
2,
3,
4,
5]. From a computational point of view, all-atom protein structure prediction is a challenging task. Interestingly, Finkelstein and Badretdinov [
6] approximated the worst case folding time of a protein of length
n as
exp(
λ ·
n2/3 ±
χ ·
n1/2 /2)
ns, where
λ and
χ are constants close to unity. In terms of algorithmic complexity, this implies a quasi-exponential structure prediction time with an exponent ≈
n2/3, if the structure prediction follows the folding path and elementary steps reflect structure transitions. Consequently, protein structure prediction has been shown to be NP-hard for various lattice models [
7,
8].
A common approach to tackle NP-hard problems are population-based methods and search-based heuristics, such as genetic algorithms [
9], tabu search [
10], Monte Carlo methods [
11], simulated annealing [
12,
13,
14] and quantum optimization [
15,
16]. Furthermore, simplified energy models are used for approximations of tertiary protein structures, with the most popular being the Hydrophobic-Polar (H-P) energy model [
17,
18,
19,
20] and the Miyazawa–Jernigan (M-J) energy model [
21,
22]. In the H-P model, proteins are represented by chains (within a given lattice) whose vertices are marked either as H (hydrophobic) or P (hydrophilic); H nodes are considered to attract each other, while P nodes are neutral. An optimal lattice embedding is one that maximises the number of H-H contacts, where the underlying justification is the assumption that hydrophobic interactions contribute to a significant portion of the total energy function. The M-J energy model is a more realistic model of the free energy of a given conformation, since it takes into account the interactions between specific pairs of amino acids.
In general, lattice models have been shown to be useful for the study of globular proteins [
17,
19], of the conformational space induced by (simplified) protein sequences [
4,
20] and for the analysis of a number of other features important in protein structure prediction, such as long-range interactions in proteins [
23]. While typical lattice models account only for the backbone representation of protein sequences, recent research aims at incorporating the space required for side chain packing; see the
LatFittool in [
24]. As pointed out by Moreno-Hernández and Levitt in [
20], protein structure predictions carried out in practice may consist of multiple stages that combine computational models of different complexities, where simplified models are often used in the early stages of structure predictions. In the present study, we focus on lattice models as a test case, with future research aiming at the application of the new search methodology to off-lattice models.
For comprehensive information about recent developments regarding lattice-based protein structure prediction, we refer the reader to [
25,
26]. In the centre of the present paper is the adaptation of a brand new population-based method to protein structure prediction, namely, the Firefly Algorithm (FA) devised by X.S. Yang [
27,
28]. To the best of our knowledge, FA has been applied before to protein structure prediction only in [
29] (the 2D H-P model).
For the H-P model and face-centred-cubic (FCC) lattices, a new tabu search heuristic is applied in [
30] to 21 sequences of a length between 90 and 279. The new heuristic is able to improve on previously known minimum energy values in double digit percentages. The details of how to select favourable intermediate conformations in a population-based search are addressed in [
31] (the H-P model and FCC lattices). The information can then be utilised for tabu search or related methodologies. Two energy models, namely the H-P model and a contact matrix similar to the M-J model, proposed in [
32], are merged in [
33,
34]. The authors obtain improvements over existing methods in terms of the root mean square deviation (RMSD, the overall distance between the predicted structure and the native folding identified by X-ray crystallography or nuclear magnetic resonance spectroscopy) for a number of sequences out of 12 proteins with a length ranging from 54 to 160.
In [
35], the authors devise a set of general guidelines for the application of Ant Colony Optimisationto protein structure prediction in lattice models. The guidelines are derived from computational experiments on 2D and 3D rectangular lattices combined with the H-P model.
In our study, we try to demonstrate the advantage of the Firefly-inspired method in terms of energy function evaluations, calculated over all individual runs of the population-based method. The results encourage us to analyse further larger instances for the H-P model and to adapt the approach to off-lattice models.
3. Results and Discussion
In our experiments, we tested the performance of our algorithm on both the H-P and M-J energy functions on three-dimensional cubic and FCC lattices. We take the minimum and average results and objective function counts from five runs of each combination. All experiments are run with a population size of eight and a default interval of T = 15. By selecting such a small value of T, we try to avoid that conformations different from Sbest do not move for too many steps structurally towards a conformation that is no longer the best with respect to the energy value. Future research will include finding out whether or not one can identify a range of T (depending on the sequence length) more appropriate for this particular parameter setting.
The computational experiments were executed on computers with Intel Xeon E5-4603 CPUs and 16 GB of RAM, running Debian 6 x64. The runtimes for the HP1–HP10 sequences ranged from 35 to 142 minutes on the 3D cubic lattice and from 81 to 250 minutes on the FCC lattice. For sequences MJ1-MJ5 over the cubic lattice, runtimes ranging from 89 to 221 minutes were observed. We expect these runtimes to improve with further fine-tuning of the algorithm and parameter settings.
The performance of algorithms is difficult to compare solely based upon runtime values, since the particular runtimes are affected by the underlying hardware and implementation details. Therefore, we provide data about the total number of objective function evaluations, which is denoted by Oval. The notation has been chosen in order to cover both energy function evaluations according to
Equation (7) and
Equation (8), which are counted for the member of the population labelled as
Sbest and the number of ‘structure-based’ objective function evaluations according to
Equation (17), which are executed for each of the remaining
N − 1 members of the population. Thus, if
K =
k ×
T +
T′ is the number of steps (without the initial stage) until the termination of the procedure, where 1 ≤
T′ ≤
T, then:
The values of Initi cover the initial stage before the selection of
Sbest, and (
k + 1)(
N − 1) accounts for the selection of
Sbest at the beginning of each complete
T-interval and
T′ (we note that the energy value of
Sbest is known at the beginning of each interval). By selecting such a performance measure, it is also possible to circumvent the runtime impact of how the underlying lattices (which are equal for the same problem setting) are implemented and handled in computational experiments.
In the literature, comparable data with respect to function evaluations are provided in [
52] with respect to the H-P model (ten benchmarks from
Table 1), and in [
41] for the M-J model (six benchmarks from
Table 2); in both cases, for the three-dimensional cubic lattice. The function evaluations are counted for the respective energy function, and therefore, we denote by Eval the total number of energy function evaluations. We note that in [
52], the value of Eval is equal to the product of the population size (= 20) and the number of ‘generations’. Since the number of ‘generations’ reported in [
52] is the average value over 10 independent runs, we use Eval
avg for comparison purposes. In [
41] (see Tables 9-14 there), 10 independent runs are reported individually with their respective Eval value and minimum energy value. The population size for each of the runs is 100. For comparison purposes, we calculated the average Eval
avg value over 10 runs for each of the six M-J instances.
In our computational experiments, we executed five independent runs, and we denote by Oval
avg the average number of objective function evaluations according to
Equation (18) for each individual benchmark problem. For comparison to [
41] and [
52], we introduce the parameter:
Table 4 displays our results using the H-P model on the three-dimensional cubic lattice. It shows that, over the 10 benchmark problems, our method finds an optimal conformation for every problem. This is comparable to other methods; as
Epub shows, the method presented in [
52] achieved the same performance.
Table 4.
Results for the H-P model on 3D cubic lattices for benchmarks from
Table 1.
values are from the results published in [
52].
Table 4.
Results for the H-P model on 3D cubic lattices for benchmarks from
Table 1.
values are from the results published in [
52].
ID | Ebench | Epub | Zmin | Zavg | | Ovalmin ×106 | Ovalavg ×106 | Ovalmax ×106 | speed-up |
---|
HP1 | -32 | -32 | -32 | -29.8 | 7.43 | 11.28 | 17.53 | 24.62 | 0.4 |
HP2 | -34 | -34 | -34 | -33.4 | 74.57 | 7.87 | 16.02 | 26.16 | 4.7 |
HP3 | -34 | -34 | -34 | -32.2 | 19.19 | 6.61 | 12.65 | 22.97 | 1.5 |
HP4 | -33 | -33 | -33 | -30.4 | 12.03 | 8.95 | 14.39 | 23.96 | 0.8 |
HP5 | -32 | -32 | -32 | -31.2 | 12.78 | 4.58 | 10.33 | 16.69 | 1.2 |
HP6 | -32 | -32 | -32 | -30.8 | 10.63 | 12.15 | 17.42 | 19.08 | 0.6 |
HP7 | -32 | -32 | -32 | -30.6 | 917.37 | 6.35 | 12.99 | 15.84 | 70.6 |
HP8 | -31 | -31 | -31 | -29.0 | 15.20 | 6.34 | 12.98 | 26.54 | 1.2 |
HP9 | -34 | -34 | -34 | -32.6 | 23.95 | 9.35 | 15.94 | 24.62 | 1.5 |
HP10 | -33 | -33 | -33 | -32.0 | 42.46 | 3.88 | 7.80 | 17.24 | 5.4 |
|
average speed-up total | 8.8 |
|
average speed-up, leaving out HP1 and HP7 | 2.1 |
Except for HP1, HP4 and HP6, our method improves on the number of energy function evaluations reported in [
52], and on eight out of the ten instances, Oval
min is smaller than
. If the worst (HP1) and best (HP7) performances are left out, we achieve an average speed-up of 2.1.
A potential explanation for the relatively large value of
for HP7 could be the landscape analysis carried out in [
14] for this instance: HP7 exhibits the smallest γ value, which results in a termination criterion larger by about one margin compared to most of the other sequences (see
Table 1 in [
14]).
Table 5 shows our results for the H-P model over the FCC lattice. This combination shows similar performance to the three-dimensional cubic lattice, in that both our method and the published [
53] method found an optimal conformation in 100% of the benchmark problems. We can also see that in 50% of benchmarks (HP2, HP3, HP4, HP5 and HP8), we achieved a better average conformation value than the results published in [
53].
Table 5.
Results for the H-P model on 3D FCC lattices for benchmarks from
Table 1. E
pubMin and E
pubAvg are the published minimum and average results from [
53]. Published energy function evaluation counts are not available.
Table 5.
Results for the H-P model on 3D FCC lattices for benchmarks from
Table 1. E
pubMin and E
pubAvg are the published minimum and average results from [
53]. Published energy function evaluation counts are not available.
ID | Enat | EpubMin | EpubAvg | Zmin | Zavg | Ovalmin ×106 | Ovalavg ×106 | Ovalmax ×106 |
---|
HP1 | -69 | -69 | -67.37 | -69 | -66.2 | 18.98 | 23.57 | 29.74 |
HP2 | -69 | -69 | -66.97 | -69 | -67.8 | 25.08 | 47.97 | 69.68 |
HP3 | -72 | -72 | -68.80 | -72 | -69.8 | 23.69 | 32.22 | 39.77 |
HP4 | -71 | -71 | -68.10 | -71 | -69.4 | 15.89 | 21.70 | 27.46 |
HP5 | -70 | -70 | -67.77 | -70 | -68.0 | 18.70 | 31.70 | 38.92 |
HP6 | -70 | -70 | -66.93 | -70 | -66.2 | 13.81 | 23.80 | 28.37 |
HP7 | -70 | -70 | -67.57 | -70 | -65.0 | 25.59 | 37.51 | 42.22 |
HP8 | -69 | -69 | -66.37 | -69 | -68.4 | 21.11 | 33.52 | 41.38 |
HP9 | -71 | -71 | -69.10 | -71 | -68.8 | 16.17 | 23.12 | 29.40 |
HP10 | -68 | -68 | -66.47 | -68 | -65.6 | 32.14 | 45.22 | 53.72 |
To our knowledge, there are no published energy function evaluation counts for the benchmark problems HP1–HP10 over the 3D FCC lattice. Comparing to our results for the three-dimensional cubic lattice, there is an increase in energy evaluations over all benchmark problems. This is to be expected considering the additional complexity of the 3D FCC lattice.
Table 6 shows our results from five runs for the M-J model on the three-dimensional cubic lattice. In this combination, we achieved an optimal conformation in four out of the six benchmarks. We recall that the
values are the average values of energy function evaluation counts over all 10 runs obtained by PLS in [
41] for the corresponding benchmark. We note that except for MJ3, not all of the 10 runs reported in [
41] reached the optimum value (ranges from six up to nine runs on the remaining instances).
On four of the benchmarks, we achieved the optimum free energy value. With the
ad hoc parameter settings we used, we missed out by a margin on MJ1 and and MJ4. It should be noted that the speed-up of objective function evaluations in comparison to PLS from [
41] is 1.5 and 1.4 on MJ1 and MJ4, respectively. Therefore, we think it is justified to say that, with a slightly higher computational effort, our approach will be able to achieve the optimum free energy values on these two instances. When leaving out MJ3 and MJ5, the average speed-up is 1.2 (otherwise, 1.1). As mentioned before, the population size in [
41] is 100, and we presume that with a moderate increase of the population size for the Firefly-inspired algorithms, a stronger speed-up can be achieved. Furthermore, it is important to note that in [
41], the runs terminate when a structure with minimum free energy has been found, which is not the case in our computational experiments.
Table 6.
Results for the M-J Model on 3D cubic lattices for benchmarks from
Table 2. E
pub are the results published in [
41].
are the average energy function evaluation counts for PLSfrom [
41].
Table 6.
Results for the M-J Model on 3D cubic lattices for benchmarks from
Table 2. E
pub are the results published in [
41].
are the average energy function evaluation counts for PLSfrom [
41].
ID | Enat | Epub | Zmin | Zavg | | Ovalmin ×106 | Ovalavg ×106 | Ovalmax ×106 | speed-up |
---|
MJ1 | -25.85 | -25.85 | -24.18 | -19.58 | 33.01 | 15.24 | 21.56 | 28.12 | 1.5 |
MJ2 | -25.92 | -25.92 | -25.92 | -24.16 | 12.81 | 17.37 | 28.33 | 33.87 | 0.5 |
MJ3 | -26.09 | -26.09 | -26.09 | -24.92 | 6.02 | 14.76 | 23.79 | 27.35 | 0.3 |
MJ4 | -25.87 | -25.87 | -23.59 | -18.30 | 34.58 | 18.54 | 24.59 | 23.66 | 1.4 |
MJ5 | -26.15 | -26.15 | -26.15 | -25.01 | 23.02 | 7.91 | 13.20 | 19.09 | 1.7 |
MJ6 | -26.24 | -26.24 | -26.24 | -22.73 | 36.32 | 13.73 | 25.96 | 36.80 | 1.4 |
|
average speed-up total | 1.1 |
|
average speed-up, leaving out MJ3 and MJ5 | 1.2 |
Table 7 shows our results obtained by using the energy function from [
32] over the FCC lattice compared to those from [
33].
Table 7.
Results for the energy pairwise interactions energy function from [
32] on 3D FCC lattice for the benchmarks from
Table 3.
Epub are the published results from [
33].
Table 7.
Results for the energy pairwise interactions energy function from [
32] on 3D FCC lattice for the benchmarks from
Table 3.
Epub are the published results from [
33].
ID | length | Epub | Zmin | Zavg | Ovalmin ×106 | Ovalavg ×106 | Ovalmax ×106 |
---|
4RXN | 54 | -166.88 | -166.21 | -158.90 | 20.05 | 39.82 | 52.78 |
1ENH | 54 | -153.79 | -150.64 | -144.12 | 16.86 | 35.09 | 47.52 |
4PTI | 58 | -210.29 | -213.52 | -200.86 | 23.98 | 31.88 | 58.56 |
2IGD | 61 | -183.18 | -185.15 | -179.88 | 34.71 | 64.87 | 111.22 |
Our results are comparable to [
33] on instance 4RXN, worse on instance 1ENHand better on the longer instances, 4PTIand 2IGD. Unfortunately, no data are available in published literature for comparison of the number of energy function evaluations.