*2.3. Performance of the New Predictor*

Table 1 shows the prediction performance of the new meta-strategy compared with its component predictors, as well as four other recently developed predictors, under five-fold cross-validation. In brief, the new meta-strategy developed in this project clearly outperformed the others. In terms of accuracy (Acc), balanced accuracy (Acc-b), Matthews Correlation Coefficient (MCC), F1 score, Area Under the ROC Curve (AUC\_ROC), and Area Under the Precision-Recall Curve (AUC\_PR), the new prediction strategy achieved 84.2%, 83.1%, 0.635, 0.744, 0.899, and 0.788, respectively, ranking first among all the predictors compared. The new meta-strategy ranked second on sensitivity (Sens), one percentage point behind VSL2. With regard to specificity (Spec), the new strategy was inferior to ESpritz (94%), DisEMBL (91.4%), AUCpreD (90.9%), IUPred2 (87.7%), and IUPred (87.4%). It should be noted, however, that the Sens values of these predictors are at least 15 percentage points lower than that of the new meta-strategy.

**Table 1.** Prediction performance of the new strategy under five-fold cross-validation, in comparison with the four component predictors and four other recently designed predictors.


Nota bene: the measures of predictor performance include sensitivity (Sens), specificity (Spec), accuracy (Acc), balanced accuracy (Acc-b), Matthews Correlation Coefficient (MCC), F1 score, Area Under the ROC Curve (AUC\_ROC), and Area Under the Precision-Recall Curve (AUC\_PR). The highest value for each measure is in bold and highlighted in red.

The performance of all nine predictors was also assessed on the independent dataset, as shown in Table 2. Comparing Tables 1 and 2, it is clear that although the individual numbers fluctuate, the overall levels and trends of all the performance measures are essentially the same.


**Table 2.** Prediction performance of all nine predictors on the independent dataset.

Nota bene: the new strategy was optimized five times under five-fold cross-validation; its performance was therefore also tested on the independent test dataset five times. The results shown in the table are the averages over the five runs. The highest value for each measure is in bold and highlighted in red.

The performance of the new meta-strategy and the other predictors for the twenty amino acid types was analyzed using balanced accuracy in Figure 3. Overall, the new meta-strategy had the highest Acc-b values for fifteen residue types. It also ranked first, together with the more recent predictor MFDp2, for residues P and Q. However, it ranked second for C, N, and Y, several percentage points behind MFDp2.

**Figure 3.** Comparison of balanced accuracy for the twenty amino acid types. The *x*-axis shows amino acid types in alphabetical order, while the *y*-axis shows the balanced accuracy. For each amino acid type, the predictors from left to right are: DisEMBL, IUPred, VSL2, ESpritz, PONDR-FIT, MFDp2, IUPred2A, and AUCpreD.

The balanced accuracies of all predictors for terminal residues were also analyzed in Figure 4. Clearly, the accuracy is both location and predictor dependent. For many predictors, the closer to the termini, the lower the accuracy. For N-terminal residues, IUPred, ESpritz, MFDp2, and IUPred2 achieved ~67% balanced accuracy, which was also largely location independent. For DisEMBL, VSL2, and AUCpreD, the balanced accuracies increased gradually from ~55% to ~65% over the window from the 5th to the ~15th residue and remained at a similar level afterwards. The newly designed meta-strategy had a lower balanced accuracy of ~52% for the first several residues; the accuracy then increased gradually to ~63% at the 25th residue. PONDR-FIT, a more recently developed predictor, was the least accurate for N-terminal residues, especially in the range from the 10th to the 20th residue, where its accuracy was 2–5 percentage points lower than that of the new strategy. For C-terminal residues, the patterns of accuracy differed from those of N-terminal residues. First, the balanced accuracy was generally higher than for N-terminal residues by several percentage points. Second, although the predictors' accuracies were still either location-independent or location-dependent, the values were highly diversified. AUCpreD, MFDp2, IUPred, IUPred2, and ESpritz made location-independent predictions for C-terminal residues; however, their accuracies spread from ~74% down to ~68%, respectively. The accuracies of DisEMBL, VSL2, and PONDR-FIT increased gradually from ~55–62% at the 5th residue to ~67% at the 20th residue. The accuracy of the newly designed strategy for C-terminal residues was at the lower end for the first several terminal residues, though it increased consistently and achieved the highest balanced accuracy for residues around the ~20th position.

**Figure 4.** Balanced accuracy of (**A**) N-terminal and (**B**) C-terminal residues. The *x*-axis shows the distance from the first (N-terminal) or the last (C-terminal) residue. The analysis starts at the fifth residue on both N- and C-termini. The *y*-axis shows the value of the balanced accuracy.

Based on these observations, all the samples were regrouped into three new datasets containing the 25 residues at the N-terminus, the 25 residues at the C-terminus, and the middle region, respectively. The meta-strategy was re-trained on the three datasets separately. The prediction performance of all predictors in all three regions under five-fold cross-validation is compared in Table 3. Evidently, compared with the results in Figure 4, the prediction accuracy for terminal residues improved substantially. More specifically, the improvements in accuracy, balanced accuracy, F1, and MCC on the N-ter, Mid, and C-ter datasets ranged from 1 to 5 percentage points. For sensitivity and specificity, since many other predictors were trained to maximize one or the other, the new meta-strategy normally could not compete with them on those individual measures.


**Table 3.** Comparison of prediction performance under five-fold cross-validation for the eight predictors and the new strategy trained separately on N-terminal, middle-region, and C-terminal residues. The highest value for each measure is in bold and highlighted in red.

The performance of all the predictors was then tested on the CASP10 dataset and compared with DISOPRED3, one of the two best predictors in the CASP10 competition (see Appendix A for more details). In brief, DISOPRED3 and AUCpreD performed very similarly and were better than the other predictors on multiple measures, such as specificity, accuracy, MCC, F1, and AUC\_ROC. PONDR-FIT achieved the highest balanced accuracy, while the new meta-strategy achieved the highest sensitivity. In addition to the whole-dataset analysis, per-sequence accuracy was also examined. The balanced accuracies of PONDR-FIT, MFDp2, AUCpreD, and the new meta-strategy on the CASP10 dataset are compared in Figure 5A. Symbols above the diagonal line represent sequences predicted more accurately by PONDR-FIT, MFDp2, or AUCpreD, and vice versa. For symbols in the dashed circle, none of the four compared predictors achieved satisfactory accuracy. Symbols in the dashed box constitute another group of sequences for which the prediction accuracy of the new meta-strategy is much higher than that of the other three predictors. In the pairwise comparisons, there are more open circles above the diagonal line, more triangles below it, and similar numbers of filled circles on both sides. Thus, PONDR-FIT (open circles) has better per-sequence prediction performance on the CASP10 dataset, while the new meta-strategy and AUCpreD achieved similar per-sequence performance. Since the new meta-strategy also made very low-accuracy predictions on some sequences, the potential reasons were analyzed. For this purpose, the per-sequence balanced accuracy, the fraction of experimentally validated IDAAs per sequence, and the length of each sequence were examined in Figure 5B. The figure shows that sequences with a very low fraction of experimentally validated IDAAs have very low accuracy. Therefore, the fraction of IDAAs is a critical factor for the performance of the new meta-strategy.

**Figure 5.** (**A**) Comparison of per-sequence balanced accuracy among AUCpreD (filled circle), PONDR-FIT (open circle), MFDp2 (filled triangle), and this work on sequences in the CASP10 test dataset. The reasons for selecting these predictors are: (1) they were developed in recent years; (2) they achieve higher performance on some of the accuracy measures; (3) for simplicity of visualization, only four predictors were selected. The *x*-axis shows the per-sequence balanced accuracy of this work, and the *y*-axis shows the per-sequence balanced accuracy of the other three predictors. (**B**) Per-sequence balanced accuracy of this work (*y*-axis) as a function of the fraction of experimentally validated intrinsically disordered amino acids (IDAAs) (*x*-axis). The size of the symbol is proportional to the length of the sequence.

#### **3. Discussion**

Intrinsically disordered proteins play critical roles in biomolecular interactions and signaling; therefore, identifying disordered residues is crucial for subsequent analyses and biological studies of their functions and mechanisms. Many experimental techniques have been designed to characterize these residues. Nonetheless, these techniques are normally time-consuming and/or cost-inefficient. Moreover, they may not be appropriate for proteomic studies, although many new approaches are under development [16,51,52]. Therefore, using computational tools to predict intrinsically disordered residues becomes practical, especially for novel protein sequences. In this situation, high-accuracy predictors are essential. However, as shown in the above analysis, the prediction accuracy of many disorder predictors still leaves considerable room for improvement.

There are multiple ways to improve the accuracy of machine-learning-based techniques. Tuning the list of input features is often the first attempt. Recently, deep learning and meta-strategies have also been applied to improve prediction accuracy. Our previous studies and those of other groups [41–47] have shown that a direct application of a meta-strategy may not improve prediction accuracy, although meta-strategies have been demonstrated to have many advantages [48]. In such cases, novel data-processing techniques are very helpful [48,49]. Therefore, in this project, a dual-threshold was employed, and two-step voting with different accuracy stringencies was integrated into the pipeline based on the analysis of information gain. These techniques contributed remarkably to the improvement of prediction accuracy. The outcomes of this new strategy demonstrate that: (1) integrating lower-accuracy predictors can produce higher-accuracy output; (2) the improvement in prediction performance achieved by the meta-strategy is significant, compared with the individual predictors and other state-of-the-art predictors, including deep-learning-based ones; (3) the meta-strategy yields well-balanced sensitivity and specificity, and is therefore able to achieve higher values on other evaluation measures, such as F1 and MCC; (4) the meta-strategy provides novel ideas for the renovation of existing predictors.

Many data-processing techniques could be integrated into the meta-strategy. In this project, dual-threshold and two-step significance voting were designed and proved critical for the improvement of prediction performance. Dual-threshold means that true predictions and false predictions use different threshold values; it makes it possible to control the increase of both the false positive rate and the false negative rate. Two-step voting uses two sets of threshold values at two steps: a more stringent set at the first step and a less stringent set at the second, so that results from the first step have higher reliability than those from the second. Significance voting is another very useful technique, complementary to the well-known majority voting. In majority voting, the number of predictors making true predictions and the number making false predictions compete to determine the final result. In significance voting, the Euclidean distance of a prediction score from the corresponding threshold value is calculated, and the sum of distances of predictors making true predictions is compared with that of predictors making false predictions. This technique is also beneficial for reducing prediction error. For a majority-voting-based strategy, overlap is a critical measure; in a significance-voting-based predictor, although overlap is still very important, coverage plays a more critical role. In addition, majority voting and significance voting have different preferences: majority voting is strong in selecting the subset of true predictions with very high confidence, whereas significance voting is able to pick up additional true predictions that majority voting cannot identify.
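The two voting schemes described above can be sketched as follows. This is a minimal illustration, assuming each component predictor returns a score in [0, 1] with one decision threshold per predictor; the function names and the shared 0.5 threshold in the usage example are placeholders, not the tuned values used in this work.

```python
# Hypothetical sketch of majority- vs. significance-voting over component
# predictor scores. Scores at or above a predictor's threshold count as
# "true" (disordered) calls; the rest count as "false" (ordered) calls.
def majority_vote(scores, thresholds):
    """Final call decided by the count of predictors on each side."""
    n_true = sum(s >= t for s, t in zip(scores, thresholds))
    n_false = len(scores) - n_true
    return n_true > n_false

def significance_vote(scores, thresholds):
    """Final call decided by summed distances from the thresholds,
    i.e., how far past the threshold each predictor's score lies."""
    d_true = sum(s - t for s, t in zip(scores, thresholds) if s >= t)
    d_false = sum(t - s for s, t in zip(scores, thresholds) if s < t)
    return d_true > d_false
```

With scores `[0.9, 0.9, 0.4, 0.45]` and a uniform 0.5 threshold, majority voting ties at 2 vs. 2, while significance voting resolves the case in favor of the true prediction because the two high scores sit much farther from the threshold.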

When selecting individual predictors, the overlap and coverage between a pair of predictors, or among multiple predictors, can be calculated to check their similarity and to evaluate whether combining them can improve the final prediction accuracy. If two predictors have extremely high overlap and very low coverage, they are very similar in terms of predictive results, and vice versa. Evidently, both extremes should be avoided in most cases when selecting component predictors. Normally, the selected component predictors should have a reasonable level of overlap and a higher level of coverage. The coverage values also provide an estimate of the maximum achievable true-positive and true-negative predictions when combining a pair or several predictors.
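Under one plausible reading of these two quantities (an assumption of this sketch, since the exact formulas are not restated here), overlap and coverage can be computed from the sets of residues each predictor classifies correctly:

```python
# Illustrative overlap/coverage between two component predictors.
# Assumed definitions (hypothetical): overlap = fraction of residues
# correctly predicted by BOTH predictors; coverage = fraction correctly
# predicted by AT LEAST ONE, an upper bound on their combined accuracy.
def overlap_coverage(correct_a, correct_b, n_residues):
    a, b = set(correct_a), set(correct_b)
    overlap = len(a & b) / n_residues    # shared correct predictions
    coverage = len(a | b) / n_residues   # union of correct predictions
    return overlap, coverage
```

Two nearly identical predictors would show overlap close to coverage, whereas complementary predictors show coverage well above overlap, which is the regime the text recommends for component selection.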

It should also be noted that most experimental work aiming at IDAA validation is focused on in vitro approaches, and consequently, the corresponding data analysis and computational strategies are also focused on in vitro data. However, the in vitro foldability of amino acid residues could be very different from that in the in vivo environment [53]. Therefore, novel ideas for large-scale in vivo conformational assays are also urgently needed. In fact, novel in vivo labeling strategies for IDAAs have been proposed [53]. It is hoped that these in vivo techniques, or at least the data from in vivo studies, will eventually be incorporated into novel predictors of the in vivo foldability of IDAAs.

### **4. Materials and Methods**

DisProt v7.0 and the PDB (Protein Data Bank) were combined to build the dataset of disordered residues. DisProt contains over 800 protein sequences in which IDAAs/IDRs have been identified using various experimental techniques, such as X-ray crystallography, NMR, circular dichroism (CD) spectroscopy, proteolysis, etc. For all DisProt sequences, IDAAs have already been annotated. PDB sequences were extracted using the PISCES server [54]. All PDB structures in the list have 2.5 Å or better resolution and 30% or less sequence identity. Then, 20% of the PDB sequences were randomly selected for further analysis. Missing residues in these PDB sequences were assigned as IDAAs, while all other residues were treated as structured residues. All extracted sequences from both DisProt and PDB were further filtered using CD-HIT [55] to remove sequences with 30% or higher sequence identity. The final dataset contains 312 protein sequences, comprising 30,140 disordered residues and 75,945 structured residues. All sequences with X-ray structures in CASP10 [56] were also extracted. These sequences were each aligned with all sequences in the main dataset described above, and only those with 30% or lower sequence identity were kept to form a second, independent test dataset of 35 sequences.

The infrastructure of the meta-strategy is shown in Figure 6. The prediction results of DisEMBL [57], IUPred [58], VSL2 [59], and ESpritz [60] were used as input. The major reasons for choosing these four predictors are as follows: (1) they were designed using very different strategies: DisEMBL uses artificial neural networks; IUPred uses a knowledge-based interaction potential; VSL2 uses neural networks on sequences of different lengths; and ESpritz applies a bidirectional recursive neural network (BRNN) trained separately on N-terminal, C-terminal, and general sequences; (2) they achieve relatively higher prediction accuracy; (3) they have standalone versions. The four scores were then fed into a decision-tree-based artificial neural network (DBann) to make the final prediction. The DBann combines four specific techniques: dual-threshold, significance voting, two-step selection [49], and a two-hidden-layer Artificial Neural Network (ANN). Dual-threshold uses different threshold values for true predictions and false predictions. Significance voting complements majority voting by calculating the Euclidean distance of prediction scores from their corresponding threshold values and then comparing the distances of true and false predictions to make selections. For example, when two predictors make true predictions and another two make false predictions, comparing the number of true predictions (NT) and the number of false predictions (NF) is of limited use. In this case, comparing the sum of distances from the true threshold values (dT) with the sum of distances from the false threshold values (dF) provides more useful information on the relative significance of the true and false predictions.
Two-step selection uses two sets of dual-threshold values together with significance voting as follows: (1) use more stringent values as the first-step threshold values for both true and false predictions; (2) use less stringent values as the second-step threshold values; (3) if the numbers of predictors voting for true prediction and false prediction are equal at the first step, the second-step examination is performed; if the numbers are still equal, significance voting is carried out; (4) based on the results of these comparisons, the predictive results of the individual predictors are encoded differently. The encoded predictive results are then fed into the two-hidden-layer ANN, which is fully connected and has ten and two nodes in the input and output layers, respectively, as well as twenty nodes in each hidden layer. The activation function for all nodes is the hyperbolic tangent; in addition, the output of the output layer is further transformed using a softmax function.
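Steps (1)–(3) above can be sketched as a single decision routine. This is a schematic illustration only: all threshold values are placeholders, and for brevity the sketch returns a direct decision, whereas in the actual pipeline the outcome of each comparison is encoded and passed to the ANN.

```python
# Schematic two-step selection with significance voting as the tie-breaker.
# strict_t/strict_f and loose_t/loose_f are per-predictor dual thresholds
# (true-call and false-call); scores between a pair's thresholds cast no vote.
def two_step_decision(scores, strict_t, strict_f, loose_t, loose_f):
    # Step 1: stringent dual thresholds.
    n_true = sum(s >= t for s, t in zip(scores, strict_t))
    n_false = sum(s <= t for s, t in zip(scores, strict_f))
    if n_true != n_false:
        return n_true > n_false
    # Step 2: repeat with less stringent dual thresholds.
    n_true = sum(s >= t for s, t in zip(scores, loose_t))
    n_false = sum(s <= t for s, t in zip(scores, loose_f))
    if n_true != n_false:
        return n_true > n_false
    # Still tied: significance voting on distances from the loose thresholds.
    d_true = sum(s - t for s, t in zip(scores, loose_t) if s >= t)
    d_false = sum(t - s for s, t in zip(scores, loose_f) if s <= t)
    return d_true > d_false
```

The routine only relaxes thresholds (and ultimately compares distances) when the stricter comparison is inconclusive, which preserves the higher reliability of first-step calls.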

**Figure 6.** Infrastructure of the new meta-strategy. NT and NF are the numbers of predictors making true and false predictions, respectively. "a1" and "a2" are the differences of the prediction score from the 1st-step and 2nd-step threshold values, respectively. The letter subscripts represent DisEMBL (D), IUPred (I), VSL2 (V), and ESpritz (E). "dT" and "dF" are the Euclidean distances of prediction scores from their corresponding threshold values for true and false predictions, respectively.

All selected sequences were grouped into two datasets. One contains a randomly selected 20% of all samples and served as the independent test dataset, while the other, containing the remaining 80%, was designated the training and validation dataset. The ratios of positive samples (disordered residues) to negative samples (structured residues) in the two datasets are roughly the same. The training and validation dataset was further split into five subsets for five-fold cross-validation: three of the five subsets were used to train the predictor, the fourth was used to prevent overfitting, and the last was used to validate the final prediction performance. This process was repeated five times, using different subsets for training, overfitting prevention, and validation each time. The final prediction performance is the average over all five validation subsets. The trained predictors were also evaluated on the independent test dataset.
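The 3/1/1 rotation implied by this scheme can be sketched as below. Which fold serves as the overfitting-prevention set in each rotation is an assumption of this sketch, not stated in the text.

```python
# Rough sketch of the five-fold role assignment described above: in each of
# five rotations, three folds train, one guards against overfitting (early
# stopping), and one validates. The stop-fold choice here is hypothetical.
def five_fold_roles(n_folds=5):
    for val in range(n_folds):
        stop = (val + 1) % n_folds          # overfitting-prevention fold
        train = [f for f in range(n_folds) if f not in (val, stop)]
        yield train, stop, val
```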

The performance of the predictors was assessed using sensitivity (Sens), specificity (Spec), accuracy (Acc), balanced accuracy (Acc-b, the average of sensitivity and specificity), F1 score (F1), Matthews Correlation Coefficient (MCC), Area Under the ROC Curve (AUC, or AUC\_ROC), and Area Under the Precision-Recall Curve (AUC\_PR) under five-fold cross-validation and on the independent datasets. The newly designed predictor was compared with the four component predictors (DisEMBL, IUPred, VSL2, and ESpritz), as well as four other recently developed predictors: PONDR-FIT [42], MFDp2 [61], IUPred2A [34], and AUCpreD [62].
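The threshold-based measures above follow their standard definitions from a binary confusion matrix and can be computed as below (the two AUC measures are omitted, since they require full score curves rather than a single confusion matrix):

```python
import math

# Standard scalar performance measures from a binary confusion matrix
# (tp = true positives, tn = true negatives, fp = false positives,
# fn = false negatives), matching the definitions listed in the text.
def metrics(tp, tn, fp, fn):
    sens = tp / (tp + fn)                     # sensitivity (recall)
    spec = tn / (tn + fp)                     # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy
    acc_b = (sens + spec) / 2                 # balanced accuracy
    prec = tp / (tp + fp)                     # precision, used for F1
    f1 = 2 * prec * sens / (prec + sens)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Sens": sens, "Spec": spec, "Acc": acc,
            "Acc-b": acc_b, "F1": f1, "MCC": mcc}
```

Balanced accuracy and MCC are the measures least affected by the class imbalance in this dataset (roughly 30k disordered vs. 76k structured residues), which is why they feature prominently in the comparisons above.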

Information Gain (IG) was calculated as a function of predictive score as follows:

$$IG(x) = -\sum_{i=1,2} p_i \log_2 p_i + \sum_{j=1,2} f_j(x) \sum_{k=1,2} p_{j,k} \log_2 p_{j,k} \tag{1}$$

where *p<sub>i</sub>* is the fraction of positive (*i* = 1) or negative (*i* = 2) samples in the dataset; *x* is the threshold prediction score used to split the dataset into two groups; *f<sub>j</sub>*(*x*) is the fraction of samples with prediction scores higher than the threshold (*j* = 1) or lower than the threshold (*j* = 2); and *p<sub>j,k</sub>* is the fraction of positive (*k* = 1) or negative (*k* = 2) samples in the *j*-th group.
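As a worked illustration, Equation (1) can be evaluated directly from labeled prediction scores. This sketch follows the standard sign convention in which IG is non-negative (the entropy of the whole dataset minus the fraction-weighted entropies of the two score groups); the function names are illustrative.

```python
import math

def entropy(p_pos):
    """Binary Shannon entropy (bits) of a group with positive fraction p_pos."""
    return -sum(p * math.log2(p) for p in (p_pos, 1 - p_pos) if p > 0)

# Information gain of splitting labeled prediction scores at threshold x,
# in the spirit of Eq. (1): parent entropy minus the fraction-weighted
# entropies of the above-threshold and below-threshold groups.
def information_gain(scores, labels, x):
    n = len(scores)
    parent = entropy(sum(labels) / n)
    groups = ([l for s, l in zip(scores, labels) if s > x],
              [l for s, l in zip(scores, labels) if s <= x])
    child = sum(len(g) / n * entropy(sum(g) / len(g)) for g in groups if g)
    return parent - child
```

A threshold that perfectly separates positives from negatives yields the full one bit of gain on a balanced sample, while an uninformative threshold yields zero, which is the property used when tuning the dual thresholds by information gain.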

**Author Contributions:** Conceptualization, B.X.; Methodology, B.X. and B.Z.; Software, B.Z.; Validation, B.Z.; Formal Analysis, B.X. and B.Z.; Writing, B.X.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors acknowledge with thanks the usage of DisEMBL, IUPred, VSL2, ESpritz, PONDR-FIT, MFDp2, IUPred2A, AUCpreD, and DISOPRED3.

**Conflicts of Interest:** The authors declare no conflict of interest.
