#### **1. Introduction**

Intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) play critical roles in many biological processes [1–7]. Among the possible molecular mechanisms underlying the functions of IDPs/IDRs, a major one is physical interaction with their partners through either conformational search or induced fit [8–10]. Owing to their structural flexibility, IDPs/IDRs may bind to their partners with low affinity but high specificity [11–15], and thus regulate downstream biological processes. Clearly, to characterize the dynamic process of the interaction and the mechanism of regulation, the exact locations of the intrinsically disordered amino acids (IDAAs) involved in the interaction need to be determined. However, high-accuracy experimental methods for the detection of IDPs/IDRs/IDAAs are time-consuming and costly. Moreover, high-throughput experimental identification of disordered residues, although it has attracted considerable attention and various approaches have been explored [16], remains challenging, and such methods are not currently available.

Consequently, many computational tools have been developed to identify IDPs/IDRs/IDAAs and the associated molecular interactions. The Protein Data Bank (PDB) [17], although used in the majority of cases for the three-dimensional structures of biomolecules, also contains information on residues with missing coordinates. These residues are interpreted as IDAAs. Furthermore, PDB contains the structures of molecular complexes, which frequently provide information on molecular interactions involving IDPs/IDRs. DisProt [18], the first database of IDPs/IDRs, not only collects IDPs/IDRs/IDAAs but also integrates information on their molecular partners. Similarly, IDEAL [19], another database of IDPs, incorporates the interaction networks of IDPs. DisBind [20], DIBS [21], and MFIB [22] are three recently developed databases of IDP/IDR-based molecular interactions. These databases can be used to search for IDAAs/IDRs/IDPs, or to develop computational predictors for various purposes. In fact, both PDB and DisProt are frequently used for the development of disorder predictors. In addition, PDB contains structures of complexes formed between a short IDR and another protein. Many of these short IDRs are known as MoRFs (Molecular Recognition Features) [23]. MoRFs were the first type of IDRs found to be involved in molecular interactions. Based on this discovery, many MoRF-related predictors have been developed, such as MoRF [24], MoRFpred [25], MFSPSSMpred [26], MoRFchibi [27], MoRFPred-plus [28], and OPAL [29], among many others. Furthermore, many other predictors have been developed for general binding sites/regions within IDPs/IDRs, e.g., ANCHOR [30], SLiMpred [31], PepBindPred [32], DISOPRED3 [33], and IUPred2A/ANCHOR2 [34].

All these computational tools provide information on different aspects of protein intrinsic disorder. Databases are collections of experimentally observed examples; predictors can be used to analyze novel sequences. Disorder predictors identify the location and, to some extent, the scale of flexibility of IDRs/IDAAs; binding motif predictors spot the location of binding regions; other types of predictors may provide information on various structural features and functional roles. Frequently, the outputs of disorder predictors are used as inputs for other predictors to improve the prediction accuracy [32,35–40]. Clearly, accurate identification of IDAAs is very important for studies of protein structure, intrinsic disorder, interaction, and function. Therefore, improving the accuracy of protein intrinsic disorder predictors is always desirable, though it remains a real challenge at present. Furthermore, improving the prediction accuracy of IDAAs has other important impacts on basic science. Although more and more IDPs/IDRs are being discovered, the actual content of protein intrinsic disorder in nature remains elusive. Part of the reason is that the accuracy of existing computational tools is still not sufficient. Therefore, high-accuracy predictors are still urgently needed. In addition, it can be expected that developing new predictors will give rise to novel computational strategies, which could have a much broader impact.

In our previous studies on the development of intrinsic disorder predictors [41,42], as well as in studies by many other researchers [43–47], the meta-strategy has been demonstrated to have multiple advantages over individual predictors that adopt a single computational strategy. One oversimplified but straightforward explanation for the success of the meta-strategy is that it combines the strengths of all individual predictors and thus improves the prediction accuracy. Nonetheless, a direct integration of multiple individual predictors may not improve the prediction accuracy significantly [48,49]; further integrating various data pre-processing techniques, however, can. For example, the angle-shift technique, a data pre-processing step used in protein dihedral angle prediction, remarkably improved the accuracy of an artificial neural network-based predictor [50]. A combination of non-linear transformation and principal component analysis-based dimension reduction, together with a meta-strategy, was used to improve the prediction accuracy of miRNAs [48]. Given this evidence, it is expected that other novel techniques can also be used to improve the prediction accuracy of protein intrinsic disorder. In this project, dual-threshold values and two-step significance voting were integrated into a decision-tree-based neural network to improve the prediction accuracy of IDAAs.

#### **2. Results**

#### *2.1. Prediction Performance of Component Predictors*

The ROC (Receiver Operating Characteristic) curves of the four component predictors are presented in Figure 1A. The AUC (Area Under the Curve) values for DisEMBL, IUPred, VSL2, and ESpritz are 0.78, 0.82, 0.84, and 0.88, respectively. The balanced accuracies (Acc-b) of these four predictors at their default settings are 68%, 76%, 77%, and 73%, respectively. In Figure 1B, the overlap and coverage between every two predictors were analyzed for the positive samples (disordered residues) and negative samples (structured residues). Here, overlap stands for the ratio of true-positive (or true-negative) predictions made by both predictors over the total number of positive (or negative) samples, and coverage is defined as the ratio of correct predictions made by either predictor over the total number of samples. The overlap of positive samples between predictors normally ranges from ~30% to ~50%; however, the overlap between IUPred and VSL2 reaches ~65%. The coverage of positive samples ranges from ~60% to ~80%. For negative samples, the overlap ranges from ~70% to ~90%, and the coverage is normally higher than ~90%. The highest values of coverage, shown by the bars at the far right of both panels, are ~85% and 97% for positive and negative samples, respectively. These two values may outline the theoretical upper limits of combining these four predictors.

**Figure 1.** Prediction performance of four component predictors, including DisEMBL, IUPred, VSL2, and ESpritz. (**A**) ROC curves of four component predictors. The ROC curves were obtained by using the default settings of these predictors. (**B**) The pairwise overlap (gray bars) and coverage (dashed bars) for true positive predictions (upper panel) and true negative predictions (lower panel) between each pair of predictors. Axis shows pairs of predictors as follows: D-DisEMBL, I-IUPred, V-VSL2, and E-ESpritz. All-coverage on *x*-axis stands for the maximum coverage of all predictors.
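The overlap and coverage defined above can be illustrated with a minimal sketch. It is not the authors' code: the function names and the toy data are hypothetical, and it assumes that both ratios are computed per class, i.e., with the number of residues of the evaluated class as the denominator.

```python
def overlap_and_coverage(labels, pred_a, pred_b, target):
    """Return (overlap, coverage) of two predictors for one class.

    labels, pred_a, pred_b: equal-length sequences of 0/1 per residue
    (1 = disordered, 0 = structured). target: the class to evaluate.
    Assumption: both ratios use the number of residues whose true
    label is `target` as the denominator.
    """
    idx = [i for i, y in enumerate(labels) if y == target]
    if not idx:
        raise ValueError("no residues of the requested class")
    # Residues of this class predicted correctly by both predictors.
    both = sum(pred_a[i] == target and pred_b[i] == target for i in idx)
    # Residues of this class predicted correctly by at least one predictor.
    either = sum(pred_a[i] == target or pred_b[i] == target for i in idx)
    return both / len(idx), either / len(idx)


def all_coverage(labels, predictions, target):
    """Fraction of `target` residues that at least one predictor gets right."""
    idx = [i for i, y in enumerate(labels) if y == target]
    if not idx:
        raise ValueError("no residues of the requested class")
    hit = sum(any(p[i] == target for p in predictions) for i in idx)
    return hit / len(idx)


# Toy data, made up for illustration (not taken from the paper's dataset).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
pred_d = [1, 0, 1, 0, 0, 0, 1, 0]  # hypothetical binary calls of predictor D
pred_i = [1, 1, 0, 0, 0, 0, 0, 1]  # hypothetical binary calls of predictor I

pos_overlap, pos_coverage = overlap_and_coverage(labels, pred_d, pred_i, target=1)
neg_overlap, neg_coverage = overlap_and_coverage(labels, pred_d, pred_i, target=0)
print(pos_overlap, pos_coverage)  # 0.25 0.75
print(neg_overlap, neg_coverage)  # 0.5 1.0
```

Passing all component predictors to `all_coverage` gives the kind of quantity shown as "All-coverage" in Figure 1B: the fraction of residues of a class that at least one predictor classifies correctly, which bounds what any combination of those predictors could recover.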

#### *2.2. Use Information Gain to Choose Threshold Values*

To use the new meta-strategy, the threshold values of the decision tree need to be determined first. Rather than using the method based on the distribution of positive and negative samples as a function of prediction score [49], the information gain of each component predictor on the dataset was analyzed and compared to the distribution of positive and negative samples, as shown in Figure 2. The information gain curves can be roughly characterized as single-peak distributions, and the peaks are located, roughly, on the right-hand side of the cross-point where the fraction of positive samples surpasses that of negative samples. More specifically, the peaks for DisEMBL, IUPred, VSL2, and ESpritz are located at around 0.5, 0.52, 0.64, and 0.26, respectively. By definition, the locations of the peaks provide a rough estimate of the threshold values that maximally partition positive and negative samples. Clearly, these values can hardly be determined from the distributions of positive and negative samples alone.

**Figure 2.** The distribution of information gain, positive samples, and negative samples as a function of prediction score for (**A**) DisEMBL, (**B**) IUPred, (**C**) VSL2, and (**D**) ESpritz. The *x*-axis shows the prediction score, the *y*-axis on the left shows the values of information gain, and the *y*-axis on the right shows the fractions of positive samples and negative samples at different prediction scores.
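The idea of locating a threshold at the peak of the information gain curve can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it assumes that information gain is the usual entropy reduction obtained by splitting the samples into those with prediction score at or above a candidate threshold and those below it. The function names, the scan resolution, and the toy data are hypothetical.

```python
import math


def entropy(pos, neg):
    """Shannon entropy (in bits) of a set with `pos` positive and `neg` negative samples."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(scores, labels, threshold):
    """Entropy reduction from splitting the samples at `threshold` (score >= threshold vs. below)."""
    n = len(labels)
    above = [y for s, y in zip(scores, labels) if s >= threshold]
    below = [y for s, y in zip(scores, labels) if s < threshold]
    h_parent = entropy(sum(labels), n - sum(labels))
    # Size-weighted entropy of the two partitions.
    h_split = sum(
        len(part) / n * entropy(sum(part), len(part) - sum(part))
        for part in (above, below)
    )
    return h_parent - h_split


def best_threshold(scores, labels, steps=100):
    """Scan thresholds 1/steps, 2/steps, ... and return the first one with maximal gain."""
    candidates = [i / steps for i in range(1, steps)]
    return max(candidates, key=lambda t: information_gain(scores, labels, t))


# Toy data, made up for illustration: 1 = disordered, 0 = structured.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(information_gain(scores, labels, 0.5))  # 1.0 (a perfect split of a balanced set)
print(best_threshold(scores, labels))
```

On real predictor outputs the classes overlap, so the gain curve is smooth rather than flat at its maximum, and its peak marks the score that best separates disordered from structured residues under this criterion.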
