2.3.2. Conservation of Sequence, Length and Dynamics of Linkers

Modelling (Figure 2) suggests that the length, structural disorder and rigidity of the linker are key elements of processive behavior, and that these properties may have co-evolved with the typical distance between binding sites (step size) of the given system. This inference implies evolutionary constraints on the length and physical properties of the linker regions in these enzymes. We address this issue next.

Regarding evolutionary conservation, IDPs/IDRs have been roughly classified into three classes [43]: constrained (both sequence and structural disorder are conserved), flexible (sequence varies but structural disorder is conserved), and non-conserved (both lack evolutionary conservation). The underlying assumption in this classification is that disordered regions that function by molecular recognition tend to have conserved sequences, whereas those having linker function are free to evolve, as long as they preserve their structural disorder. As shown in our modelling studies (Figure 2), however, spatial confinement does limit the acceptable length and flexibility of the linker. We therefore assessed these features of the linkers for the 12 DLD-type processive enzymes in Table 1.
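The three-class scheme can be written down as a simple decision rule. The following is a minimal sketch in Python, not the procedure of [43]: the function name `classify_region` and its two boolean inputs are hypothetical, and how "conserved" is decided for sequence and for disorder (scores, thresholds) is left outside the sketch. The combination of conserved sequence with non-conserved disorder is not covered by the classification as described here, so the sketch returns `None` for it.

```python
from enum import Enum
from typing import Optional


class DisorderClass(Enum):
    CONSTRAINED = "constrained"      # sequence and disorder both conserved
    FLEXIBLE = "flexible"            # disorder conserved, sequence varies
    NON_CONSERVED = "non-conserved"  # neither is conserved


def classify_region(sequence_conserved: bool,
                    disorder_conserved: bool) -> Optional[DisorderClass]:
    """Assign a disordered region to one of the three classes.

    Both inputs are assumed to come from an upstream analysis; the
    criteria used to call something 'conserved' are not defined here.
    """
    if disorder_conserved and sequence_conserved:
        return DisorderClass.CONSTRAINED
    if disorder_conserved:
        return DisorderClass.FLEXIBLE
    if not sequence_conserved:
        return DisorderClass.NON_CONSERVED
    # Conserved sequence without conserved disorder is not described
    # by the three classes as summarized in the text.
    return None


if __name__ == "__main__":
    # A classical entropic-chain linker is expected to be 'flexible'.
    print(classify_region(sequence_conserved=False, disorder_conserved=True))
```

Under this rule, the expectation stated above for linkers corresponds to the `FLEXIBLE` branch: disorder is preserved while the sequence is free to vary.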

In agreement with this expectation, their length shows a notably narrower distribution than that of all disordered regions and of all disordered linker regions in the DisProt database [44]. Processive enzymes have no short (<30 residues) or long (>150 residues) linkers, although there are many such examples among IDRs in general (Figure 4A). Furthermore, there are characteristic differences between the different DLD enzyme families (Figure S2), which also suggests a co-evolutionary relationship with the typical step size the enzyme takes. When the mean linker length of each family is plotted as a function of the unit size of its substrate (Table S2), linker length increases with the length of the processive step (Figure 5).

**Figure 4.** Length distribution and conservation of linker regions in DLD-type processive enzymes. (**A**) Length distribution of linkers in DLD enzymes (Table 1), in comparison with that of all disordered regions and disordered linkers in the DisProt database [44]. (**B**) Comparison of the variance (mean values of the data ± SD) of structural disorder (predicted by IUPred [41]), flexibility (as approximated by the ratio of flexible residues predicted by DynaMine [45]) and sequence (assessed by DisCons [22]) of the linkers (L) and their flanking domains (D1 and D2) of the DLD-type processive enzymes (from Table 1), calculated for sequences in the species listed in Table S2. Sequence conservation is defined in Section 4 Data and Methods.

This suggests an adaptation of linker length to the geometry of the actual substrate, which also explains: (i) the very similar linker lengths of different processive enzymes functioning on the same substrate, and (ii) the lack of very short and very long linkers in this functional class (Figures 4A and 5).
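The relationship between linker length and step size in Figure 5 is summarized by a linear fit and its coefficient of determination. The sketch below shows, under stated assumptions, how such a fit can be computed by ordinary least squares. The function name `linear_fit` is hypothetical, and the numbers in the usage example are placeholders chosen only to make the script run; they are not the values behind Figure 5 or the supplementary tables.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Placeholder data (hypothetical): unit size of the substrate and
    # mean linker length (residues) per enzyme family.
    unit_size = [3.4, 6.0, 10.4, 10.4]
    mean_linker_length = [40.0, 55.0, 60.0, 95.0]
    slope, intercept, r2 = linear_fit(unit_size, mean_linker_length)
    print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r2:.3f}")
```

With only a handful of enzyme families, a fit of this kind is best read as a trend rather than a quantitative law.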

Their particular function also suggests that selection pressure may act on their flexibility. As suggested by the above classification [43], classical entropic-chain linker functions are manifested in flexible disorder, where the sequence of the disordered region is rather free to vary but structural disorder itself is conserved; this is what is expected for the linkers of DLD-type processive enzymes. Therefore, we analyzed the evolution of these features next (Figure 4B). First, we found that structural disorder of DLD linkers is highly conserved (as defined in Section 4 Data and Methods), i.e., it shows very little variation. This does not necessarily entail conservation of the sequence (as suggested by flexible disorder [43]); in fact, we observe that linker sequences are rather free to vary. Even though structural disorder of the linkers is conserved, this does not necessarily mean that their flexibility is maintained at the same level, although flexibility is a critical determinant of the level of processivity (cf. Figure 2). Indeed, it was experimentally shown by NMR for a similar linker that flexibility is maintained despite extreme sequence variation [46]. To formally address this issue in DLD linkers, we applied the DynaMine tool developed for assessing local dynamics of IDP backbones [45]. As expected, the overall flexibility of the linker is very high and hardly varies in any of the processive enzymes (Figure 4B).

**Figure 5.** Linker length in DLD enzymes correlates with step size. Linker length (in amino acids) of the DLD-type processive enzymes (Table 1) is plotted as a function of the unit (step) size in the given substrate. The unit size is the size of the elementary unit (e.g., cellobiose in cellulose, nucleotides in RNA and DNA, cf. Table S4) derived from the geometry of the substrate, which is the first approximation of the size of the elementary steps the enzyme may take along the given substrate. The linear fit shows the correlation between the two (R² = 0.4998), whereas horizontal dashed lines show the shortest and longest linkers that occur in DLD processive enzymes (Figure 4A).

Another characteristic closely linked with the flexibility of linkers is their charge state, i.e., net charge and charge distribution, because these are among the primary determinants of the chain dimensions and conformational classes of IDPs [47], and even in the absence of hydrophobic groups, polar IDPs/IDRs may favor collapsed ensembles in water. To evaluate sequence polarity, the net charge per residue (NCPR), the total fraction of charged residues (FCR) and the linear distribution of opposite charges (characterized by the κ value) [48] are usually considered. Interestingly, for all the DLD linkers, NCPR is low and FCR is below the threshold of 0.2 (Figure S3), suggesting that they behave very similarly: they are weak polyampholytes that preferentially populate collapsed states [48]. Their low κ value (Table 1), however, suggests that they tend to have coil-like conformations. It is of note that high proline content may make the structure more extended than charge distribution alone suggests. In our case, eight out of 12 proteins have high proline content; the two proteins in the boundary region (1: Human RNAse H1 and 5: *Clostridium cellulolyticum* Cel48F, cf. Table 1) are among those that do not.
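To make the charge parameters concrete, the sketch below computes NCPR, FCR and the proline fraction directly from a sequence. It is a minimal illustration and not the tool used for Figure S3 or Table 1. It assumes that Lys and Arg carry +1, Asp and Glu carry −1, and His is neutral; the function name `charge_parameters` and the example sequence are hypothetical. The κ value is not computed here, since it requires a window-based analysis of charge segregation that goes beyond simple residue counting.

```python
def charge_parameters(sequence: str) -> dict:
    """Return NCPR, FCR and proline fraction of a protein sequence.

    Assumptions (not taken from the source): K and R count as +1,
    D and E as -1, and histidine is treated as neutral.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    positive = seq.count("K") + seq.count("R")
    negative = seq.count("D") + seq.count("E")
    return {
        # Signed net charge per residue.
        "NCPR": (positive - negative) / n,
        # Fraction of charged residues, irrespective of sign.
        "FCR": (positive + negative) / n,
        "proline_fraction": seq.count("P") / n,
    }


if __name__ == "__main__":
    # Hypothetical proline-rich linker-like sequence, for illustration only.
    example = "GSPTPTPSKPEPTSGAPSPD"
    for name, value in charge_parameters(example).items():
        print(f"{name}: {value:.3f}")
```

A linker with low |NCPR| and low FCR computed in this way would fall into the weak-polyampholyte regime discussed above, whereas a high proline fraction would point towards a more extended chain than the charge parameters alone indicate.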
