*4.5. Statistics*

To estimate a significance threshold for the number of columns showing a given conservation pattern, the number of columns conserved in a particular pattern was modelled as a Poisson distribution with λ = 595/202 (non-gap columns/possible conservation patterns). To correct for testing 202 possible conservation patterns, an initial α = 0.1 was divided by 202. The cumulative probability of a conservation pattern occurring at most *n* times first exceeds 1 − (0.1/202) at *n* = 10 columns, so any conservation pattern with more than 10 columns was treated as significant.
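The threshold calculation can be reproduced with a short script. The following is a minimal sketch using only the Python standard library, not the code from the authors' repository; the function names `poisson_cdf` and `significance_threshold` are illustrative, and it assumes that "exceeds" means the Poisson cumulative probability P(X ≤ n) is strictly greater than 1 − α/202.

```python
import math


def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Poisson(lam), summing the terms directly."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) from P(X = i - 1)
        total += term
    return total


def significance_threshold(n_columns=595, n_patterns=202, alpha=0.1):
    """Smallest n whose cumulative probability exceeds 1 - alpha / n_patterns."""
    lam = n_columns / n_patterns      # expected columns per pattern
    target = 1 - alpha / n_patterns   # Bonferroni-style corrected level
    n = 0
    while poisson_cdf(n, lam) <= target:
        n += 1
    return n


threshold = significance_threshold()
print(threshold)  # patterns with more columns than this are called significant
```

With the values quoted in the text (595 non-gap columns, 202 patterns, α = 0.1), the loop stops at 10, in agreement with the threshold stated above.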

The expected frequency of each conservation pattern was based on the frequency of conserved residues at non-gap positions for each protein. First, assuming conservation is independent between proteins, the probability of any given set of proteins being conserved was estimated as the product of the conservation probabilities of the proteins in the set, multiplied by the product of the probabilities of non-conservation for each protein outside the set.
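The independence assumption can be illustrated as follows. This is a sketch only: the per-protein conservation frequencies in `p_conserved` are placeholder values chosen for illustration and are not taken from the study, and the function name `conserved_set_probability` is hypothetical. The five protein names are those mentioned in the Supplementary Materials list.

```python
from itertools import combinations
from math import prod

# Placeholder per-protein conservation frequencies (NOT values from the paper).
p_conserved = {
    "ABCG1": 0.6,
    "ABCG2": 0.5,
    "ABCG4": 0.6,
    "ABCG5": 0.4,
    "ABCG8": 0.4,
}


def conserved_set_probability(p_conserved, conserved):
    """Probability that exactly the proteins in `conserved` are conserved.

    Multiplies p for each protein in the set and (1 - p) for each protein
    outside it, assuming conservation is independent between proteins.
    """
    conserved = set(conserved)
    return prod(
        p if name in conserved else 1.0 - p
        for name, p in p_conserved.items()
    )


# Probability that ABCG1 and ABCG4 are conserved and the other three are not.
p_pair = conserved_set_probability(p_conserved, {"ABCG1", "ABCG4"})

# Sanity check: the probabilities over all 2**5 subsets should sum to 1.
total = sum(
    conserved_set_probability(p_conserved, subset)
    for k in range(len(p_conserved) + 1)
    for subset in combinations(p_conserved, k)
)
```

Because every protein is either conserved or not, the probabilities of all possible sets (including the empty set) sum to one, which gives a quick check on the calculation.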

For each of these sets, the possible conservation patterns were generated by finding all possible partitions of the set. The relative probabilities of these partitions were calculated assuming the residues were conserved for each column independently, so the probability of any two proteins conserving the same residue was 0.05 (1/20). For each partition, the probability of a column being conserved that way, given that the set of proteins is conserved, is then $\frac{19!}{(20-m)!\,20^{n-1}}$, where $n$ is the size of the set and $m$ is the number of parts. To obtain the expected value for each conservation pattern, these values were multiplied by the probability of that set of proteins being conserved, and then by the number of non-gap positions.
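The partition step can be sketched in Python as below. This is an illustrative reconstruction rather than the authors' implementation: `set_partitions`, `partition_probability` and `expected_columns` are hypothetical names, the partition probability is read as 19! / ((20 − m)! · 20^(n−1)), and the set probability passed to `expected_columns` is a placeholder value. The sketch also assumes five proteins, under which the number of partitions of all non-empty subsets comes to 202, matching the number of conservation patterns quoted in the text.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add `first` to each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


def partition_probability(n, m):
    """19! / ((20 - m)! * 20**(n - 1)) for n proteins split into m parts."""
    return Fraction(factorial(19), factorial(20 - m) * 20 ** (n - 1))


def expected_columns(p_set, n, m, n_positions=595):
    """Expected number of columns for one pattern (p_set is supplied by the caller)."""
    return float(partition_probability(n, m)) * p_set * n_positions


proteins = ["ABCG1", "ABCG2", "ABCG4", "ABCG5", "ABCG8"]

# Count the patterns: every partition of every non-empty subset of proteins.
n_patterns = sum(
    1
    for k in range(1, len(proteins) + 1)
    for subset in combinations(proteins, k)
    for _ in set_partitions(subset)
)

# Example: two proteins sharing one residue, with a placeholder set probability.
example = expected_columns(p_set=0.0648, n=2, m=1)
```

For a fixed set of proteins, the partition probabilities sum to one, so multiplying by the set probability distributes that probability across the patterns the set can produce. For two proteins in a single part the expression reduces to 1/20, the 0.05 quoted in the text.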

**Supplementary Materials:** The following are available online at https://www.mdpi.com/1422-0067/22/6/3012/s1, Supplementary Figure S1: Overall alignment properties; Supplementary Figure S2: The conservation pattern (ABCG1, ABCG4), (ABCG2), (ABCG5), (ABCG8) on ABCG5/G8; Supplementary Figure S3: Serines and threonines in the corkscrew; Supplementary Figure S4: Binding pockets of ABCG2, ABCG5, and ABCG8; Supplementary Figure S5: Well-populated type II divergence patterns; Supplementary Table S1: Protein Sequences used to identify functionally divergent positions; Supplementary Table S2: Number of columns with a given conservation pattern across the whole alignment; Supplementary Table S3: Partial contingency table for conservation patterns in the polar relay; Supplementary Table S4: Frequency of conservation patterns in the triple helical bundle; Supplementary Table S5: Frequency of conservation patterns at dimer interfaces; Supplementary Table S6: Residues contributing to the binding pockets of ABCG2 and ABCG5/G8; and Supplementary Table S7: Mutations to positions with conservation patterns of interest, which cites [20,34,37–46,48–51,53–57,64–68].

**Author Contributions:** J.I.M.-W. designed and carried out sequence analysis. J.I.M.-W., I.D.K., N.H., S.J.B. and T.S. contributed to the final manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** I.D.K., S.J.B. and N.H. were funded by a BBSRC grant (BB/S001611/1).

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Python code used in this article is available at https://github.com/kuraisle/ABCG\_Family\_Analysis (accessed on 15 March 2021), which also includes the sequence alignment used and instructions on using the code to explore it.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
