*2.1. Overall Conservation Patterns*

A total of 174 ABCG protein sequences (summarised in Supplementary Table S1) were analysed. These were grouped according to the protein they represent, and their conservation was calculated as described in the Methods. A tree constructed from these sequences, showing the relationships between the ABCG proteins, is presented in Figure 1a. The alignment had a length of 1269 positions (henceforth "columns"). Of these, 674 columns had gaps in either >10% of all sequences or >30% of the sequences for one of the proteins (see Supplementary Figure S1a). Of the remaining 595 columns, 594 met the entropy cutoff for conservation in at least one protein. A total of 61 of these columns were conserved across the ABCG family, and the remaining 533 had some type of divergence, as summarised in Figure 1b.
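The two gap thresholds described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used for the analysis: the function name `gappy_columns`, the input layout (a list of protein label and aligned sequence pairs) and the toy alignment are all assumptions made for the example, and the entropy cutoff is not covered because it is defined in the Methods.

```python
from collections import defaultdict


def gappy_columns(alignment, overall_max=0.10, per_protein_max=0.30, gap="-"):
    """Return indices of columns that exceed either gap threshold.

    alignment: list of (protein_label, aligned_sequence) pairs, all
    sequences having the same length (hypothetical input layout).
    """
    # Group the aligned sequences by the protein they represent.
    by_protein = defaultdict(list)
    for label, seq in alignment:
        by_protein[label].append(seq)

    n_cols = len(alignment[0][1])
    flagged = []
    for col in range(n_cols):
        # Fraction of gaps over every sequence in the alignment.
        overall = sum(seq[col] == gap for _, seq in alignment) / len(alignment)
        # Largest gap fraction found within any single protein group.
        worst_group = max(
            sum(seq[col] == gap for seq in seqs) / len(seqs)
            for seqs in by_protein.values()
        )
        if overall > overall_max or worst_group > per_protein_max:
            flagged.append(col)
    return flagged


# Toy alignment with invented sequences (not real ABCG data).
toy = (
    [("ABCG1", "M-AL")] * 2
    + [("ABCG1", "MKAL")] * 2
    + [("ABCG2", "MKA-")] * 3
    + [("ABCG2", "MK-L")] * 1
    + [("ABCG2", "MKAL")] * 12
)
print(gappy_columns(toy))  # [1, 3]
```

In the toy data, column 1 is flagged only by the per-protein rule (half of the ABCG1 sequences have a gap, although only 10% of all sequences do), whereas column 3 is flagged only by the overall rule (15% of all sequences have a gap).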

**Figure 1.** (**a**) Phylogenetic tree of mammalian ABC subfamily G proteins. Tree based on 174 protein sequences, aligned with multiple alignment using fast Fourier transform (MAFFT). Names of taxa have been removed for clarity. (**b**) Pie chart showing proportions of conservation and divergence. In the 594 columns showing conservation in at least one protein in the G subfamily, 61 are totally conserved (grey); 52 show simple type I divergence (where one set has conservation, and the others do not) (green); 193 show type II divergence (where each set is conserved, but with a different residue) (cyan); and the remaining 288 have some mixture of divergence (red). For example, column 891 is a conserved cysteine in ABCG2 and a conserved leucine in ABCG1 and ABCG4, but is not conserved in the other groups, so its divergence is neither purely type I nor purely type II.

An example of conservation is shown in Figure 2. It shows one part of the interface between the TMD and NBD, which is vital for transmitting energy from ATP hydrolysis in ABCGs and is often referred to as the "elbow helix" in the ABCG literature. Columns in this region of the alignment display each of the relevant types of conservation. First, some columns show total conservation, where the column is not only conserved in each protein but is conserved in the same way (e.g., column 900 in Figure 2, where all ABCG sequences conserve arginine). Second, some columns show type I divergence, where the column is conserved as the same amino acid in at least one protein and is not conserved in the other proteins. For example, column 895 in Figure 2 is conserved as a cysteine in ABCG1 and ABCG4 but is not conserved in ABCG2, ABCG5 or ABCG8. Third, type II divergence, where each protein shows conservation but different proteins can be conserved in different ways, is evident in columns 893, 894, 897, 901, 904 and 905 in Figure 2. For example, in column 905, ABCG1 and ABCG4 conserve isoleucine, ABCG2 and ABCG5 conserve leucine and ABCG8 conserves aspartate. Finally, several other columns display a mixture of divergence types; for example, column 890 is conserved only in ABCG1, ABCG2 and ABCG4, with a different residue conserved in each.

**Figure 2.** Conservation in the alignment of ABCG protein sequences. (**a**) Sequence logo in which sequences have been divided by the protein they represent. Font size corresponds to the fraction of sequences with that residue in that column. Conserved positions have coloured backgrounds so that totally conserved columns are grey, columns with type I divergence are green, columns with type II divergence are aqua, and columns with mixed divergence are red. Conservation patterns as described in the text are shown at the bottom. (**b**) Structure of ABCG2 (PDB ID: 6VXF) highlighting the area represented in the logo. This corresponds to the elbow helix in ABCG2.

Many of the approaches used to investigate functional divergence return a score for each column reflecting how it is conserved across the whole alignment, and some are limited to a comparison between two groups. In the case of the ABCGs, one aspect of functional divergence worth exploring might be ABCG2's broader substrate specificity. If comparing two groups, examining both type I and type II divergence is worthwhile. However, the substrates transported by ABCG1 and ABCG4 differ from those transported by ABCG5/G8, so their substrate specificities are achieved in different ways. Considering possible functional divergence within ABCG members highlights some of the difficulties with terminology. Here, we have defined type II divergence to include any column in which each protein is conserved, without conservation across the whole family, and type I divergence to include columns in which one or more proteins have the same amino acid conserved, and all other proteins are not conserved. Rather than calculating scores for the whole alignment, we have classified columns according to the proteins in which they are conserved, allowing inferences to be drawn from differences between multiple groupings.
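The definitions above translate directly into a small decision rule. The sketch below is an illustration under stated assumptions, not the code used for the analysis: it assumes that the conserved residue for each protein in a column has already been determined (via the entropy cutoff described in the Methods) and is supplied as a mapping from protein name to a one-letter residue code, with `None` meaning "not conserved". The function name `classify_column` is hypothetical.

```python
PROTEINS = ("ABCG1", "ABCG2", "ABCG4", "ABCG5", "ABCG8")


def classify_column(conserved):
    """Classify one alignment column.

    conserved: dict mapping protein name -> conserved residue
    (one-letter code), or None where that protein is not conserved.
    """
    residues = [r for r in conserved.values() if r is not None]
    if not residues:
        return "not conserved"
    every_protein_conserved = len(residues) == len(conserved)
    single_residue = len(set(residues)) == 1
    if every_protein_conserved and single_residue:
        return "total conservation"
    if every_protein_conserved:
        return "type II"  # all conserved, but not all as the same residue
    if single_residue:
        return "type I"   # some conserved identically, the rest not conserved
    return "mixed"        # different residues AND some proteins not conserved


# Columns described in the text and in the Figure 1 caption.
col_900 = dict.fromkeys(PROTEINS, "R")
col_895 = {"ABCG1": "C", "ABCG2": None, "ABCG4": "C", "ABCG5": None, "ABCG8": None}
col_905 = {"ABCG1": "I", "ABCG2": "L", "ABCG4": "I", "ABCG5": "L", "ABCG8": "D"}
col_891 = {"ABCG1": "L", "ABCG2": "C", "ABCG4": "L", "ABCG5": None, "ABCG8": None}

for name, col in [("900", col_900), ("895", col_895), ("905", col_905), ("891", col_891)]:
    print(name, classify_column(col))
```

Applied to the examples given in this section, the rule labels column 900 as total conservation, column 895 as type I, column 905 as type II and column 891 as mixed.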

In this manuscript, each way of grouping the proteins according to their conservation is referred to as a conservation pattern. A conservation pattern is written by placing the family members that are conserved in that column in brackets. If more than one member has the same amino acid conserved at that position, they are held in the same brackets, separated by a comma. To illustrate this nomenclature with respect to Figure 2, the conservation patterns visible in this section of the alignment include (ABCG1, ABCG4) (ABCG2) (ABCG5) (ABCG8) in column 901 and (ABCG1, ABCG4) (ABCG2, ABCG5, ABCG8) in column 904, both of which represent type II divergence.
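As an illustration of this nomenclature, the following sketch builds the pattern string for a column from the residue conserved in each protein. It is a hypothetical helper (`pattern_name` is not a name from the source), it assumes the per-protein conserved residues are already known, and it orders the bracketed groups by the first protein they contain, which is a presentational choice made for the example.

```python
def pattern_name(conserved, order=("ABCG1", "ABCG2", "ABCG4", "ABCG5", "ABCG8")):
    """Build a conservation-pattern string for one column.

    conserved: dict mapping protein name -> conserved residue
    (one-letter code), or None where that protein is not conserved.
    """
    groups = {}  # residue -> proteins conserving it, in insertion order
    for protein in order:
        residue = conserved.get(protein)
        if residue is not None:
            groups.setdefault(residue, []).append(protein)
    # One pair of brackets per distinct conserved residue.
    return " ".join("(" + ", ".join(members) + ")" for members in groups.values())


# Column 905: isoleucine in ABCG1/ABCG4, leucine in ABCG2/ABCG5, aspartate in ABCG8.
col_905 = {"ABCG1": "I", "ABCG2": "L", "ABCG4": "I", "ABCG5": "L", "ABCG8": "D"}
print(pattern_name(col_905))  # (ABCG1, ABCG4) (ABCG2, ABCG5) (ABCG8)

# Column 895: cysteine in ABCG1/ABCG4, not conserved elsewhere.
col_895 = {"ABCG1": "C", "ABCG2": None, "ABCG4": "C", "ABCG5": None, "ABCG8": None}
print(pattern_name(col_895))  # (ABCG1, ABCG4)
```

Note that the pattern records which proteins share a conserved residue, not which residue it is, so two columns conserving different amino acids can have the same conservation pattern.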

There are 202 theoretically possible conservation patterns, of which around half (106) are observed in the alignment. Most of these have very few representatives, with over 60 having only 1–3 representatives. Remarkably, almost half of the divergent positions in the alignment are contributed by just 14 different conservation patterns (Supplementary Table S2). Some of these well-represented patterns have implications for functional divergence when the relationships between the proteins are considered.
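The figure of 202 can be reproduced by brute-force enumeration under one interpretation, which is an assumption here because the derivation is not given in this section: a conservation pattern is a choice of a non-empty subset of the five proteins (those conserved in the column) together with a partition of that subset into groups sharing a residue. The function names below are hypothetical.

```python
from itertools import combinations


def set_partitions(items):
    """Yield every way of splitting `items` into non-empty, unordered groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Add `first` to each existing group in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or give `first` a group of its own.
        yield [[first]] + partition


def conservation_patterns(proteins):
    """Yield every grouping of every non-empty subset of `proteins`."""
    for k in range(1, len(proteins) + 1):
        for subset in combinations(proteins, k):
            yield from set_partitions(list(subset))


proteins = ["ABCG1", "ABCG2", "ABCG4", "ABCG5", "ABCG8"]
patterns = list(conservation_patterns(proteins))
print(len(patterns))  # 202
```

The same total follows from summing, over subset sizes $k = 1, \dots, 5$, the number of subsets times the number of set partitions of $k$ items (the Bell numbers 1, 2, 5, 15, 52): $5 \cdot 1 + 10 \cdot 2 + 10 \cdot 5 + 5 \cdot 15 + 1 \cdot 52 = 202$. Under this interpretation the count includes the single pattern in which all five proteins share one group, i.e., total conservation.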
