*4.3. Calculation of Conservation*

First, columns in which gaps made up at least 10% of the total alignment, or 30% of the sequences of any one protein (e.g., the ABCG1 sequences), were labelled "Gap" and excluded from further analysis. Next, the conservation of each remaining column across the whole alignment was calculated. Conserved residues were detected using information theory. Following Capra and Singh [61], the Shannon entropy of a column (i.e., a position in the multiple protein sequence alignment) was calculated. For amino acids in a column, entropy can take values between zero (every sequence has the same amino acid) and log2(20) ≈ 4.32 bits (all 20 amino acids are equally likely). If the entropy was lower than 2/3 of a bit, the column was counted as conserved.
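
The gap filter and the entropy threshold described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the use of `-` as the gap character, the choice to ignore gaps when computing entropy, and reading the 30% per-protein cutoff as "at least 30%" are all assumptions of the sketch.

```python
import math
from collections import Counter

CONSERVATION_THRESHOLD = 2 / 3  # bits, as stated in the text


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column.

    Assumption: gap characters ('-') are dropped before counting.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        raise ValueError("column contains only gaps")
    counts = Counter(residues)
    total = len(residues)
    # H = sum over residues of p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def is_conserved(column):
    """True if the column entropy is below 2/3 of a bit."""
    return shannon_entropy(column) < CONSERVATION_THRESHOLD


def is_gap_column(column, proteins, total_cutoff=0.10, protein_cutoff=0.30):
    """Apply the "Gap" rule to one column.

    column   -- one character per sequence
    proteins -- protein name of each sequence, in the same order
    """
    gaps = [r == "-" for r in column]
    # Rule 1: gaps in at least 10% of all sequences.
    if sum(gaps) / len(gaps) >= total_cutoff:
        return True
    # Rule 2: gaps in at least 30% of the sequences of any one protein.
    for protein in set(proteins):
        member_gaps = [g for g, p in zip(gaps, proteins) if p == protein]
        if sum(member_gaps) / len(member_gaps) >= protein_cutoff:
            return True
    return False
```

With this threshold, a column in which nine of ten sequences share a residue (entropy of about 0.47 bits) counts as conserved, whereas an even split between two residues (1 bit) does not.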

If the Shannon entropy of the column for the whole alignment was <2/3 of a bit, the column was labelled as "All proteins conserved". Columns not labelled as "Gap" or "All proteins conserved" were then analysed by protein, e.g., the Shannon entropy was calculated for the column in the ABCG1 sequences alone, or in the ABCG4 sequences alone. If the column was not conserved in any protein (i.e., if the Shannon entropy within every protein was ≥2/3 of a bit), the position in the alignment was labelled "Not Conserved". If a column was conserved in one or more proteins, the most common residue found in each of those proteins was recorded. Each of these columns was recorded as a list of pairs, each consisting of a conserved residue and the proteins matching that residue at that column. For example, column 1011 in the alignment corresponds to the well-studied residue 482 in ABCG2. This column is conserved within each ABCG protein, but the conserved residue differs between them: in ABCG2, it is arginine; in ABCG1 and ABCG4, it is glutamine; in ABCG5, it is serine; and in ABCG8, it is histidine. The record for that column is therefore:

(1011, [('R', ['ABCG2']), ('S', ['ABCG5']), ('Q', ['ABCG1', 'ABCG4']), ('H', ['ABCG8'])])

To display a summary of sequences, logos were constructed using LogoMaker [62]. The positions in a protein corresponding to columns of interest were displayed on the structure of ABCG2 (PDB ID: 6VXF [15]) using ChimeraX [63].
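
The per-column labelling and the record format described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `label_column` and `entropy_bits` are hypothetical names, the column is assumed to have already passed the gap filter, any remaining gap characters are ignored, and the sequences in the usage example are toy data, not the real alignment.

```python
import math
from collections import Counter, defaultdict

THRESHOLD = 2 / 3  # bits, as stated in the text


def entropy_bits(residues):
    """Shannon entropy (in bits) of a non-empty collection of residues."""
    counts = Counter(residues)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def label_column(index, column, proteins):
    """Label one gap-filtered column.

    column   -- one residue per sequence
    proteins -- protein name of each sequence, in the same order
    Returns a label string, or a record (index, [(residue, [proteins]), ...]).
    """
    # Assumption: remaining gap characters are ignored.
    pairs = [(r, p) for r, p in zip(column, proteins) if r != "-"]
    if entropy_bits([r for r, _ in pairs]) < THRESHOLD:
        return "All proteins conserved"

    by_protein = defaultdict(list)
    for residue, protein in pairs:
        by_protein[protein].append(residue)

    conserved = defaultdict(list)  # most common residue -> proteins
    for protein, residues in by_protein.items():
        if entropy_bits(residues) < THRESHOLD:
            top_residue = Counter(residues).most_common(1)[0][0]
            conserved[top_residue].append(protein)

    if not conserved:
        return "Not Conserved"
    return (index, [(res, sorted(names)) for res, names in conserved.items()])


# Toy column: three sequences per protein, each protein fully conserved.
toy_proteins = (["ABCG1"] * 3 + ["ABCG2"] * 3 + ["ABCG4"] * 3
                + ["ABCG5"] * 3 + ["ABCG8"] * 3)
toy_column = "QQQ" + "RRR" + "QQQ" + "SSS" + "HHH"
record = label_column(1011, toy_column, toy_proteins)
```

Grouping proteins under their shared residue is what lets ABCG1 and ABCG4 appear together under glutamine in a single pair. The order of pairs in the returned list depends on the order in which proteins first appear in the input.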
