*4.1. Sequence Acquisition*

The RefSeq database [58] was queried through the NCBI for all nucleotide sequences matching "ABCG AND mammalia [organism]". Analysis was restricted to Mammalia to afford greater confidence that function corresponded to the identity of the protein. Initially, 778 sequences from 112 species were identified. Not every species had a full complement of sequences for ABCG1, ABCG4, ABCG2, ABCG5 and ABCG8, so, where possible, the missing sequences were found in the RefSeq database and added manually. A matching list of protein sequence IDs was used for a submission to Entrez. An in-house Python [59] script was used to check for and remove identical sequences.
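
The in-house script itself is not reproduced here. As a minimal sketch of the kind of check described, the following assumes that the sequences are held as FASTA-formatted text and that "identical" means an exact, case-insensitive match over the full sequence; the function names `parse_fasta` and `remove_identical` are hypothetical and are not taken from the original script.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def remove_identical(records: List[Record]) -> List[Record]:
    """Keep the first record for each distinct sequence, preserving input order.

    Assumption: identity is an exact, case-insensitive match of the whole sequence.
    """
    seen = set()
    unique: List[Record] = []
    for header, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique
```

Keeping the first occurrence of each sequence is an arbitrary choice made for this sketch; the original script may have resolved duplicates differently.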

Further sequences from some species were removed to prevent sequences from closely related species biasing later analysis. For example, sequences from 29 primates made up a high proportion of the total number of sequences, but presumably a low proportion of the organismal diversity. For this reason, sequences from 25 of these species were removed, keeping one ape (*Homo sapiens*), one monkey (*Piliocolobus tephrosceles*), one gelada (*Theropithecus gelada*), and one lemur (*Microcebus murinus*). Similar reasoning was used to reduce the number of species to 40. When choosing species to keep, a series of criteria were applied. First, any well-studied species (e.g., *Homo sapiens*, *Mus musculus*) were retained. Next, species in which one or more ABCG sequences were only tentatively identified (e.g., deposited in the database with the caveat "LOW QUALITY PROTEIN", or somewhat shorter than the canonical length of ABCGs, ca. 650 amino acids) were eliminated in preference to species with higher-quality sequences. A preliminary alignment of all sequences was then performed using multiple alignment fast Fourier transform (MAFFT). This alignment was processed using MaxAlign, which identifies the sequences that align most poorly with the others. If sequences from a species aligned poorly, that species was disfavoured for retention. In some cases, a species without an obvious substitute was eliminated; for example, the African elephant has only two ABCG sequences, and both are low-quality sequences that aligned poorly. For this reason, the final number of species was reduced to 35. Where species could not be distinguished using these criteria, a random integer between one and the size of the set being reduced was generated, and the sequence matching that number in alphabetical order was kept. A summary of the sequences used can be found in Supplementary Table S1.
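
Two of the selection rules above lend themselves to a short illustration: flagging tentatively identified sequences, and the random tie-break. The sketch below is not the procedure as actually run. The function names are hypothetical, and the 600-residue cut-off is an assumed stand-in for "somewhat shorter than ca. 650 amino acids", for which the text gives no exact threshold.

```python
import random
from typing import Optional, Sequence


def is_low_confidence(description: str, length: int, min_length: int = 600) -> bool:
    """Flag a sequence as tentatively identified.

    Assumption: min_length=600 is illustrative only; the text says
    "somewhat shorter" than ca. 650 residues without giving a cut-off.
    """
    return "LOW QUALITY PROTEIN" in description.upper() or length < min_length


def pick_by_random_rank(candidates: Sequence[str],
                        rng: Optional[random.Random] = None) -> str:
    """Keep one candidate: sort alphabetically, draw an integer in
    [1, len(candidates)], and return the entry at that 1-based position."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    rng = rng or random.Random()
    ordered = sorted(candidates)
    rank = rng.randint(1, len(ordered))  # randint is inclusive at both ends
    return ordered[rank - 1]
```

Passing a seeded `random.Random` instance, for example `pick_by_random_rank(names, random.Random(1))`, makes the draw repeatable, which helps if the selection needs to be reproduced later.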
