**1. Introduction**

The cytochromes P450 (CYP) constitute the largest superfamily of hemoproteins, which have been studied since the late 1940s (see [1,2] for an historical survey). With the emergence of genomic data and the quickly growing number of P450 sequences, the superfamily has been phylogenetically classified in families, subfamilies and individuals, respectively, denoted by the first identification number, the letter following it and the second identification number (e.g., in human 1A2, 3A4, etc.). More than 18,000 P450s sequences in all living kingdoms were recognized in 2013 [3], but, since then, the number of known sequences keeps growing and can be estimated to be higher than 300,000 taking into account the current plant genomic projects [4]. CYPs are found in many bacteria, plants and animals. It is estimated that, in human or mammal metabolism, 75% of drug transformation reactions involve catalysis by P450s [5,6]. The secondary and tertiary structures of the CYPs have largely been

conserved throughout evolution [7]. The crystal structures of a large number of CYPs, in both the free and the substrate-bound forms, have been solved. The core is highly conserved within the structural fold, and the heminic active site of the CYP is buried inside the enzyme [7].

The human genome encodes 57 CYP isoforms, that play a major role in the biotransformation of drugs, pesticides or many other chemicals, and in the metabolism of endogenous compounds such as steroids and vitamins [8,9]. The detoxification reaction mediated by these CYPs can yield reactive intermediates which can damage DNA, as well as lipids and proteins [10], while the alteration of their activity often leads to serious diseases [11]. Tables of substrates of human isoforms are available [6,12–14]. Despite several attempts to predict substrates of CYPs [15–19], no clear prediction rule is known. It is known that lipophilicity can play a crucial role [20] and a summary of substrates/selectivity rules is proposed [21]. However, there is an urgent need to improve the accuracy, interpretability and confidence of the computation models used in drug discovery process (see [19] and references therein).

In this paper, we consider the human isoform 3A4. It lies in the human liver and is estimated to contribute to the phase I metabolism of roughly half of the drugs on the market [22,23]. The other isoforms accounting for more than 90% of the oxidation of drugs are 1A2, 2A6, 2C9, and 2D6 [24]. Although most of the CYPs have a binding allosteric site for their substrates which reversibly accommodates one molecule of substrate at a time, 3A4 isoform can accommodate more than one molecule in its binding site at the same time [25]. Recent advances about CYP3A4 show limited information on the pathways to the heminic site [26,27].

P450s catalyze an oxidation where the substrate binds in the active site on the distal side of the heme. Although the oxidation step has been investigated for a long [28–31], the ingress and egress of the compounds to and from the active site remain unclear. Structural flexibility is essential to allow chemical compounds to get in and out the active site, and it was shown that it correlates with substrate preferences for several CYPs, including for 3A4 [32].

Few biophysical and biochemical approaches have been proposed by wet biology teams to experimentally address the role of ligand access channels, as reviewed in [33], but never for CYP3A4. To our knowledge, only one article presents clear-cut results suggesting that a ligand diffuses through a given channel and not another one [34]. The authors mutated selected residues in one of the channels (double mutant Y309C/S360C) to introduce cross-linking by disulfide bond that resulted in one channel closure. They could then measure the kinetics of metabolism of two different substrates (benzphetamine and 7-EFC, i.e., 7-ethoxy-trifluoro-coumarine), and show that the double mutant exhibited unchanged activity for benzphetamine (98% of wild type activity), while it dropped to 19% compared to wild-type activity for 7-EFC, indicating that the two substrates do not cross the same channel to access the active site. This experiment necessitates choosing carefully the two residues to mutate, and obtaining an active form of the recombinant enzyme, which is never obvious.

Molecular dynamics based studies of the channels of several CYPs were performed [32,35–55]. Some of them were applied to 3A4 [32,42–45,48,51,53,55]. However, these studies do not lead to a consensus on the number and type of channels for a given isoform. Due to the prohibitively long time scale required to observe opening or closing of channels, we preferred to use a rapid geometric method to identify through which paths are travelling the compounds. Identifying and characterizing the access channels and their lining amino acids is not a trivial task because the channels can dynamically open/close in response to water or ligand passage and enzyme breathing motions [56].

In this context, we computed cavities and channels with CCCPP, which takes in account both the size and the shape of the ligands through a cylindrical model of the ligands, proved to be more realistic than the spherical model used almost everywhere in the literature [57]. The effect of the ligands conformational flexibility on their shape was taken in account in preliminary studies [58,59]. Then, we found that the 3A4 isoform has three major conformations while only two conformations are considered in the literature [60].

For the present study, we built version 2 of CCCPP to perform a refined analysis of the channels. Unless otherwise stated, further mentions of CCCPP refer to its enhanced version. The secondary structure of the members of the P450 family is described by a nomenclature defined by Poulos et al. [61] (see also Figure 1 in [62] and Figure 2 in [63]), which is widely used in the P450 scientific community and that we use throughout this paper. The major description of the channels [64] is based on a geometric method in terms of the secondary structures at which there is an egress of the computed channels. These channels were computed with the software CAVER [65], and gave rise to a channel nomenclature which is still in use [64]. However, there are dozens of other cavities and channels calculation softwares (see [57] for a review). They give a variety of results due to the diversity of output data structures. Thus, it is rather difficult to compare these results. For example, the CAVER based nomenclature established in [64] from all CYP crystal structures available in the Protein Data Bank (PDB) in March 2006 (total: 143 PDB files, 192 chains, 26 CYPs) indicates 14 channels, while CYP3A4 alone (PDB code 1TQN, 1 chain, included in the 2006 study) gave rise in 2012 to 21 channels with MOLEonline 2.0 [66], one of the successors of CAVER. In fact, only three channels were attributed to 3A4 in the 2006 study. Such discrepancies are not unusual. They appeared also between the P450*cam* (CYP101) channels calculated with CAVER and MOLEonline and the ones computed in [67], and the situation remains unclear despite the help of several molecular dynamics simulations [35,37]. The experimentalists are left with dozens of published software packages and they have to face to a huge of potentially contradictory results about the channels they are looking for: Which channels should be retained? Easy and rapid comparisons are needed. Giving the name of the secondary structure at which there is a channel egress does not suffice to describe the channels. For a given CYP chain, most of the channels have common parts. Thus, in our opinion, the network of channels should be described with the help of graph theory tools, in terms of paths along nodes and edges, as done in the present study. To compare these networks for different input CYPs, it is better to give a full description of the channels in terms of protein heavy atoms and residues, not only at the egress locations of the channels, but also all along the channels. These functionalities were unavailable in the original version of CCCPP described in [57]. Thus, no more visual examination of the secondary structures is needed to locate the egress of the channels, as it was needed with CAVER. Moreover, the lists of atoms and residues are returned by CCCPP, plus the data structure defining the boundary of each channel. This latter functionality was also available in the version 1 of CCCPP. Throughout this paper, channels named 1, 2a, 2b, etc., refer to the nomenclature of Cojocaru et al. [64] based on the secondary structures elements at the protein surface where the channels emerge.

#### **2. Methods**
