3.1. Comparison of the nLC-MS Files
Three digestions were performed on three biological replicates, yielding nine samples. In order to enable meaningful comparisons across proteases, every sample preparation step was kept rigorously identical for each sample (protein extraction, reduction and alkylation, protein amount digested, dilution factor, SPE desalting and LC-MS analyses), with the exception of the digestion steps themselves where optimum conditions were applied as recommended by the manufacturer to maximise protease efficiency. In particular, unique protease:protein ratios (1:50 for A, 1:100 for C and 1:25 for TL), tailored digestion buffers (50 mM Tris pH 8.0 for TL, 100 mM Tris/10 mM CaCl2 pH 8.0 for C and 50 mM ammonium bicarbonate pH 7.8 for A), different temperatures (37 °C for A and TL and 25 °C for C) and two digestion times were employed (1 h for A, and 18 h for C and TL). Alternative digestion parameters might lead to improved results and the reader is encouraged to test them; this is however outside the scope of this work.
All nine nLC-MS maps show reproducible diagonal separation patterns (
Figure 2A) demonstrating that low m/z features eluting early are less hydrophobic whereas features characterized by higher m/z elute are late and are therefore more hydrophobic.
Digest patterns resulting from the action of rAsp-N or trypsin/Lys-C occupy most of the retention time and m/z windows, whereas chymotryptic patterns do not exploit such large windows with only a few peptides greater than m/z 1200 and eluting after 35 min. The depletion in late-eluting hydrophobic peptides released by chymotrypsin is also very evident on the base peak chromatograms (BPC,
Supplementary Materials Figure S2). The elution window of the most abundant peptides is protease-specific; prominent peptides elute from 20–30 min when rAsp-N is used, from 15–27 min when chymotrypsin is used and from 17–30 min when trypsin/Lys-C is employed (
Supplementary Materials Figure S2).
The signal intensities are comparable across all samples and stretch from 4 to close to 10
8 (
Figure 2B). A violin plot of the cluster volumes further indicates that the bulk of the volumes range from 10 to 1000 (
Figure 2C), with tryptic/Lys-C digests displaying slightly greater volumes. A principal component analysis (PCA) plot demonstrates the high levels of reproducibility of the biological triplicates tightly grouped together, albeit to a lesser extent when chymotrypsin is employed (
Figure 2D). The fact that the three digest profiles form a triangular shape on the PCA plot also illustrates how distinct and complementary they are from each other, a testament to their orthogonality.
The numbers of MS and MS/MS scans per sample are listed in
Table 2, along with the number of clusters.
The numbers of MS scans range from 10,391 (bud2_C) to 13,423 (bud1_TL), the numbers of MS/MS scans fluctuate from 8458 (bud2_C) to 11,828 (bud1_TL), and 82,091 (bud2_C) to 91,784 (bud1_A) clusters can be resolved under our nLC-MS conditions. Those numbers are comparable to what was previously reported [
1,
2], unsurprisingly given that the protein extraction and analytical methods are exactly the same.
Based on the averages indicated in
Table 2, the three proteases rank as follows: TL > A > C. Standard deviations (SDs) are low and coefficients of variation (CV) inferior to 7%. This ranking coincides with the protease:protein ratios (1:25 for TL > 1:50 for A > 1:100 for C); therefore, it would be interesting to repeat this experiment by keeping a consistent protease:protein ratio (for instance 1:100, which is often used in shotgun proteomics) and verifying whether it evens out all the differences highlighted above.
3.2. Database Search Yield and Duration
Protein sequences are the fundamental determinants of biological structure and function. A protein sequence database is required to match an acquired spectrum to its theoretical counterpart. The database comprises the AA sequences of all proteins that are expected in the sample. This is why specific protein databases arising from genome sequencing projects of the species of interest are ideal. However, if that is not readily available, sequences from a closely related species must be explored. Issues related to the database search include variant proteins, sequencing errors or homologous proteins from another species [
43]. Several
C. sativa genomes have been sequenced [
35,
36,
37,
38], and the number of predicted gene models varies from 27,819 to 34,589 [
33].
In our previous experiments, a small
C. sativa FASTA database retrieved from UniProt Knowledge Base was searched to identify proteins from medicinal cannabis apical buds [
1,
2,
3]. This database corresponded to what we refer to as Uniprot515 in the present study. Whilst highly specific, extremely well annotated and originating from a high reputable source, such a small database does not capture the genetic richness of this very unique plant and clearly underrepresents the actual number of
C. sativa proteins.
To overcome this shortfall, we have retrieved two significantly larger
C. sativa FASTA databases from various sources, which we named homemade95k and JO29k (
Table 1). We have also searched the non-specific but curated SwissProt
viridiplantae (green plants) database (SPGP40k), which comprises only 19
C. sativa accessions out of close to 40 thousand entries, to test what was gained or lost when the search space was not limited to the species of interest.
A continued focus of ours are the phytocannabinoid and terpenoid pathways; 21 of the enzymes involved in these metabolisms and reviewed in UniProt (i.e., originating from SwissProt) are gathered in a minimalist FASTA database called SP21. In all, we searched five databases of varying sizes and specificity (
Table 1).
All the proteins identified using the five databases and the two algorithms (SEQUEST and Mascot) for the nine samples are listed in
Supplementary Materials Tables S1–S5. When searching the SP21 database, 18 accessions out of 21 (85.7%) are identified across all nine samples. A search with Uniprot515 database produces 72 accessions out of 515 (14.0%) sequences. Exploring the JO29k database yields 1343 accessions out of 29,057 (4.6%). Using the largest database, Homemade95k, 1442 accessions out of 95,069 (1.5%) are found across all nine samples. Finally, interrogating the less specific SPGP40k database leads to the identification of 819 accessions out of 39,800 (2.1%) entries. Identification results are summarized in
Table 3.
The number of identities varies from 9 (SP21 Mascot bud123_A) to 1322 (Homemade95k SEQUEST bud1_TL) (
Table 3). Within a given database, that number fluctuates by up to 58% (from 549 to 1322 in Homemade95k) across the nine samples. Of course, if we focus our attention on one database, one search engine and one digestion, the number of accessions identified is much more comparable (CVs < 11.5%), further confirming the acceptable reproducibility noted above across the biological triplicates.
There is a clear positive relationship between the number of proteins identified and the size of the database. Considering SEQUEST results only since Mascot could not be applied to JO29k, on average 15 proteins are identified using the SP21 database (72.5% of all the database entries), Uniprot515 produces 54 identifications (13.2%), 1006 accessions are listed with JO29k (3.5%), Homemade95K yields 944 identities (1.2%) and 653 proteins are identified using SPGP40k (1.6%).
The percentages listed in
Table 3 correspond to the number of identities relative to the total number of entries in the database searched; those values show that the greater the database the smaller the percentage. For instance, up to 76% of the accessions listed in the smallest database, SP21, are matched (SEQUEST bud1_TL), but as little as 0.6% of the entries comprised in the largest database, Homemade95k, are identified (Mascot bud3_C). These opposite trends between number of identities and percentages per database are better visualised in the histogram in
Supplementary Materials Figure S3A.
The UniProt Knowledge Base (
https://www.uniprot.org/) collates data from SwissProt and TrEMBL, thus providing annotated sets of protein sequences, predicted from sequenced genomes for many species, in particular model organisms. Manually curated and reviewed protein sequences emanate from SwissProt, whereas automatically annotated and unreviewed proteins originate from TrEMBL. Searching against SwissProt thus ensures that the identifications are based on high quality protein information [
44]. Limiting the search space to the set of sequences expected in the sample by restricting the database to the species of interest increases the biological relevance of the results. A specific taxonomy can be selected from the UniProt website which covers model organisms better than non-model species such as
C. sativa, where species-unique proteins are missed. In such cases a related species with similar sequences is to be used; alternatively, if no close relatives exist in UniProt, a whole taxum can be searched. For less studied plant species, the
viridiplantae taxa is the best taxonomy offered by UniProt. In the case of
C. sativa, there are currently 19 reviewed (SwissProt) entries and 494 unreviewed (TrEMBL) entries hosted in the UniProt repository. In this work, 72 accessions from the UniProt repository are identified overall, including 17 accessions from SwissProt (
Supplementary Materials Table S2).
The NCBI protein database (
https://www.ncbi.nlm.nih.gov/protein) gathers sequences from several sources (GenBank, RefSeq and Third-Party Annotation (TPA), SwissProt, PIR, PRF and PDB) and makes them publicly available. GenPept translations exist for each of the coding sequences within the GenBank Nucleotide database; consequently, more than one protein sequence might correspond to a nucleotide sequence record. When UniProt builds become available, they are loaded into NCBI. The RefSeq project at the NCBI (
http://www.ncbi.nlm.nih.gov/refseq/) has several missions: maintaining and curating annotated genomic, transcript and protein sequence records; leveraging data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation and collaboration to produce a standard set of stable, non-redundant (nr) reference sequences; adding references to publications, functional features and informative nomenclature [
45]. GenBank is a public repository of DNA sequences built from community data submissions to INSDC, as well as daily data exchanges from the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA) and GenBank at NCBI [
46]. We retrieved from the NCBI repository 37,654
C. sativa AA sequences that are fully annotated, including 36,521 accessions from RefSeq, 899 entries from TrEMBL and 234 sequences from SwissProt. Overall, 834 accessions originating from the NCBI database are identified in this work (
Supplementary Materials Table S4), one from SwissProt and all the others from RefSeq.
The Medicinal Plant Genomics Resource (MPGR,
http://medicinalplantgenomics.msu.edu/) stems from the Medicinal Plant Consortium (MPC). Initiated in 2010 and funded by the National Institutes of Health (NIH), MPC gathers 13 collaborating research units from 7 institutions. MPC aims to provide publicly available transcriptomic and metabolomic resources for 14 key medicinal plants for the worldwide research community for the advancement of drug production and development. It wishes to bridge the gap between genomic information and the highly specialized secondary metabolisms of plants with promising medical applications such as
C. sativa. A total of 57,411
C. sativa AA sequences are available from MPGR, which exceeds the genetic richness mentioned above and therefore might host redundant sequences. MPGR accessions lack protein descriptions, which we added by applying the blastp sequence alignment algorithm [
42] to the GenBank nr database. Overall, 608 accessions originating from the MPGR database are identified in this work (
Supplementary Materials Table S4). Most of them (530, 87%) match a
C. sativa protein, 22 (4%) match accessions from
Trema orientale, and 17 (3%) match proteins from
Parasponia andersonii. Both species belong to the
Cannabaceae family and are closely related to
C. sativa [
47].
The JO29k database created by Jenkins and Osburn [
39] is also publicly available and contains 29,057 entries. The authors employed trypsin to digest protein extracts from various tissues from various cultivars of
C. sativa plants followed by a fractionation method prior to shotgun LC-MS/MS analyses. By maximizing sample diversity (genetic backgrounds, vegetative and reproductive tissues) and by prefractionating the tryptic digests, they managed to not only achieve extensive proteome coverage with the identification of 17,269 open reading frames but also validate genome annotations using proteogenomics. The authors do not indicate how many proteins were identified in mature female flowers. In our study, using the JO29k database, we identified 1343 accessions in mature buds.
While it is obviously advantageous to search larger specific databases since they generate longer lists of identifications, it is computationally taxing, particularly when dynamic modifications are added and an unlimited number of miscleavages is allowed, as was done in this study, because all those parameters greatly increase the search space.
Table 4 details the search durations for each sample.
For the databases containing less than a thousand entries, search durations take minutes, whereas hours are needed when several thousands of entries are interrogated. For the smallest databases, SP21 and Uniprot515, search durations span from 8 to 12 min and 16 to 26 min, respectively (
Table 4). For databases of comparable size, such as JO29k and SPGP40k, search durations fluctuate from 19 min to 1 h 22 min and from 2 h 18 min to 4 h 17 min, respectively. The largest database, Homemade95k, necessitates the longest search durations, from 5 h 21 min to 25 h 28 min (
Supplementary Materials File F1).
Table 4 has been converted into a histogram for ease of interpretation in
Supplementary Materials Figure S4A.
Interestingly, proteases also influence the amount of time the searches take. Above a critical database size (to be determined but from this experiment anywhere between 515 to 29,057 entries), searches take up to three times longer for rAsp-N-released peptides than for tryptic and chymotryptic peptides (
Table 4 and
Supplementary Materials Figure S4A). We have averaged all search durations across each of the five databases to produce
Supplementary Materials Figure S4B, which shows a marked increase in the duration of the search as a function of the number of entries.
The search engines also perform differently with a clear advantage for SEQUEST over Mascot when the large database Homemade95k is searched. For instance, SEQUEST searched rAsp-N-released peptides three time faster than Mascot (
Table 4 and
Supplementary Materials Figure S4A).
3.3. Comparison of Proteases and Their Proteolytic Efficiencies
In this study, we used three orthogonal digestions with proteases of increasing selectivity levels, chymotrypsin (C), trypsin/Lys-C (TL) and rAsp-N (A).
The success identification rate follows the order previously observed with the number of MS and MS/MS scans (
Table 2). Typically, more accessions are identified when using trypsin/Lys-C, followed by rAsp-N and lastly chymotrypsin (
Table 3 and
Supplementary Materials Figure S3A). The difference in identification success among proteases becomes also more evident with larger databases. The Venn diagrams in
Supplementary Materials Figure S3B further exemplify this with the Homemade95k database, as well as indicating how many of the accessions are unique to each of the proteases or shared among them. For instance, when applying the SEQUEST algorithm, 1108 accessions are identified with TL, 674 with A and 385 with C. Only 265 (17%) accessions are shared among TL and A, 79 (5%) among TL and C and 17 (2%) among A and C. A total of 242 (11%) accessions are common to all proteases. The Venn diagram for Mascot is very similar. Even though some proteases yield longer lists of identities, in particular TL, they all complement each other, as attested by the high number of protease-specific identities (e.g., for SEQUEST 522 TL-specific, 150 A-specific and 47 C-specific protein accessions). This is expected because rAsp-N, trypsin/Lys-C and chymotrypsin are completely orthogonal, target different AA residues and consequently produce unique peptides (
Supplementary Materials Table S1).
Protease complementarity was also observed in our previous study where we tested single, double and triple digestions using orthogonal proteases [
2]; taking olivetolic acid cyclase (OAC) as an example, we illustrated that full coverage of its AA sequence could only be reached by combining the sequencing data from all the proteases since none of them individually produced 100% coverage. Similarly, in the present study a wider coverage is achieved upon merging all the sequencing information produced by the A, C and TL proteases (
Supplementary Materials Tables S1–S5). From these results and those obtained in our previous multiprotease experiment [
2], we conclude that trypsin/Lys-C is the best single digestion method systematically yielding the largest number of identifications regardless of the database used.
Other studies have applied a multiple protease strategy to increase the proteome depth and sequence coverage. As early as 2002, Choudhary and colleagues demonstrated that 93.9% coverage of a recombinant tissue plasminogen activator could be achieved by the combination of trypsin, Lys-C and Asp-N, covering respectively 88.2%, 62.8% and 34.9% of the 527 AA sequence [
11]. Whilst covering the least, Asp-N proved essential as it spanned regions of the recombinant protein that were not explored by either trypsin or Lys-C. A similar observation was made on bovine serum albumin [
48]. Swaney and colleagues compared trypsin to highly selective proteases, namely ArgC, AspN, GluC and LysC, and observed that while trypsin yielded the greatest number of unique identifications, the alternative proteases identified different proteins thus augmenting the proteome depth [
49]. Asp-N ranked second with respect to the number of unique peptides identified, albeit achieving a lesser sequence coverage.
We mentioned above that perhaps applying similar protease:proteins ratios during the digestion step might lead to comparable success rates among the different proteases. We also need to factor in protease proteolytic efficiencies or how effectively proteases find their target AA residue and cleave their substrate. This is assessed by the number of missed cleavage sites. The manufacturer Promega (
https://www.promega.com.au) ranks proteolytic efficiencies as follows: TL > A > C (
Figure 1). The website stipulates that trypsin/Lys-C yields less than 10% missed cleavages of R and K residues, thus realizing more than 90% efficiency; rAsp-N achieves 85% digestion efficiency (no missed cleavage) of D residues after 1h. Under our conditions (1M Guanidine-HCl), chymotrypsin loses 20% cleavage efficiency (
https://www.promega.com.au) of Y, F and W residues. We must also consider variations in fragmentation efficiencies of the peptides as both trypsin/Lys-C and chymotrypsin leave a proton on the peptide C-terminus, whereas Asp-N leaves it on the N-terminus of the released peptides.
The SEQUEST search program allows for up to twelve miscleavages whereas Mascot only allows for up to nine miscleavages. We have previously discussed the benefits of setting a number of miscleavages greater than two [
2], particularly in the context of middle-down proteomics.
Table 5 presents the distribution of missed cleavage sites observed in our experimental data.
The majority (60–89%) of the peptides matched in this study do not contain any miscleavage. However, a significant proportion (11–40%) does include missed cleavage sites, indicating that our digestions are incomplete. This is further confirmed by subtracting the number of matched peptides with a missed cleavage site (miscleavage > 0) from the number of matched peptides without missed cleavage (miscleavage = 0) to compute the excess of limit-digested peptides (ELDP) [
50]. If the proteolysis was total, the ELPD values indicated in
Table 5 would be much smaller.
The fact that the digestion is incomplete is not an issue in our study. It just warrants allowing for more miscleavages in the search parameters, which will result in longer search times, as was discussed above. However, this is also advantageous in an MDP context where more missed cleavage sites create longer peptides and ultimately greater sequence coverage, as was demonstrated in our previous study [
2] and confirmed in this new study. The peptide sizes (i.e., masses) are reported in
Table 6.
Identified peptide masses range from 604.3 D (Homemade95k) to 7600.9 D (SP21) and they average 2032.5 D with a huge standard deviation (SD,
Table 6A), indicating that the size of many peptides falls outside the average mass. There is a trend that the larger the database, the smaller the identified peptides.
If we take a closer look at the protease level, rAsp-N produces the longest peptides, averaging from 2.0 to 2.5 (+/− 0.9–1.2) kD (
Table 6B). This is expected because rAsp-N is highly selective and targets only the N-terminus of D residues. The largest peptide originates from the action of rAsp-N on CBCAS (WO/2015/196275Al), weighs 7.6 kD, hosts only one miscleavage and matches the following AA sequence: DLFWAIRGGGGENFGIIAACKIKLWVPSKATIFSVKKNMEIHGLVKLFNKWQNIAYKYDK.
Proteases that are less selective and target multiple sites such as trypsin and chymotrypsin produce shorter peptides averaging 1.9 kD. The longest peptides arising from the action of C or TL present more missed cleavages. For instance, the chymotryptic peptide EILSGKSRGAAAATESLTDSSAEFGETSSSISSSEISTEDVKVKGSSSPPHLGWPIRRADVRKSF from the C. sativa rop guanine nucleotide exchange factor 5-like protein (XP_030490016.1) weighs 6954.3 D and contains 5 miscleavages. In another example, the tryptic peptide VSRLDLKKLRFGAANRYGFRVGLGKTHLSANFSDEVASWKKFRNQR from the C. sativa uncharacterized protein LOC115699895 (XP_030483299.1) weighs 5539.9 D and carries 10 miscleavages.
In our previous shotgun study, we evidenced the positive relationship between the number of miscleavages and the size of the peptides [
2]. We discussed how advantageous this feature is in a middle-down proteomics context and recommended applying a number of missed cleavages greater than two, as is usually the case, during the database search stage. This is also confirmed in the present work. Swaney and colleagues applied up to three missed cleavages and reported increased length of Asp-N-released peptides relative to that of tryptic peptides [
49]. Giansanti and colleagues observed 0–2 miscleavages for trypsin and Lys-C and 0-4 miscleavages for Asp-N and chymotrypsin [
48]. Cristobal and colleagues reported Asp-N-released peptides bearing more than four miscleavages and a greater median size than tryptic peptides [
51].
3.4. Comparison of the Search Algorithms
A plethora of search engines are available to the proteomics community to turn tandem mass spectra of peptides into AA sequences [
22,
25,
27,
28,
29]. All of these algorithms rely upon the same fundamental elements: read protein sequence databases, emulate enzymatic cleavage to peptides, extrapolate PTMs, apply a tolerance of observed precursor and fragment masses, predict fragment ions for each peptide sequence, and compare observed and expected fragments [
29]. In our past shotgun proteomics studies [
1,
2], we used SEQUEST, which was designed for instruments manufactured by Thermo Scientific such as the Elite LTQ-orbitrap mass analyser employed here. In this study we compare two of the most commonly used search algorithms, SEQUEST and Mascot.
The SEQUEST program was created in 1994 to correlate tandem mass spectra of digested protein mixtures from a yeast cell lysate with AA sequences hosted in a database [
23]. Amino acid sequences are converted into a fragmentation pattern used to match fragment ions in a MS/MS spectrum. The number of peaks the sequence shares with the experimental spectrum are counted to generate the SEQUEST preliminary score or Sp [
23]. Two key calculations assess whether a peptide sequence is a confident match for a fragmentation spectrum: 1) XCorr, a statistical calculation of the correlation of the theoretical and experimental spectra and 2) ΔCN, the difference between the top peptide spectrum match (PSM) and the second best PSM [
8]. SEQUEST was exclusively licensed to Thermo Scientific instruments and incorporated into Proteome Discoverer 1.4 package [
29]. Since its inception in 1994, SEQUEST has undergone a series of improvements [
29], including the addition of dynamic modifications [
52] and the ability to interrogate nucleotide databases through six-frame translation [
53].
Developed in 1999, the Mascot program incorporates a probability-based scoring which allows discrimination against false positives, can be compared with other probabilities such as sequence homology and can be optimized by iteration [
24]. To maximise search speed and reduced data, FASTA format sequence databases are compressed, and multiple spectra originating from the same precursor are summed together. Tandem MS data are converted to peak lists of centroided mass values associated with intensity values. The match significance depends on the size of the database. Fixed and variable (so called dynamic in SEQUEST) modifications can also be included [
24]. Several common causes of failure to find a peptide match are considered in the Mascot program: (a) enzyme nonspecificity, (b) incorrect determination of precursor charge, (c) underestimated mass measurement error, (d) unsuspected chemical and post-translational modifications and (e) peptide sequence not in the database [
43].
A pubmed survey (
https://www.ncbi.nlm.nih.gov/pubmed/) with the following key words “proteom* AND mascot” or “proteom* AND sequest” indicates that even though SEQUEST predates Mascot by five years, more proteomics publications contain the term Mascot (751) than the term SEQUEST (346). This can probably be explained by the fact that the SEQUEST search engine is only distributed with a Thermo Scientific instrument whereas Mascot is a stand-alone license that can be purchased independently of the instrument used. The distribution of publications per year is available in
Supplementary Materials Table S6.
The number of accessions identified varies slightly depending on which search algorithm is employed, with SEQUEST always yielding more identities (
Table 3 and
Supplementary Materials Figure S3A). For instance, when Uniprot515 is searched, SEQUEST yields an average of 68 (+/−3) accessions, whereas Mascot produces 41 (+/−5) accessions. When a much larger database like Homemade95k is interrogated, the gain becomes more evident with 1117 (+/−118) SEQUEST-related accessions and 772 (+/−211) Mascot-related accessions. If time is a constraint and very large databases are interrogated, then SEQUEST is to be favoured, with search durations up to three times faster than when Mascot is used, as demonstrated for the largest database, Homemade95k, in
Table 4 and illustrated in
Supplementary Materials Figure S4A.
When all the algorithm-specific matches are considered for the four databases where both SEQUEST and Mascot were applied and the results represented as Venn diagrams, it paints a slightly different picture (
Supplementary Materials Figure S3C). With the exploration of the SPGP40k database, Mascot identifies 735 (out of 819 IDs, 90%) accessions across all nine samples, thus slightly outperforming SEQUEST, which yields 710 (87%) identifications; 626 (76%) are common between the two algorithms. Overall, irrespective of the database, 61% to 78% of the matches are shared among both algorithms (
Supplementary Materials Figure S3C). Some accessions are unique to SEQUEST or to Mascot, thus boosting the number of proteins identified when both programs are taken into account. Therefore, if utilizing several search algorithms is a possibility, prospective researchers should consider it.
To our knowledge, four studies have compared Mascot and SEQUEST search engines on diverse samples [
51,
54,
55,
56], however none originating from plants. Shen and colleagues reported that the number of peptides identified using Mascot was only 40–60% of that obtained using SEQUEST, attributed to numerous Mascot-related false negative identifications. Mascot rejected many peptides whose masses fell within the set tolerance and matched unique sequence tags composed of more than seven residues. The authors conclude that Mascot operates better for well-resolved, small and doubly charged peptides [
55]. Tu and colleagues also report wide differences between both search engines; however these discrepancies could be leveled out using a post-processing program such as Percolator [
56]. Cristobal et al. indicate that Mascot performs better than SEQUEST on deconvoluted MS/MS data because the latter rewards data-rich spectra such as those exhibited by large fragments displaying a wide charge envelop [
51]. Very recently, Agten and colleagues debate that resorting to multiple algorithms to search MS/MS data actually hides the information on complementarity and agreement among the search engines at the level of spectrum identification [
54]. They stipulate that the percentage sequence agreement on peptide identification at the spectrum level can assess the rate of agreement between the search engines better than a Venn diagram of matched peptides or identified proteins. The combination of both sequence annotation and sequence confidence is achieved using Mondrian-like plots and shows that Mascot matches more tandem spectra than SEQUEST [
54].
Other studies have utilized additional search engines on top of SEQUEST and Mascot, including Spectrum Mill, X!Tandem, PeptideProphet and Sonar [
57], X!Tandem and OMSSA [
58], MaxQuant [
59], InsPeCT, OMSSA, x!Tandem and MyriMatch [
28], Andromeda and SimSpectraST [
60], and MaxQuant and Andromeda [
60]. The more algorithms, the less the overlap across the identification results, as clearly illustrated with Venn diagrams or scatter plots. This begs the following question: do we consider only the common hits, or do we accept all matches regardless of their origin? We argue that all those search engines have been well designed and validated, and therefore, all peptide hits should be considered. Indeed, multiplying search algorithms has led to improved overall identification numbers and confidence [
8,
22].
3.5. Sequence Coverage and Post-Translational Modifications (PTMs)
Every shotgun proteomics experiment strives at producing the longest list of protein identifications with the broadest sequence coverage possible in order to distinguish between the various isoforms and detect PTMs.
The sequence coverage of all the proteins identified using the five different databases and listed in
Supplementary Materials Tables S1–S5 have been turned into histograms and scatterplots factoring in sequence length using the number of AAs or the MWs of the proteins identified in this study (
Figure 3).
When small databases such as SP21 and Uniprot515 are interrogated, there is a trend showing that short proteins achieve greater sequence coverage (
Figure 3A,B), albeit with many exceptions. For instance, with SP21, four proteins composed of 385 AAs, 3,5,7-trioxododecanoyl-CoA synthase (OLIS) and polyketide synthases 1, 2 and 4 (PKSG1, PKSG 2 and PKSG 4) are well covered (up to 79% with rAsp-N), whereas other proteins of similar size, such as naringenin-chalcone synthase (CHS) and chalcone synthase-like protein 1 (CHSL1), only reach a maximum of 39% and 27% coverage, respectively (
Figure 3A). Other exceptions are olivetolic acid cyclase (OAC) and cytochrome c (CYC), composed of 101 and 111 AAs, respectively; while small, these proteins’ sequences are not completely covered. Similar observations can be made with Uniprot515; generally speaking, small proteins are better covered, but some exceptions are found (
Figure 3B). When large databases like JO29k, Homemade95k and SPGP40k are explored, the trend described above becomes very clear and scatterplots confirm the negative relationship between protein MWs and sequence coverage irrespective of the protease used, as can be seen in
Figure 3C–E.
The
C. sativa proteins annotated in the UniProt Knowledge Base are known to carry modifications, and we have experimentally validated some of them using a top-down proteomics strategy [
3]. Consequently, in this study we have included the following dynamic PTMs to the search method: N-term acetylation, acetylation and methylation of K residues, oxidation of M residues, phosphorylation of S, T and Y residues and the attachment of N-acetyl-D- glucosamine (NAG) glycogroups to N residues. Examples of
C. sativa proteins bearing NAG glycosylations are CBDAS (A6P6V9) and THCAS (Q8GTB6) [
61,
62,
63]. Furthermore, following the DTT reduction and iodoacetamide alkylation of proteins during sample preparation, cysteine residues involved in disulfide bonds are expected to be reduced and carbamidomethylated. The number of fixed and dynamic PTMs discovered in this experiment are reported in
Table 7.
Depending on the database employed, between 25 and 44% of the identified peptides harbor one or several modifications. The number of PTMs varies from 277 (SP21 and Uniprot515) to 1272 (Homemade95k), again exhibiting a positive relationship with the size of the database. Most PTMs are carbamidomethylations (602/1272, i.e., 47% of all PTMs in Homemade95k,
Table 7). This is expected as many proteins comprise disulfide bridges in their secondary structures and as such hold a pivotal role in their folding, stability and activity [
64].
The proportions of dynamic PTMs fluctuate in a database-dependent fashion. For example, the second largest category of PTMs is phosphorylation for Uniprot515 (57/277, 21%) and Homemade95k (201/1272, 16%) databases, but it is methylation for JO29k (114/833, 14%) and SPGP40k (158/667, 24%) databases (
Table 7). Acetylations, whether they are located at the N-terminus of the protein or not, are well represented, particularly with SP21 (47 + 21 = 68/277, 25%), Homemade95k (91 + 132 = 223/1272, 18%) and SPGP40k (44 + 71 = 115/667, 17%). Oxidation only affects a small proportion of peptides (5–13%), suggesting that no artefactual oxidation was introduced during the sample preparation steps.
In this study, several peptides decorated with NAG (also called GlcNAc) are detected (1–4%). With SP21, they are witnessed on polyketide synthase 2 (PKSG2), cannabichromenic acid synthase (CBCAS), cannabidiolic acid synthase (CBDAS), cannabidiolic acid synthase-like 2 (CBDAS3), tetrahydrocannabinolic acid synthase (THCAS) and inactive tetrahydrocannabinolic acid (THCAI) (
Supplementary Materials Table S1). Using Uniprot515, they are additionally found on ocimene synthase (A0A5C1IY38), hedycaryol synthase (A0A4Y5QVX6) and mevalonate kinase (A0A1V0QSI0). Such NAG N-linked glycosylation sites have been reported for CBDAS (
https://www.uniprot.org/uniprot/A6P6V9) [
62], CBDAS3 (
https://www.uniprot.org/uniprot/A6P6W1), THCAS (
https://www.uniprot.org/uniprot/Q8GTB6) [
61,
65], THCAI (
https://www.uniprot.org/uniprot/Q33DQ2) and CBCAS (patent WO/2015/196275 Al [
41]). Most of these synthases are involved in highly specific secondary metabolisms, the phytocannabinoid, terpenoid and mevalonate pathways.
PTMs of cannabis proteins have also been reported by Jenkins and Osburn [
39]; oxidation of methionine residues was the most common modification, followed by acetylation of lysine residues and phosphorylation of serine and threonine residues. Interestingly, the authors indicate that the acetylated proteins were unique to mature flowers and absent in leaves and stems from the male plants. Protein PTMs represent a major level of cellular regulation, acting either swiftly and reversibly, such as phosphorylation, or slowly and irreversibly, such as certain forms of glycosylation. Whilst gene expression merely regulates protein abundance, PTMs control their three-dimensional structures, thus revealing or concealing active sites and interfaces for protein–protein interaction, which in turn modulates the protein subcellular localization, stability and activity. Acting as molecular switches of proteins, PTMs may initiate and inhibit the interaction of proteins with DNA, cofactors and lipids, as well as with other proteins [
66]. The human proteome is the best proteome characterized so far and MS has enabled the discovery of most of the PTMs known today. A catalog of 81,721 unique phosphorylated peptides belonging to 11,025 proteins substrates of kinases, 29,031 unique ubiquitinylated peptides corresponding to 5769 proteins substrates of ubiquitin ligases, 16,693 acetylated peptides from 7098 proteins that are substrates of acetylases and 7977 proteins and carboxy-terminal peptides for 6778 proteins confirming a large number of translation start and stop sites have been established [
67]. As evidenced in this study and previous works [
1,
2,
39],
C. sativa hosts numerous PTMs with many more to be discovered as the number of proteomics experiments on
C. sativa gain momentum, leveraged by a relaxation in the legislation. Neither genomics nor transcriptomics analysis can identify PTMs, only protein analyses can deliver such valuable information. Experimentally detecting PTMs using MS is a first step; functionally characterizing them is another critical step that is needed in order to shed more light onto the biology of this unique plant.
3.6. Database Specificity and Gene Ontology (GO)
Four databases contain exclusive sequences from
C. sativa, whereas SPGP40k includes all the sequences from SwissProt limited to the
viridiplantae (i.e., green plants) taxonomy. The 819 identifications obtained in this study using the green plant taxonomy emanate from 175 different plant species. The histogram in
Supplementary Materials Figure S5 displays 32 species represented by more than 4 accessions.
Most identities (267/819, 33%) come from
Arabidopisis thaliana, which is the model plant species and therefore the most studied, sequenced and best annotated organism. Then,
Oryza sativa, the cereal model, ranks second with 67 (8%) accessions. Only eight (1%) accessions originate from
C. sativa, which ranks 11th and is equally placed with
Cucumis sativus and
Gossypium hirsutum (
Supplementary Materials Table S5). The SPGP40k database comprises 19
C. sativa (CANSA) entries, which corresponds to only 0.05% of the number of total entries. All the 19 CANSA entries are also included into the SP21 database, which yields 18 accessions when searched in this study (
Supplementary Materials Table S1). Therefore, it is not clear why only 8 (out of 21, 42%) CANSA accessions are found when SPGP40k is searched; it is as though they underwent a database dilution. Because decoy searches were performed and PSM validated, we do not expect false positives to occur in our analyses. Sequences from non-model species such as
C. sativa are greatly underrepresented in the most reputable protein reference database, UniProt, even though this species’ genome has been sequenced and annotated in NCBI. It would be useful for the proteomics community to have these sequences and all their known annotations (e.g., GO terms, PTMs, signal peptide, etc.) available from the UniProt Knowledge Base. Unsequenced organisms primarily face the challenge of bioinformatic data analysis, particularly when no close relatives have been sequenced [
68]. The proteome of another non-model species,
Artemisia annua, was also mined using various databases originating from RNA sequencing or retrieved from NCBInr and UniProt repositories. The searched databases were limited to only
A. annua species or included the whole
viridiplantae taxonomy; accordingly, the number of entries ranged from 118 to more than 1 million sequences [
69]. Large specific databases led to the identification of almost 700 accessions, whereas
viridiplantae databases listed about half that number. Similar conclusions were drawn more recently on cacao, also a non-model species [
70]. The authors report that the largest number of identities (906) were obtained using a database made of the
Theobroma cacao genomic sequences translated into six reading frames and containing 59,577 entries. The
T. cacao UniProt/Trembl and NCBI databases yielded 897 and 870 protein identifications, respectively. NCBInr
viridiplantae, the largest database searched in this study in excess of 3 million entries, identified 759 proteins. Both these works thus demonstrate that database specificity rather than exhaustivity is a key factor to consider for proteomics analyses; we also report this on
C. sativa. When studying non-model plant species for which no genomic sequencing data is available, searching the
viridiplantae database and its sub-taxonomies has proven invaluable to explore their proteomes, as was evidenced in pomegranate [
71], quinoa [
72],
Pinus occidentalis [
73] and cumin [
74].
Using the UniProtKB Retrieve/ID mapping online tool (
https://www.uniprot.org/uploadlists/) and the Uniprot accessions identified using Uniprot515 and SPGP40k databases, we performed a GO classification; this is not feasible with accession numbers from MPGR and NCBI. The longer the list of annotations, the more exhaustive the insight into the plant biology, as displayed in
Supplementary Materials Figure S6 based on the Biological Processes classification.
As more proteins are identified when a large database such as SPGP40k is used (
Supplementary Materials Figure S6B) relative to a small database such as Uniprot515 (
Supplementary Materials Figure S6A), more biological processes are listed. In the case of medicinal cannabis, 581 metabolic processes appear including nitrogen compounds (340), primary (452), small molecules (224) and organic substances (508) metabolisms. Yet, only 7 results are assigned to the secondary metabolism.
C. sativa manufactures a plethora of compounds found nowhere else, the best known being the phytocannabinoids [
30,
31,
32,
33,
34]. This is not reflected in the classification depicted in
Supplementary Materials Figure S6B because no other species resembles cannabis, which is truly unique. This
viridiplantae database gap in SwissProt and UniProt must be urgently filled.