*3.9. Proteogenomics*

After a genome is assembled, dedicated programs annotate ORFs automatically. However, this annotation is not definitive, and continuous curation is needed. Transcriptomic analysis yields complete gene models, including the untranslated regions that are key to understanding post-transcriptional regulation mechanisms. In addition, proteogenomic analysis, such as that reported here, is a powerful approach for identifying non-annotated genes, correcting misannotations, and validating gene annotations [23]. For this purpose, the experimentally obtained peptide spectra were searched against all polypeptides longer than 20 amino acids predicted from the ORFs found in the six possible translation frames of the recently re-sequenced genome of the *L. infantum* JPCM5 strain [13]. The majority of the identified peptides fit the current gene annotation well (available at TriTrypDB.org and http://leish-esp.cbm.uam.es/). Nevertheless, some peptides mapped to non-annotated coding regions in the *L. infantum* (JPCM5) genome, uncovering eight novel protein-coding genes (Supplementary File, Table S15). Figure 6 shows an example of a novel hypothetical protein found on chromosome 27, together with the MS spectra of the two peptides that allowed its identification.
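The six-frame search database described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this work: the function names are hypothetical, the standard genetic code is assumed, and an ORF is assumed here to be any stretch between two stop codons, a definition the text does not specify.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the standard nuclear code applies).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_polypeptides(dna, min_len=20):
    """Collect stop-to-stop polypeptides longer than min_len residues
    from the three forward and three reverse translation frames."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = translate(seq[offset:])
            # Stop codons ("*") delimit the candidate polypeptides.
            for segment in protein.split("*"):
                if len(segment) > min_len:
                    results.append((strand, offset, segment))
    return results
```

Calling `six_frame_polypeptides(chromosome_sequence)` on each chromosome would give the candidate polypeptides, which could then be written to a FASTA file and supplied to a peptide search engine as the sequence database.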

As mentioned above (and illustrated in Figure 5), some of the detected peptides mapped to regions located upstream of annotated coding sequences (CDS). This led to the addition of N-terminal extensions to 34 annotated proteins and the establishment of new translation start codons for the corresponding genes (Supplementary File, Table S16). All the detected peptides were confirmed to be unique, and the accuracy of their MS/MS spectra was manually verified.

**Figure 6.** Identification of a novel protein based on the mapping of two experimentally detected peptides to a region of *L. infantum* JPCM5 chromosome 27 that currently lacks an annotated ORF. The new CDS, named LINF\_270022950 (pink box), fits well within a predicted transcript (LINF\_27T0022950; green box). Interestingly, the predicted amino acid sequence is well conserved when compared with proteins annotated in the genomic assemblies of *L. major* LV39 (ID: LmjVL39\_270022400) and *Leishmania gerbilli* LEM452 (ID: LGELEM452\_270022800).
