Benjamin Heiniger (Zurich / CH), Ulrich Omasits (Zurich / CH), Lydia Hadjeras (Wuerzburg / DE), Christian Schori (Zurich / CH), Jakob Meier-Credo (Frankfurt / DE), Elena Evguenieva-Hackenberg (Giessen / DE), Cynthia Sharma (Wuerzburg / DE), Julian D. Langer (Frankfurt / DE), Wolfgang Hess (Freiburg / DE), Christian Ahrens (Zurich / CH)
Complete prokaryotic genome sequences represent an optimal basis for the comprehensive annotation of all protein-coding genes. Yet, in stark contrast to the bounty of genome data, our ability to accurately predict all protein-coding genes is still severely limited [1]. This is best illustrated by the growing evidence that so far unannotated, small ORF-encoded proteins (SEPs) carry out numerous key biological functions [2,3], including roles in cell-cell communication and as peptide antibiotics that shape microbiome composition.
The genome annotation problem covers i) the large differences in the number of coding sequences (and their precise start sites) predicted by different annotation centers for an identical genome sequence, ii) the severe under-representation of SEPs, and iii) the continuously changing annotation over time, which is difficult to track. To address the first two issues, we developed a generic proteogenomics approach that can integrate and consolidate different reference annotations, ab initio gene predictions and in silico ORF predictions in a large but minimally redundant integrated proteogenomics (iPtgxDB) search database that covers the entire protein-coding potential of a prokaryotic genome [1]. Such iPtgxDBs have been successfully applied in numerous projects to uncover proteomic evidence for novel SEPs.
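The consolidation idea can be illustrated with a minimal sketch. It is not the iPtgxDB implementation. It assumes that predictions on the same strand that share a stop coordinate describe the same ORF with alternative start sites, so they can be collapsed into one entry that keeps every start and every supporting source. The `Prediction` type, the `consolidate` function and the source labels are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Prediction(NamedTuple):
    source: str  # hypothetical label of the annotation or prediction source
    strand: str  # "+" or "-"
    start: int   # genome coordinate of the predicted start codon
    stop: int    # genome coordinate of the predicted stop codon


def consolidate(predictions):
    """Collapse predictions that share strand and stop coordinate.

    Assumption for this sketch: a shared stop on the same strand means
    the same ORF, possibly with differing predicted start sites.
    """
    clusters = defaultdict(list)
    for p in predictions:
        clusters[(p.strand, p.stop)].append(p)

    consolidated = []
    for (strand, stop), members in sorted(clusters.items()):
        consolidated.append({
            "strand": strand,
            "stop": stop,
            # every distinct start site is kept, so no candidate is lost
            "starts": sorted({m.start for m in members}),
            # sources that support this ORF
            "sources": sorted({m.source for m in members}),
        })
    return consolidated


# Toy input: three sources agree on one stop but not on the start site.
example = [
    Prediction("reference_A", "+", 100, 400),
    Prediction("ab_initio", "+", 130, 400),
    Prediction("in_silico", "+", 100, 400),
    Prediction("in_silico", "-", 900, 750),
]
result = consolidate(example)
```

In this toy input, four predictions collapse into two entries; the first keeps both candidate start sites (100 and 130) together with the three sources that support it. A real search database would additionally have to encode the alternative start sites as protein sequences, which this sketch leaves out.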
Today, I will illustrate the successful application of our solution, which is available as a public web server (https://iptgxdb.expasy.org), to some highly relevant use cases, e.g. to find novel SEPs not only in single bacteria but also in synthetic model communities [4] and clinical strains. By integrating data from top-down [5] and PASEF mass spectrometry approaches, we can increase the number of novel SEPs identified. Importantly, we show that incorporating top Ribo-seq candidates can help to minimize the search database size and thereby improve detection of known and novel SEPs [6]. Finally, we adapt the basis of our proteogenomics approach to address the much more common issue (and unmet need of the research community) of integrating and consolidating differing genome annotations in the context of experimental data. Such solutions promise to enable research communities that work with different model organisms to unlock the full potential of functional genomics technologies and to capitalize on the value of genome sequence data.
References
[1]. Omasits U, et al. Genome Res 2017, 27: 2083-2095.
[2]. Storz G, et al. Annu Rev Biochem 2014, 83: 753-777.
[3]. Orr MW, et al. Nucleic Acids Res 2020, 48: 1029-1042.
[4]. Petruschke H, et al. Microbiome 2021, 9: 55.
[5]. Meier-Credo J, et al. Anal Chem 2023, 95: 11892-11900.
[6]. Hadjeras L, et al. Microlife 2023, 4: uqad012.