Benjamin Heiniger (Zurich / CH), Christian Schori (Zurich / CH), Mohammad Arefian (Belfast / GB), Amir Banaei-Esfahani (Zurich / CH), Martin Schuler (Zurich / CH), Sonia Borell (Basel / CH), Daniela Brites (Basel / CH), Ruedi Aebersold (Zurich / CH), Sébastien Gagneux (Basel / CH), Ben Collins (Belfast / GB), Christian Ahrens (Zurich / CH)
Objectives: Accurate and comprehensive prediction of all protein-coding genes remains an unresolved problem. Genes encoding small proteins are often missed in genome annotations, even in well-studied bacterial model organisms such as Escherichia coli [1] and Pseudomonas aeruginosa [2]. Yet these small proteins can carry out many important functions [3]. Using proteogenomics, we aim to identify protein-coding genes missed by conventional genome annotation methods in six Mycobacterium tuberculosis clinical reference strains from lineages 1 and 2. M. tuberculosis causes tuberculosis, one of the leading bacterial infectious diseases worldwide, with the modern lineage 2 being especially virulent.
Methods: We de novo assembled complete genomes of six clinical reference strains from long-read PacBio data. By hierarchically integrating reference annotations, ab initio gene predictions and a modified six-frame translation that considers alternative start codons, we created large but minimally redundant integrated proteogenomics search databases (iPtgxDBs) [4], in which ~95% of peptides unambiguously identify one protein [5]. Total cell extracts were analyzed by Parallel Accumulation-Serial Fragmentation (PASEF) mass spectrometry (MS). After strict false discovery rate (FDR) filtering and validation, we prioritized novel small proteins based on functional predictions and conservation.
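To illustrate the idea behind a six-frame translation that considers alternative start codons, the following is a minimal Python sketch and not the iPtgxDB implementation. It assumes the bacterial start codons ATG, GTG and TTG, translates any start codon as Met, reports one candidate per open stretch using the most upstream start, and skips stretches that run off the end of the sequence; the function names and the `min_aa` cutoff are hypothetical.

```python
from itertools import product

# Standard codon table in NCBI TCAG order; amino acid assignments match
# bacterial translation table 11.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Assumption: canonical start plus the two common bacterial alternatives.
START_CODONS = {"ATG", "GTG", "TTG"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_cds(cds):
    """Translate a coding sequence; the start codon is always read as Met."""
    body = (CODON_TABLE.get(cds[j:j + 3], "X") for j in range(3, len(cds), 3))
    return "M" + "".join(body)


def six_frame_orfs(genome, min_aa=10):
    """Return (strand, frame, start_index, protein) for all six frames.

    start_index is relative to the scanned strand. Stretches without a
    terminating stop codon are not reported in this sketch.
    """
    genome = genome.upper()
    results = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            start = None  # most upstream start codon since the last stop
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if CODON_TABLE.get(codon, "X") == "*":
                    if start is not None:
                        protein = translate_cds(seq[start:i])
                        if len(protein) >= min_aa:
                            results.append((strand, frame, start, protein))
                    start = None
                elif start is None and codon in START_CODONS:
                    start = i
    return results


if __name__ == "__main__":
    # Toy sequence with a GTG-initiated ORF in frame 2 of the forward strand.
    toy = "CC" + "GTG" + "AAA" * 3 + "TAA" + "G"
    for hit in six_frame_orfs(toy, min_aa=3):
        print(hit)
```

Choosing the most upstream start gives the longest candidate per open stretch, which keeps such a database small; a real proteogenomics database would also need to track shorter alternative start sites and map coordinates back to the genome.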
Results: The assembly of complete genomes allowed us to overcome the drawbacks of fragmented, short-read-based assemblies, which can miss even essential genes [2]. Comparative genomics identified a large core genome (94% of genes), reflecting the lack of horizontal gene transfer in Mycobacterium. Proteogenomics allowed us to identify 35 novel proteins, which were enriched in proteins shorter than 100 amino acids and had a significantly higher average pI than annotated proteins. Interestingly, the novel candidates included strain- and lineage-specific proteins and three candidates with a predicted function in toxin-antitoxin systems.
Conclusion: We successfully applied our proteogenomics framework for prokaryotes, which is available as a public web server (https://iptgxdb.expasy.org) [4], to six clinical reference strains of a major human pathogen [7]. By leveraging state-of-the-art tandem MS, we were able to identify novel small proteins with potential roles in pathogenicity without the need for sub-cellular fractionation.
References
[1]. Storz G et al. EcoSal Plus 2020, 9:10.
[2]. Varadarajan AR et al. NPJ Biofilms & Microbiomes 2020, 6:46.
[3]. Storz G et al. Annu Rev Biochem 2014, 83:753-777.
[4]. Omasits U et al. Genome Res 2017, 27:2083-2095.
[5]. Qeli E & Ahrens CH. Nat Biotechnol 2010, 28:647-650.
[7]. Heiniger B et al. (in preparation).