Christina Ludwig (Freising / DE), Miriam Abele (Freising / DE), Armin Soleymaniniya (Freising / DE), Florian P. Bayer (Freising / DE), Nina Lomp (Freising / DE), Etienne Doll (Freising / DE), Chen Meng (Freising / DE), Klaus Neuhaus (Freising / DE), Siegfried Scherer (Freising / DE), Mareike Wenning (Oberschleissheim / DE), Nina Wantia (Munich / DE), Bernhard Küster (Freising / DE), Mathias Wilhelm (Freising / DE)
Despite the vast diversity of the prokaryotic domain of life, relatively few bacterial proteomes have been comprehensively characterized by mass spectrometry so far. Here, we present the largest proteomic resource for bacteria to date, encompassing 305 species, 119 genera, and five phyla. This resource provides mass spectrometric evidence for over 635,000 expressed bacterial proteins and confirms the existence of over 35,000 hypothetical proteins. The complete dataset is available in the public resource ProteomicsDB (https://www.proteomicsdb.org/), enabling quantitative exploration of all detected proteins within a species as well as across different species. To facilitate cross-species analyses, we grouped bacterial proteins into orthogroups using the tool OrthoFinder, which infers homology relationships between proteins from their sequence similarity.
This comprehensive dataset facilitated the development of a new bacterial identification algorithm called PyTaxon. In a first iteration, PyTaxon employs the rapid data analysis tool MSFragger to query the complete species-level proteome space deposited in NCBI (n = 13,855 species). A second iteration can then refine the results to the strain level by restricting the search to the taxonomic region identified in the first iteration. Using the dataset of the 305 bacterial species as a benchmark, we demonstrate that PyTaxon achieves identification accuracies of >99% at the species level. When applied to a dataset comprising 94 strains of the genus Pseudomonas and 29 strains of the Bacillus cereus s.l. group, PyTaxon attains unprecedented identification accuracies of >89% at the strain level.
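The two-iteration idea described above can be illustrated with a minimal sketch. This is not PyTaxon's implementation: it assumes that peptide identifications have already been mapped to taxa (in practice this would come from the database search), and it uses a simple peptide count as the score. The function names, the scoring rule, and the toy data are all hypothetical.

```python
from collections import Counter


def rank_taxa(peptide_hits, candidates):
    """Count identified peptides per candidate taxon, highest count first.

    peptide_hits maps a peptide sequence to the taxa whose proteomes contain it.
    Only taxa listed in `candidates` are scored. (Hypothetical scoring rule.)
    """
    counts = Counter()
    for taxa in peptide_hits.values():
        for taxon in taxa:
            if taxon in candidates:
                counts[taxon] += 1
    return counts.most_common()


def two_pass_identify(species_hits, strain_hits, strains_by_species):
    """Pass 1 picks the best species; pass 2 scores only that species' strains."""
    species_ranking = rank_taxa(species_hits, set(strains_by_species))
    if not species_ranking:
        return None, None
    best_species = species_ranking[0][0]

    # Restrict the second pass to the taxonomic region found in the first pass.
    strain_candidates = set(strains_by_species[best_species])
    strain_ranking = rank_taxa(strain_hits, strain_candidates)
    best_strain = strain_ranking[0][0] if strain_ranking else None
    return best_species, best_strain


# Toy data (invented for illustration only).
strains_by_species = {
    "Bacillus cereus": ["Bc-1", "Bc-2"],
    "Bacillus subtilis": ["Bs-1"],
    "Pseudomonas fluorescens": ["Pf-1"],
}
species_hits = {
    "PEPA": ["Bacillus cereus", "Bacillus subtilis"],
    "PEPB": ["Bacillus cereus"],
    "PEPC": ["Pseudomonas fluorescens"],
}
strain_hits = {
    "PEPA": ["Bc-1", "Bc-2"],
    "PEPD": ["Bc-2"],
    "PEPE": ["Pf-1"],
}

best_species, best_strain = two_pass_identify(
    species_hits, strain_hits, strains_by_species
)
print(best_species, best_strain)
```

Narrowing the candidate set before the second pass keeps the strain-level comparison small, which is the motivation for a staged search over a very large species-level proteome space.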
Finally, we evaluated PyTaxon's performance as a tool for food safety and clinical diagnostics. When applied to 55 pure bacterial isolates from dairy products, PyTaxon provided more accurate results than MALDI-TOF or FTIR, with identifications that were 94% consistent with 16S rRNA gene sequencing. In 570 clinically derived bacterial isolates, PyTaxon successfully identified 64 different bacterial species from 20 genera and 3 phyla. Additionally, we explored the acquired data for the expression of proteins responsible for antibiotic resistance. Beta-lactamases, including extended-spectrum beta-lactamases (ESBL), were detected in one in four clinical samples. This alarming finding underscores the urgency of rapid antimicrobial resistance screening to enable informed antibiotic treatment decisions and, ideally, to mitigate further acquisition of (multiple) antibiotic resistance.
In conclusion, our work establishes a comprehensive proteomic resource detailing protein expression profiles across the bacterial domain of life. This resource facilitated the development of an advanced bacterial identification algorithm and provided insights into the prevalence of antimicrobial resistance markers in clinical samples.