Armin Soleymaniniya (Freising / DE), Miriam Abele (Freising / DE), Sarah Brajkovic (Freising / DE), Christina Ludwig (Freising / DE), Bernhard Küster (Freising / DE), Mathias Wilhelm (Freising / DE)
Introduction
ProteomicsDB has evolved over the years into a multi-omics and multi-organism resource for life sciences research. It encompasses more than 198 projects totaling over 29,000 LC-MS/MS experiments and provides access to all of this data in accordance with the FAIR principles. The latest developments focus on cross-species analysis at the protein level. Because proteins with common ancestry tend to retain similarities despite evolutionary divergence, homology analysis allows researchers to infer the functions of uncharacterized proteins in less-studied organisms. In addition to enabling the transfer of knowledge from well-studied organisms to less-studied ones, this approach can reveal traits that are enriched in specific phylogenetic lineages.
Methods
To increase the number of covered organisms, we integrated baseline proteomics data of 305 bacterial species into ProteomicsDB. The data was searched with MaxQuant at 100% FDR (i.e., without FDR filtering) and rescored with Prosit via the Oktoberfest pipeline. The false discovery rate was then controlled at 1% using the picked-group FDR algorithm. Protein abundances were estimated by label-free quantification using the iBAQ method. Homology was inferred with OrthoFinder, which assigns proteins to "orthogroups" that capture the orthology and paralogy relationships among them. Protein sequences were functionally annotated using InterPro.
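As an illustration of the iBAQ idea mentioned above, the following minimal Python sketch divides the summed peptide intensities of a protein by its number of theoretically observable peptides. The digestion rule (cleavage after every K or R, no missed cleavages, peptide length 6 to 30), the function names, and the toy sequence and intensities are assumptions made for this example; they are not the configuration used in this work.

```python
import re


def tryptic_peptides(sequence, min_len=6, max_len=30):
    """Simplified in silico digest: cleave after every K or R, no missed cleavages.

    The length window (6-30) is an assumed definition of "observable" peptides.
    """
    pieces = re.findall(r"[^KR]*[KR]|[^KR]+$", sequence)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def ibaq(sequence, peptide_intensities):
    """Summed peptide intensity divided by the number of observable peptides."""
    n_observable = len(tryptic_peptides(sequence))
    if n_observable == 0:
        return float("nan")  # no observable peptides, so iBAQ is undefined
    return sum(peptide_intensities) / n_observable


# Hypothetical toy protein and intensities, for illustration only.
toy_sequence = "MAAAAAKCCCCCCCRDDK"
toy_intensities = [100.0, 300.0, 200.0]
print(tryptic_peptides(toy_sequence))
print(ibaq(toy_sequence, toy_intensities))
```

Normalizing by the number of observable peptides makes intensities more comparable between proteins of different lengths, which is why the measure is used here as an abundance estimate.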
Results
We integrated data from the "Proteomics landscape across the bacterial kingdom of life", an atlas encompassing the baseline proteomes of 305 bacterial species. To date, this is the most comprehensive collection of prokaryote draft proteomes, comprising 14,140,252 PSMs and 7,005,416 peptides identified at 1% FDR. Rescoring the search results with Prosit increased the number of identified PSMs by 11.89% and the number of identified peptides by 9.69% on average. The data covers, on average, 50% of the theoretical proteome of each studied species.
Homology analysis with OrthoFinder on the theoretical proteomes assigned 1,131,620 proteins (96.9% of all proteins) to 34,979 orthogroups with a median size of four proteins. Among these, only 99 orthogroups cover all of the studied organisms, and 26 contain only one protein per species. We analyzed the inferred orthogroups for enrichment at specific phylogenetic levels, which can serve as a proxy for lineage-specific functions in bacteria. Furthermore, the GO terms retrieved from the functional annotation were tested for enrichment at the orthogroup level to provide insights into the functions of protein groups. We developed a new interactive visualization that enables users to compare the expression of proteins within one orthogroup across the participating species in a phylogeny-aware fashion. In addition, for single-celled organisms, we replaced the anatomograms on the protein expression page with a histogram of protein expression at the species level.
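The orthogroup summary statistics reported above (median size, orthogroups covering all species, and single-copy orthogroups) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input layout (orthogroup ID mapped to a list of species and protein pairs), the function name `summarize_orthogroups`, and the toy data are hypothetical, and single-copy orthogroups are read here as those containing exactly one protein from every species.

```python
from collections import Counter
from statistics import median


def summarize_orthogroups(orthogroups, all_species):
    """Summarize orthogroups given as {og_id: [(species, protein_id), ...]}."""
    species_set = set(all_species)
    sizes = [len(members) for members in orthogroups.values()]
    universal, single_copy = [], []
    for og_id, members in orthogroups.items():
        per_species = Counter(species for species, _protein in members)
        if species_set <= set(per_species):
            universal.append(og_id)  # every species is represented
            if all(per_species[s] == 1 for s in species_set):
                single_copy.append(og_id)  # exactly one protein per species
    return {
        "median_size": median(sizes),
        "universal": universal,
        "single_copy": single_copy,
    }


# Hypothetical toy input with three species, for illustration only.
toy_species = ["A", "B", "C"]
toy_orthogroups = {
    "OG1": [("A", "a1"), ("B", "b1"), ("C", "c1")],
    "OG2": [("A", "a2"), ("A", "a3"), ("B", "b2"), ("C", "c2")],
    "OG3": [("A", "a4"), ("B", "b3")],
}
print(summarize_orthogroups(toy_orthogroups, toy_species))
```

In the toy input, OG2 covers all species but contains a paralog pair in species A, so it counts as universal but not as single-copy.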