Mass spectra are the starting point and the basic unit of shotgun proteomics data. However, spectra exhibit substantial noise and heterogeneity, which limits the efficiency of current spectrum analysis algorithms. Large-scale pretrained models have introduced a new paradigm in artificial intelligence: by pretraining on extensive datasets, these models develop a foundational understanding of the data, leading to enhanced performance across downstream tasks. Consequently, developing a foundation model for spectrum data that decodes the basic "words" of proteins and advances proteomics data analysis is challenging yet highly promising.
Here, we introduce the first pretrained model in the proteomics field, pi-SPECFormer, designed to generate informative spectrum embeddings for a variety of spectrum analysis tasks. During pretraining, pi-SPECFormer used a dictionary lookup task to identify analogous spectrum pairs in the representation space, thereby learning the peak distribution patterns within spectra. In downstream applications, pi-SPECFormer was applied to de novo sequencing, achieving state-of-the-art performance on the widely used nine-species benchmark dataset and surpassing the leading algorithms Casanovo (Yilmaz, M. et al.) and PointNovo (Qiao, R. et al.) by up to 14% and 28% in relative peptide recall, respectively. In the spectrum clustering task, pi-SPECFormer achieved a 99% relative error reduction while maintaining a consistent degree of clustering compared to the established spectrum embedder GLEAMS (Bittremieux, W. et al.). In the PTM detection task, pi-SPECFormer accurately identified the characteristics of various PTMs, outperforming the leading algorithm AHLF (Altenburg, T. et al.) by 10-22% in relative accuracy for phosphorylation detection. Furthermore, it achieved over 95% accuracy across all 21 PTM types in a synthesized dataset. These benchmark studies demonstrate that pi-SPECFormer is a pioneering foundation model for MS data analysis. Our work highlights the value of pretrained models for MS data, exploring and pushing the boundaries of foundation models in the proteomics field.