Zhiwei Liu (Hangzhou / CN), Pu Liu (Hangzhou / CN), Zongxiang Nie (Hangzhou / CN), Yingying Sun (Hangzhou / CN), Yuqi Zhang (Hangzhou / CN), Yi Chen (Hangzhou / CN), Tiannan Guo (Hangzhou / CN)
Analyzing data-independent acquisition mass spectrometry (DIA-MS) proteomics data remains challenging. DIA identification tools such as DIA-NN typically take a multi-stage approach: they first score candidate peak groups and then refine the selection through iterative filtering. Each iteration draws on a larger set of features, which increases the discriminative power of the scoring model and improves selection accuracy. Existing tools nevertheless have limitations. They train an independent scoring model for each mass spectrometry file, and because each file provides only a limited number of training samples, they usually rely on conventional machine learning models of limited capacity to avoid overfitting. As a result, they often fail to capture complex peptide-peak-group matching patterns. In addition, current methods rely on manually engineered features and separate feature learning from classifier training, which hinders optimization and limits overall analytical performance.
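The multi-stage idea described above can be illustrated with a minimal sketch. It is not the DIA-NN or MS-BERT implementation: the data layout, the linear scorer, the function names, and the example values are all assumptions made for illustration. Each stage scores the surviving candidate peak groups with more features than the previous stage and keeps only the top-ranked ones.

```python
from typing import Dict, List, Sequence

# Hypothetical layout: precursor id -> candidate peak groups,
# each candidate represented by a feature vector.
Candidates = Dict[str, List[List[float]]]


def score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Linear score over the first len(weights) features (illustrative scorer)."""
    return sum(w * f for w, f in zip(weights, features))


def multi_stage_select(
    candidates: Candidates,
    stage_weights: List[List[float]],
    keep_per_stage: List[int],
) -> Dict[str, List[float]]:
    """Narrow each precursor's candidates stage by stage and return the best one."""
    best: Dict[str, List[float]] = {}
    for precursor, groups in candidates.items():
        survivors = list(groups)
        for weights, keep in zip(stage_weights, keep_per_stage):
            # Later stages pass longer weight vectors, so more features are used.
            survivors.sort(key=lambda g: score(g, weights), reverse=True)
            survivors = survivors[:keep]
        if survivors:
            best[precursor] = survivors[0]
    return best


# Invented example: one precursor with three candidate peak groups.
example = {
    "PEPTIDEK_2": [
        [0.9, 0.1, 0.2],
        [0.8, 0.9, 0.7],
        [0.2, 0.9, 0.9],
    ]
}
result = multi_stage_select(
    example,
    stage_weights=[[1.0], [1.0, 1.0, 1.0]],  # stage 1: one feature; stage 2: three
    keep_per_stage=[2, 1],
)
```

In this toy input the first stage discards the third candidate on a single feature, and the second stage, using all three features, prefers the second candidate over the first.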
BERT (Bidirectional Encoder Representations from Transformers) is a landmark model in natural language processing (NLP). Here we introduce MS-BERT, a transformer-based pre-trained AI model for analyzing DIA-MS proteomics data, together with a DIA identification algorithm built on an encoder-only transformer model. The initial-stage and subsequent-stage models are trained on large datasets of 233 million and 7.4 million training instances, respectively, drawn from 600 mass spectrometry files. The breadth of these datasets allows the models to generalize well, enabling accurate identification without separate training on test files. The data modeling capacity of convolutional and transformer networks enables the models to capture complex peptide-peak-group matching patterns.
We acquired a new human proteome dataset, comprising three samples each of human tissue and human serum, for performance evaluation. Compared with the widely used DIA-NN software, MS-BERT identifies over 4,000 human serum proteins, corresponding to improvements of 45% in protein identification depth and 22% in precursor coverage within human proteomes. For quantification assessment, we used a three-species dataset (human, yeast, and C. elegans) from the PRIDE database. MS-BERT achieves quantification performance on par with DIA-NN, with a Pearson's correlation coefficient of 0.91 for protein quantification.
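For readers unfamiliar with the metric, Pearson's correlation coefficient can be computed as in the sketch below. The abstract does not state which two sets of values were correlated or whether intensities were log-transformed, so the function name, the log2 transform, and the example intensities are assumptions for illustration only and do not reproduce the reported result.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented protein intensities from two hypothetical quantification runs.
run_a = [1.2e6, 3.4e5, 8.9e6, 2.1e4, 5.5e5]
run_b = [1.0e6, 4.0e5, 7.5e6, 3.0e4, 6.1e5]

# Assumption: compare on a log2 scale, a common choice for MS intensities.
log_a = [math.log2(v) for v in run_a]
log_b = [math.log2(v) for v in run_b]
r = pearson_r(log_a, log_b)
```

A value of `r` close to 1 indicates that the two sets of protein quantities rise and fall together almost linearly.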
This study introduces a robust new software tool for DIA proteomic identification and quantification and highlights the substantial potential of pre-trained models and synthetic datasets for advancing DIA proteomics analysis.