
Poster

P-II-0456

Tesorai Search: Large pretrained model boosts peptide identifications without the need for Percolator

Contribution in

New Technology: AI and Bioinformatics in Mass Spectrometry

Poster topics

New Technology: AI and Bioinformatics in Mass Spectrometry

Contributors

Peter Cimermancic (San Diego, CA / US), Maximilen Burq (San Diego, CA / US), Bryan Crampton (San Diego, CA / US), Dejan Stepec (San Diego, CA / US), Juan Restrepo (Martinsried / DE), Shivani Tiwary (San Diego, CA / US), Ioana Clotea (New York, NY / US), Jure Zbontar (San Diego, CA / US), Shamil Urazbakhtin (Martinsried / DE), Beatrix Ueberheide (New York, NY / US), Jürgen Cox (Martinsried / DE)

Abstract

Current search algorithms for mass spectrometry proteomics can fail to identify up to 75% of spectra. To mitigate this, machine learning (ML) has become a staple of modern database search engines. Second-generation software such as Percolator, PeptideProphet, Prosit, AlphaPeptDeep, MSBooster, Peaks, Chimerys and others shows impressive increases in the number of peptides and proteins identified compared to first-generation tools (e.g., Sequest, X!Tandem, Comet, Andromeda, MS-GF+, and Sage). However, recent evidence has shown that the increased identifications come at the cost of a significant underestimation of the false discovery rate (FDR). The same evidence indicates that this is inherent to how these tools leverage ML: a new model is re-trained to separate targets from decoys for every run. This approach stands in sharp contrast to the recent trend in other fields of ML, where large models are trained only once and then applied to a wide range of modalities and use cases.
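For readers less familiar with the FDR estimate discussed above, the following is a minimal sketch of conventional target-decoy counting, in which the FDR at a score threshold is approximated by the ratio of decoy to target matches above that threshold. It is not the Tesorai method or the Percolator implementation. The function names, the `(score, is_decoy)` input layout, and the simple `decoys / targets` estimator are assumptions made for illustration, and tied scores are not given special handling.

```python
from typing import List, Tuple

# Hypothetical input layout: one (score, is_decoy) pair per PSM.
Psm = Tuple[float, bool]


def target_decoy_qvalues(psms: List[Psm]) -> List[Tuple[float, bool, float]]:
    """Return (score, is_decoy, q_value) tuples sorted by descending score."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Estimated FDR at each rank: decoys seen so far / targets seen so far.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest estimated FDR at this rank or any lower-scoring rank.
    qvalues = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qvalues)]


def count_targets_at_fdr(psms: List[Psm], threshold: float = 0.01) -> int:
    """Count target PSMs whose q-value is at or below the threshold."""
    return sum(
        1
        for _, is_decoy, q in target_decoy_qvalues(psms)
        if not is_decoy and q <= threshold
    )


if __name__ == "__main__":
    toy = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
    for row in target_decoy_qvalues(toy):
        print(row)
    print(count_targets_at_fdr(toy, threshold=0.01))
```

This estimate is only reliable when the scores are not themselves tuned against the decoys being counted. The abstract's concern about per-run re-training to separate targets from decoys relates to this condition.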

We hypothesized that training a single large peptide-spectrum matching (PSM) model could improve upon the gains from Percolator-like tools while solving the FDR control issue. To that end, we built a novel approach to peptide-spectrum matching based on a pre-trained large deep learning model, which does not use decoys during training and does not require training a new model for every new sample. We trained the Tesorai model on over 100M real peptide-spectrum pairs and demonstrate that the approach performs robustly across a wide range of use cases, including standard trypsin-digested samples, metaproteomics, single-cell, and isobaric-labeled samples.

In addition to providing robust FDR control, our method consistently increases identifications compared to state-of-the-art approaches. For example, in immunopeptidomics, our approach yields up to a 90% increase in peptide identifications compared to MSFragger + Percolator. We are in the process of validating the new peptide identifications in a blinded reader study by human experts. The preliminary results indicate strong performance of the model, with its unique PSMs aligning with expert reviews better than PSMs unique to other search algorithms.

With this significant improvement in immunopeptidome coverage, we next set out to analyze the landscape of HLA-I-bound peptide sequences by re-analyzing several publicly available datasets. Interestingly, we find that our model enriches for non-tryptic peptides, suggesting trypsin-digest biases in current software and offering new insights into the immunopeptidome. Finally, to facilitate new discoveries in MS proteomics more broadly, we incorporated our model into a cloud-native, end-to-end system that can process 500 samples in less than an hour, and made it publicly available at www.tesorai.com.