Poster

  • P-II-0447

Zero-shot retention time prediction for unseen post-translational modifications with molecular structure encodings

Contribution in

New Technology: AI and Bioinformatics in Mass Spectrometry

Contributors

Ceder Dens (Antwerp / BE), Darien Yeung (Winnipeg / CA), Oleg Krokhin (Winnipeg / CA), Kris Laukens (Antwerp / BE), Wout Bittremieux (Antwerp / BE)

Abstract

Introduction

Mass spectrometry-based proteomics has advanced our understanding of cellular processes and disease mechanisms and supports drug target screening. However, identifying proteoforms carrying diverse post-translational modifications (PTMs) remains challenging. PTMs regulate protein activity and interactions and influence protein stability and localization. Our limited understanding of the myriad PTMs and their influence on experimental properties, such as liquid chromatography (LC) behavior, presents a significant hurdle. Prediction of peptide properties such as retention time (RT) has improved peptide identification, but primarily for unmodified peptides or peptides with a limited diversity of modifications.

Methods

We present the Molecular Structure Transformer (MST), a transformer-based RT prediction model for peptides carrying any PTM. Our novel input representation encodes each amino acid and its PTM as a molecular structure graph, which is processed by a molecule encoder to obtain a separate embedding for each residue. A transformer encoder is then trained to predict the RT from these embeddings, enabling accurate predictions even for peptides containing PTMs not seen during training.
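
As a rough illustration of this input representation, the Python sketch below encodes one modified residue from its molecular structure and feeds stacked residue embeddings to a transformer regressor. All names are ours, and a Morgan fingerprint stands in for the learned molecule-graph encoder; this is a minimal sketch, not the authors' implementation.

    # Illustrative sketch of the MST idea (PyTorch + RDKit); a Morgan
    # fingerprint replaces the learned molecule encoder for brevity.
    import torch
    import torch.nn as nn
    from rdkit import Chem
    from rdkit.Chem import AllChem

    def residue_embedding(smiles: str, n_bits: int = 256) -> torch.Tensor:
        """Encode one (modified) amino acid as a fixed-size vector."""
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
        return torch.tensor(list(fp), dtype=torch.float32)

    class MSTSketch(nn.Module):
        def __init__(self, d_model: int = 256, n_layers: int = 4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.head = nn.Linear(d_model, 1)  # regression head for RT

        def forward(self, residues: torch.Tensor) -> torch.Tensor:
            # residues: (batch, seq_len, d_model) stacked residue embeddings
            h = self.encoder(residues)
            return self.head(h.mean(dim=1)).squeeze(-1)  # pooled -> scalar RT

    # Phosphoserine never needs its own token; its structure is enough:
    emb = residue_embedding("C(C(C(=O)O)N)OP(=O)(O)O")
    residues = torch.stack([emb, emb, emb]).unsqueeze(0)  # (1, 3, 256)
    rt = MSTSketch()(residues)

Because every residue, modified or not, is mapped through its molecular structure, an unseen PTM still yields a meaningful embedding rather than an out-of-vocabulary token.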

To enhance accuracy, we introduce a second model that employs a two-step strategy. First, a regular transformer encoder predicts the RT of the unmodified sequence. Then, the MST predicts the RT shift induced by the modifications. This strategy leverages the high prediction accuracy for unmodified peptides and the superior input representation for modified peptides.
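
A minimal sketch of this two-step strategy, assuming a bracketed PTM notation and two already-trained callable models (all names here are hypothetical, not the authors' API):

    import re

    def strip_modifications(peptide: str) -> str:
        """Drop bracketed PTM annotations, e.g. 'PEPS[Phospho]IDE' -> 'PEPSIDE'."""
        return re.sub(r"\[[^\]]*\]", "", peptide)

    def predict_rt_two_step(peptide: str, base_model, mst_shift_model) -> float:
        rt_base = base_model(strip_modifications(peptide))   # step 1: unmodified RT
        rt_shift = mst_shift_model(peptide)                  # step 2: PTM-induced shift
        return rt_base + rt_shift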

Results

DeepLC, our MST, and our two-step MST were trained on the Chronologer dataset, which contains RTs for 1.3M unmodified and 913K modified peptides covering 9 different PTMs. The models were then tested on a collection of external datasets from various experiments, including ProteomeTools. The test dataset contains 70K peptides with 16 PTMs that were not included during training; each test peptide contains at least one of these PTMs. We compare the models' performance using the absolute error (AE) of the predicted iRT. Over the complete test dataset, DeepLC achieves a mean AE (MAE) ± standard deviation of 24.07 ± 13.44, our MST performs worse at 26.71 ± 23.19, and our two-step model performs significantly better at 12.83 ± 11.83. As a second metric, we calculate the macro-average PTM MAE, which averages the MAE per unseen PTM. DeepLC achieves 12.54 ± 9.17, our MST 12.18 ± 9.71, and the two-step model again outperforms both with 10.57 ± 6.57. This marks a substantial improvement over previous approaches and constitutes state-of-the-art RT prediction performance on peptides containing previously unseen modifications.
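
For reference, a short sketch of the macro-average PTM MAE as we read it from the text: the MAE is computed separately for each unseen PTM, and those per-PTM means are averaged so that rare modifications weigh as much as frequent ones (function and variable names are ours):

    import numpy as np

    def macro_ptm_mae(abs_errors: np.ndarray, ptm_labels: np.ndarray) -> float:
        """Mean of per-PTM MAEs; each unseen PTM contributes equally."""
        per_ptm = [abs_errors[ptm_labels == p].mean() for p in np.unique(ptm_labels)]
        return float(np.mean(per_ptm))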
