Tobias Schmidt (Garching / DE), Siegfried Gessulat (Garching / DE), Michael Graber (Garching / DE), Vishal Sukumar (Garching / DE), Markus Schneider (Garching / DE), Alexander Hogrebe (Garching / DE), Samia Ben Fredj (Garching / DE), Daniel Zolg (Garching / DE), Martin Frejno (Garching / DE)
Background: Synthetic proteomics datasets such as ProteomeTools are invaluable for machine learning but finite, especially regarding spectra of post-translationally modified peptides. In contrast, public repositories abound with data, yet harmonizing these data is challenging because of diverse measurement environments, methods, software, and calibrations, and the absence of shared QC peptides. To address this, we developed a distributed, standardized, and cost-effective workflow for searching and harmonizing public data with minimal manual intervention. Using a novel machine learning-based calibration, we processed over 3,000 files containing ubiquitinated, phosphorylated, and acetylated peptides. This resulted in ~1 million modified peptides from >33 million high-quality PSMs, complementing synthetic datasets such as ProteomeTools.
Methods: The workflow takes a PRIDE or MassIVE identifier and a FASTA file as input and produces a standardized PSM table with optional collision energy calibration and fragment ion annotation. Results are stored as Parquet files on S3, and processing runs on a 30-node Kubernetes cluster. The workflow is fully automated, using ppx for spectrum file download, msconvert for file conversion, and Sage for database search, complemented by custom algorithms for calibration and annotation. A web interface displays QC metrics to aid project selection. Collision energy is calibrated by generating calibration curves from which per-file offsets are determined, and retention times are normalized to enable transfer learning.
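To illustrate the per-project processing step, the following Python sketch chains ppx, msconvert, and Sage as described above. Only the three tools and their roles are taken from the text; the process_project() wrapper, directory layout, and the sage.json config file are illustrative assumptions.

```python
# Sketch of the per-project pipeline step (download -> convert -> search).
# Assumptions: process_project(), the directory layout, and sage.json are
# illustrative; ppx, msconvert, and Sage are the tools named in Methods.
import subprocess
from pathlib import Path

import ppx  # PRIDE/MassIVE download client


def process_project(identifier: str, fasta: Path, workdir: Path) -> Path:
    """Fetch one repository project, convert RAW files, and search with Sage."""
    proj = ppx.find_project(identifier, local=workdir / identifier)
    raw_files = [f for f in proj.remote_files() if f.lower().endswith(".raw")]
    downloaded = proj.download(raw_files)  # returns local Paths

    mzml_dir = workdir / identifier / "mzml"
    mzml_dir.mkdir(parents=True, exist_ok=True)
    for raw in downloaded:
        # msconvert (ProteoWizard) converts vendor RAW files to mzML
        subprocess.run(["msconvert", str(raw), "--mzML", "-o", str(mzml_dir)],
                       check=True)

    search_dir = workdir / identifier / "search"
    # Sage takes a JSON config (tolerances, modifications, ...) plus the FASTA
    # and mzML paths, and writes a PSM table (results.sage.tsv) to -o.
    subprocess.run(
        ["sage", "-f", str(fasta), "-o", str(search_dir), "sage.json",
         *map(str, sorted(mzml_dir.glob("*.mzML")))],
        check=True,
    )
    return search_dir / "results.sage.tsv"
```

From here, the PSM table can be standardized and written as a Parquet file on S3, e.g. with pandas and s3fs: pd.read_csv(tsv, sep="\t").to_parquet("s3://<bucket>/<project>.parquet").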
Results: The goal is to unlock publicly available proteomics data for machine learning. Processing a spectrum file costs <$0.10 and takes 36 seconds on average, corresponding to a throughput of ~100 files per hour. A unique feature is the model-based collision energy calibration, which yields high correlations (>0.85) comparable to QC-peptide-based methods. We processed 60 PRIDE projects comprising >1,000 files containing ubiquitinated peptides from human samples, yielding >400,000 unique peptides and >10 million PSMs. Processing additional species is expected to increase this yield by ~75%. In addition, we processed the Atlantic NCI60 dataset containing phosphorylated peptides from ~500 files and identified >200,000 unique peptides and ~6 million PSMs. Similarly, searching two ProteomeTools pools for N-terminally acetylated peptides (an FMOC-synthesis artifact) identified ~350,000 unique peptides and ~16 million PSMs. Models trained on the harmonized dataset achieved high correlations (>0.85) on modified peptides, surpassing training without calibration. In summary, our fast, cost-effective workflow produces a high-quality dataset that complements synthetic datasets for training peptide fragmentation prediction models.
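The model-based collision energy calibration can be pictured as follows: predict fragment intensities for identified peptides over a grid of CE offsets, score each offset against the observed spectra, and take the maximum of the resulting calibration curve. The sketch below assumes hypothetical predict_intensities() and spectral_angle() helpers standing in for the intensity model and similarity metric; only the curve-then-offset idea comes from the text.

```python
# Sketch of model-based CE calibration: choose the offset whose predicted
# spectra best match the observed ones. predict_intensities() and
# spectral_angle() are hypothetical stand-ins; the actual implementation
# in the workflow may differ in detail.
import numpy as np


def calibrate_ce(peptides, observed, nominal_ce,
                 predict_intensities, spectral_angle,
                 offsets=np.arange(-10.0, 10.5, 0.5)):
    """Return the CE offset that maximizes median predicted-observed similarity."""
    curve = []
    for offset in offsets:
        predicted = predict_intensities(peptides, nominal_ce + offset)
        sims = [spectral_angle(obs, pred)
                for obs, pred in zip(observed, predicted)]
        curve.append(np.median(sims))
    # "curve" is the calibration curve (median similarity vs. offset);
    # its argmax is the per-file offset applied during harmonization.
    return float(offsets[int(np.argmax(curve))])
```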
Conclusion: We present a hands-off, inexpensive workflow that searches, annotates, and calibrates arbitrary public proteomics datasets and integrates them into harmonized machine learning datasets.