Dominik Lux (Bochum / DE), Julian Uszkoreit (Bochum / DE), Brit Mollenhauer (Goettingen / DE; Kassel / DE), Katalin Barkovits-Boeddinghaus (Bochum / DE), Martin Eisenacher (Bochum / DE), Katrin Marcus-Alic (Bochum / DE)
Introduction:
Data analysis in MS-based proteomics today relies primarily on standard identification and quantification workflows. These workflows usually search against a FASTA database that mostly contains only the canonical protein sequences. A search engine matches theoretical spectra against the measured spectra and returns the identified spectra, which are then used to quantify the corresponding peptides and proteins. While these workflows usually suffice for their intended use cases, not all spectra are identified, and only a fraction of the molecules measured by the mass spectrometer are quantified; the remainder is not reported in the final data analysis. Our workflows aim to shed some light on this unknown portion by identifying non-canonical sequences and by quantifying charged molecules, regardless of whether they are identified or not.
Methods:
In our lab, we developed two workflows: a sophisticated FASTA generator (1) and an MS1-based quantification workflow (2).
(1) generates a peptide FASTA database containing already digested peptides, complemented with signal peptides, propeptides and other specially cleaved peptides, as well as peptides containing variants and mutations. This is achieved using protein graphs generated via ProtGraph, the MS2 precursors and a sophisticated traversal algorithm implemented in C++ to retrieve the peptide entries.
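To make the idea of a pre-digested peptide database concrete, the following is a minimal Python sketch and not the ProtGraph-based implementation. It assumes a tryptic cleavage rule (the enzyme is not stated here), handles only single-residue substitutions, and enumerates variant combinations by brute force instead of traversing a protein graph. The limit applies to the number of substitutions applied to the protein at once, which only approximates a per-peptide limit. The names `tryptic_digest`, `apply_variants`, `peptide_database` and `to_fasta` are hypothetical.

```python
import re
from itertools import combinations


def tryptic_digest(sequence, missed_cleavages=0):
    """Cleave after K or R unless the next residue is P (assumed trypsin rule)."""
    # Zero-width split (Python 3.7+); drop the empty piece left by a C-terminal K/R.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def apply_variants(sequence, variants):
    """Apply substitutions given as (0-based position, new residue) pairs."""
    residues = list(sequence)
    for position, new_residue in variants:
        residues[position] = new_residue
    return "".join(residues)


def peptide_database(sequence, variants, max_variants=5, missed_cleavages=0):
    """Collect unique peptides from the canonical and all variant sequences."""
    peptides = set()
    for k in range(max_variants + 1):
        for combo in combinations(variants, k):
            # Skip combinations that substitute the same position twice.
            if len({position for position, _ in combo}) < k:
                continue
            variant_sequence = apply_variants(sequence, combo)
            peptides.update(tryptic_digest(variant_sequence, missed_cleavages))
    return sorted(peptides)


def to_fasta(peptides, prefix="pep"):
    """Write one FASTA entry per peptide; the header scheme is illustrative."""
    return "".join(f">{prefix}_{i}\n{p}\n" for i, p in enumerate(peptides, 1))
```

Calling `peptide_database("MAKGLR", [(3, "K")])` returns the canonical peptides together with those created by the G to K substitution, which introduces a new cleavage site. A graph-based traversal avoids re-digesting the full sequence for every combination, which is why brute force like this does not scale to a whole proteome.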
(2) uses OpenMS to find features (FeatureFinderCentroided). Label-free matching across multiple measurements is performed with MapAlignerTreeGuided and FeatureLinkerUnlabeled, using identified features as anchor points. The quantitative values for all features are then extracted via the XIC extraction of ThermoRawFileParser.
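The principle behind extracting a quantitative value for a feature can be sketched in a few lines of Python. This is an illustration of an extracted ion chromatogram (XIC) on in-memory data, not the ThermoRawFileParser interface. The data layout (a list of `(retention_time, peaks)` tuples), the ppm tolerance and the function names `extract_xic` and `trapezoid_area` are assumptions made for this example.

```python
def extract_xic(spectra, target_mz, tol_ppm, rt_start, rt_end):
    """Sum the MS1 intensity near target_mz for each scan in the RT window.

    spectra: list of (retention_time, [(mz, intensity), ...]) tuples.
    Returns a list of (retention_time, summed_intensity) points.
    """
    delta = target_mz * tol_ppm / 1e6  # absolute m/z tolerance
    trace = []
    for retention_time, peaks in spectra:
        if rt_start <= retention_time <= rt_end:
            intensity = sum(i for mz, i in peaks if abs(mz - target_mz) <= delta)
            trace.append((retention_time, intensity))
    return trace


def trapezoid_area(trace):
    """Integrate the XIC trace with the trapezoidal rule."""
    return sum(
        (rt2 - rt1) * (i1 + i2) / 2
        for (rt1, i1), (rt2, i2) in zip(trace, trace[1:])
    )


# Toy data: three scans inside the window, one outside of it.
spectra = [
    (10.0, [(500.000, 100.0), (600.000, 50.0)]),
    (11.0, [(500.002, 200.0)]),
    (12.0, [(500.000, 100.0)]),
    (20.0, [(500.000, 999.0)]),
]
trace = extract_xic(spectra, target_mz=500.0, tol_ppm=10, rt_start=10.0, rt_end=12.0)
area = trapezoid_area(trace)
```

Because the m/z and retention-time window comes from the feature itself, the same extraction works for features without an identification or an MS2 spectrum.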
Both workflows are implemented in Nextflow, with various Python scripts used as intermediate steps. Identification in the combined workflow is performed with Comet and Percolator at a q-value cut-off of 1%, using the human proteome database in plain-text format and allowing up to 5 variants per peptide.
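What a 1% q-value cut-off means can be illustrated with a plain target-decoy estimate. Percolator computes its q-values differently (it rescores matches with a semi-supervised model first), so the sketch below only shows the general principle. The input layout and the names `q_values` and `accepted_targets` are hypothetical.

```python
def q_values(psms):
    """Estimate q-values from (score, is_decoy) pairs; higher score is better.

    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Simple FDR estimate at this score threshold: decoys / targets.
        fdrs.append(min(1.0, decoys / max(targets, 1)))
    # The q-value is the smallest FDR at this rank or any lower-scoring rank.
    q_reversed = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_reversed.append(running_min)
    return [(s, d, q) for (s, d), q in zip(ranked, reversed(q_reversed))]


def accepted_targets(psms, cutoff=0.01):
    """Keep target matches whose q-value is at or below the cut-off."""
    return [(s, q) for s, d, q in q_values(psms) if not d and q <= cutoff]
```

With `cutoff=0.01`, a target match is kept only if some score threshold that includes it yields an estimated false discovery rate of at most 1%.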
Results:
We applied this combined workflow to measured CSF samples (two groups, DDA). Our results demonstrate the benefit of using a custom-tailored FASTA database from workflow (1), which includes entries that are typically not searched for, thereby increasing the number of identified spectra in the CSF dataset. Workflow (2) shows that charged molecules can be quantified regardless of whether an identification or even an MS2 spectrum is present. By combining and normalizing the results of both workflows, we generated a volcano plot containing unidentified and non-canonical entries alongside the usual canonical entries, illustrating the additional data points gained by this combined workflow.
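The coordinates of such a volcano plot can be derived as in the following Python sketch. The normalization and the statistical test are not specified here, so median centring of log2 intensities and an exact two-sided permutation test are assumptions made for illustration. The function names and the data layout (`{sample: {feature: intensity}}`) are hypothetical. Feature keys can refer to identified peptides as well as unidentified features.

```python
import math
from itertools import combinations
from statistics import mean, median


def median_normalize(samples):
    """Return log2 intensities centred on each sample's median."""
    normalized = {}
    for name, features in samples.items():
        logs = {f: math.log2(v) for f, v in features.items() if v > 0}
        centre = median(logs.values())
        normalized[name] = {f: x - centre for f, x in logs.items()}
    return normalized


def permutation_p(a, b):
    """Exact two-sided permutation p-value for the difference in means."""
    pooled = list(a) + list(b)
    observed = abs(mean(b) - mean(a))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small slack guards against floating-point ties.
        if abs(mean(group_b) - mean(group_a)) >= observed - 1e-12:
            hits += 1
    return hits / total


def volcano_points(normalized, group_a, group_b):
    """Map each shared feature to (log2 fold change, -log10 p-value)."""
    shared = set.intersection(*(set(normalized[s]) for s in group_a + group_b))
    points = {}
    for feature in sorted(shared):
        a = [normalized[s][feature] for s in group_a]
        b = [normalized[s][feature] for s in group_b]
        points[feature] = (mean(b) - mean(a), -math.log10(permutation_p(a, b)))
    return points
```

An exact permutation test is only informative with enough samples per group: with two samples per group the smallest possible p-value is 1/3, so a real analysis would need larger groups or a parametric test.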