Yanick Paco Hagemeijer (Groningen / NL), Yuan Hu (Groningen / NL), Tessa Gilette (Groningen / NL), Frank Klont (Groningen / NL), Cornelia Verhoeven (Groningen / NL), Justina Clarinda Wolters (Groningen / NL), Gyorgy Halmos (Groningen / NL), Victor Guryev (Groningen / NL), Peter Horvatovich (Groningen / NL)
1. Introduction
Common proteomics workflows use manually curated public protein sequences, which comprise either canonical sequences (one sequence per protein-coding gene) or a limited number of well-known, frequently occurring isoforms and protein variants. This limits the ability to study the presence and effects of protein variants in proteomics studies in basic biology and clinical research. Transcriptomics data are often used as a proxy for protein variants; however, only a fraction of these variants can be detected at sufficient levels in the proteome. In this work we present a generic proteogenomics data integration workflow [1], the Groningen Proteogenomics Workflow (GPW), which predicts protein sequence variants from polyadenylated transcriptomics data obtained from the same sample and connects them to the proteomics results. Besides the well-known annotated sequences, GPW is designed to predict a wide variety of protein variants, such as single amino acid variants, indels, splice isoforms and translated sequences from new open reading frames (ORFs).
2. Approach
GPW is implemented in Python and Nextflow [2] and includes mRNA read quality assessment (FastQC), alignment to the reference genome (STAR), splice isoform prediction (StringTie), de novo assembly of unmapped reads (Trinity), variant calling (GATK, bcftools), functional annotation (VEP) and translation of mRNA sequences into protein amino acid sequences (TransDecoder). The translated sequences are used for peptide and protein identification in data-dependent and data-independent acquisition (DDA/DIA) LC-MS/MS proteomics data, which are analyzed with the FragPipe [3] toolset. GPW produces tables and plots summarizing the different protein variant categories.
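To illustrate the translation step, the following minimal Python sketch translates a transcript in the three forward reading frames, keeps the longest ORF that starts with a methionine and ends at a stop codon, and then lists single amino acid differences between a reference and a variant protein. It is not GPW or TransDecoder code: the function names `translate`, `longest_orf` and `single_aa_differences` and the toy sequences are hypothetical, and the ORF rule is a deliberate simplification (forward strand only, complete ORFs only, no coding-potential scoring).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a nucleotide string codon by codon ('*' = stop, 'X' = unknown)."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def longest_orf(transcript):
    """Return the longest Met-to-stop protein over the three forward frames.

    Simplifying assumption (hypothetical rule): only ORFs that are
    terminated by a stop codon are considered.
    """
    best = ""
    for frame in range(3):
        segments = translate(transcript[frame:]).split("*")
        # The last segment is not followed by a stop codon, so skip it.
        for segment in segments[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best


def single_aa_differences(ref, alt):
    """List (1-based position, reference residue, variant residue) tuples."""
    if len(ref) != len(alt):
        raise ValueError("proteins differ in length; not a pure substitution")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(ref, alt)) if a != b]


# Toy transcripts (made up for illustration): one C>T change in the second codon.
reference_protein = longest_orf("GGATGGCCAAATAGCC")
variant_protein = longest_orf("GGATGGTCAAATAGCC")
differences = single_aa_differences(reference_protein, variant_protein)
```

In this toy case the single nucleotide change turns the alanine codon GCC into the valine codon GTC, so `differences` holds one substitution at protein position 2. A variant-aware search database would then contain both protein sequences.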
3. Results and discussion
The power of GPW is demonstrated on laryngeal squamous cell carcinoma samples and healthy mucosa controls from the same site. The analysis showcases the effect of protein variants on tumor characteristics in young and elderly patients and identifies the genes with missense variants in laryngeal cancer and in healthy tissue controls of the same patients.
References
[1] Hagemeijer YP, Guryev V, Horvatovich P. Accurate Prediction of Protein Sequences for Proteogenomics Data Integration. Methods Mol Biol. 2022;2420:233-260.
[2] Di Tommaso P, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35:316-319.
[3] Yu F, Teo GC, Kong AT, Fröhlich K, Li GX, Demichev V, Nesvizhskii AI. Analysis of DIA proteomics data using MSFragger-DIA and FragPipe computational platform. Nat Commun. 2023;14:4154.