Karl A. T. Makepeace (Victoria / CA), Pallab Bhowmick (Victoria / CA), David Goodlett (Victoria / CA), Peter van Veelen (Leiden / NL), Christoph H. Borchers (Montreal / CA), Yassene Mohammed (Leiden / NL; Montreal / CA)
Proteogenomics integrates genomic information with proteomic characterization to enable the identification of protein variants specific to a genotype. Proteogenomic characterization of, for example, a human sample is therefore a personalized proteomic analysis of that sample based on the genomic data of the person it came from. We have developed the Proteogenomic Improved and Guided Quantification Pipeline (PIGQpipe), a software tool for designing targeted proteogenomic experiments. PIGQpipe combines our experimentally derived assay development experience with data from public online databases to facilitate targeted proteogenomics assay design. PIGQpipe features interfaces for Genomic Data Commons (GDC) and Ensembl data. GDC curates and provides access to somatic mutations identified in cancer patients, while Ensembl documents a broader range, including both germline and somatic mutations, across multiple species (Figure 1). Our tool also facilitates analysis of COSMIC Cell Lines Project (CLP) data. To process the genomic data, we developed a set of routines packaged as an R library. This library features parsing functionality for Human Genome Variation Society (HGVS) nomenclature, enabling the prediction of potential protein consequences from standardized genetic variant descriptions. This is crucial because variants such as frameshift, termination, and untranslated region (UTR) variants require more advanced handling logic than a simple single amino acid substitution. This feature allows our pipeline to integrate with various databases and pipelines focused on genomic variants.
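To illustrate what predicting a protein consequence from an HGVS description involves, the following is a minimal sketch in Python. It is not the PIGQpipe R library: the function name `apply_protein_variant`, the regular expression, and the toy sequence are all hypothetical. The sketch covers only the two simplest protein-level cases, a single amino acid substitution (e.g. `p.Arg3His`) and a termination variant (e.g. `p.Ala4Ter`). Frameshift and UTR variants need the transcript sequence and are deliberately left out.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Hypothetical pattern for this sketch: p.<Ref><position><Alt>, where Alt is
# a three-letter code, "Ter", or "*". Real HGVS has many more forms.
HGVS_PROTEIN = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2}|\*)$")


def apply_protein_variant(sequence: str, hgvs: str) -> str:
    """Return the protein sequence after a substitution or termination variant."""
    match = HGVS_PROTEIN.match(hgvs)
    if match is None:
        raise ValueError(f"unsupported HGVS description: {hgvs}")
    ref3, position, alt3 = match.group(1), int(match.group(2)), match.group(3)

    ref = THREE_TO_ONE.get(ref3)
    if ref is None:
        raise ValueError(f"unsupported reference residue: {ref3}")
    # HGVS positions are 1-based; check that the reference residue matches.
    if position < 1 or position > len(sequence) or sequence[position - 1] != ref:
        raise ValueError(f"reference residue mismatch at position {position}")

    # Termination: the protein is truncated just before the stop position.
    if alt3 in ("Ter", "*"):
        return sequence[: position - 1]

    alt = THREE_TO_ONE.get(alt3)
    if alt is None:
        raise ValueError(f"unsupported alternate residue: {alt3}")
    return sequence[: position - 1] + alt + sequence[position:]


if __name__ == "__main__":
    toy = "MKRAH"  # toy sequence, not a real protein
    print(apply_protein_variant(toy, "p.Arg3His"))
    print(apply_protein_variant(toy, "p.Ala4Ter"))
```

Checking the reference residue against the sequence before applying the change guards against applying a variant to the wrong isoform or coordinate system.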
Using the Jurkat cell line with TP53 as the target protein and considering three proteases (glutamyl endopeptidase, neutrophil elastase, and trypsin), the analysis takes approximately 20 seconds on a consumer laptop to produce all possible theoretical targeted assays that can be used to quantify known TP53 mutations specific to that cell line. We have processed all known human germline mutations from Ensembl to generate an overview, and we have also processed all somatic mutations in all 1000+ sequenced cell lines available for research, as hosted by the COSMIC knowledgebase (Figure 1). The results suggest that using neutrophil elastase increases the coverage of mutation consequences detectable at the protein level. To validate PIGQpipe, we reanalyzed publicly available deep proteomics datasets, available from PRIDE, from 9 cell lines representing the 9 cancer tissues of the NCI-60 [PMID: 23933261]. We also acquired in-depth proteomics datasets from two melanoma patient-derived cells using trypsin and neutrophil elastase as proteases. The experimental data confirmed the presence of protein variants predicted by our pipeline. Although our focus has been on proteogenotypic peptides for targeted proteogenomics assays, PIGQpipe is generic and can support general proteogenomics data analysis.
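The idea of enumerating theoretical variant-specific peptides per protease can be sketched as an in silico digestion of the wild-type and variant sequences followed by a set difference. This Python sketch is an illustration only, not the PIGQpipe implementation. The cleavage rules are deliberately simplified assumptions (trypsin after K or R unless followed by P; glutamyl endopeptidase after E; neutrophil elastase after A or V), missed cleavages are ignored, and the names `CLEAVAGE_RULES`, `digest`, and `variant_specific_peptides` as well as the toy sequences are hypothetical.

```python
# Simplified cleavage rules (assumptions for this sketch):
# protease -> (residues cleaved after, residues that block cleavage if next)
CLEAVAGE_RULES = {
    "trypsin": ("KR", "P"),
    "glutamyl endopeptidase": ("E", ""),
    "neutrophil elastase": ("AV", ""),
}


def digest(sequence: str, protease: str) -> list[str]:
    """Fully digest a sequence, cleaving C-terminal to the protease's residues."""
    cut_after, blocked_by = CLEAVAGE_RULES[protease]
    peptides = []
    start = 0
    # The last residue never produces a cut, so stop one position early.
    for i in range(len(sequence) - 1):
        if sequence[i] in cut_after and sequence[i + 1] not in blocked_by:
            peptides.append(sequence[start : i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


def variant_specific_peptides(
    wild_type: str, variant: str, protease: str, min_length: int = 1
) -> list[str]:
    """Return peptides from the variant digest that are absent from the wild type."""
    wild_type_peptides = set(digest(wild_type, protease))
    return [
        peptide
        for peptide in digest(variant, protease)
        if peptide not in wild_type_peptides and len(peptide) >= min_length
    ]


if __name__ == "__main__":
    wild_type = "MKAHEVLR"  # toy sequence, not TP53
    variant = "MKAREVLR"    # toy His4Arg substitution
    for protease in CLEAVAGE_RULES:
        print(protease, variant_specific_peptides(wild_type, variant, protease))
```

In this toy case the substitution creates a new tryptic cleavage site, so trypsin yields two short variant-specific peptides, while the other two proteases each yield a single peptide spanning the mutated residue. Differences of this kind are why comparing several proteases can change which variants are practically detectable.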