Marcel Thiel (Gdańsk / PL), Michał Puchalski (Gdańsk / PL), Alicja Różycka (Gdańsk / PL), Stanisław Ołdziej (Gdańsk / PL)
The results of proteomic/peptidomic studies depend heavily on the amino acid sequence databases used during analysis. In recent years, thanks to a deeper understanding of how biological systems function (gained by interpreting and processing DNA sequence data), amino acid sequence databases have been enriched with sequences resulting from alternative RNA splicing (isoforms) and with products of short ORFs or pseudogenes. However, even these additions do not capture the full diversity of the proteome. Amino acid sequence databases offer only a canonical sequence representation of the product (or products) of a given gene. Moreover, the genetic material within a species is not identical across individuals: the genomes (and therefore the proteomes) of individuals of the same species contain a huge number of natural variants. Identifying natural variants in proteomic/peptidomic studies is virtually impossible, because the relevant information is not systematized, readily available, or organized in a form suitable for the software used to process mass spectrometry data.
Based on publicly available data resources (the UniProt and NCBI ClinVar databases), we built a database of natural protein variants, together with an ecosystem of bioinformatic tools, named AliceDB. It is used to identify natural variants from mass spectrometry data. Currently, AliceDB covers only the human proteome (more than 3.3 million records). Nevertheless, the software allows a database to be built for any proteome. AliceDB is designed to work with both commercial and non-commercial software used to process and analyze mass spectrometry data.
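The general idea of a variant-aware sequence database can be illustrated with a minimal Python sketch. It is not the AliceDB implementation: the `Variant` record, the function names, the toy accession `TOY1` and the FASTA header layout are all hypothetical, and only single amino acid substitutions are handled. It shows how a canonical sequence plus a list of substitutions could be expanded into extra FASTA entries that a search engine can read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for one single amino acid substitution."""
    position: int  # 1-based position in the canonical sequence
    ref: str       # residue expected in the canonical sequence
    alt: str       # residue observed in the variant


def apply_variant(sequence: str, variant: Variant) -> str:
    """Return the canonical sequence with one substitution applied."""
    idx = variant.position - 1
    if not 0 <= idx < len(sequence):
        raise ValueError(f"position {variant.position} is outside the sequence")
    if sequence[idx] != variant.ref:
        raise ValueError(
            f"expected {variant.ref} at position {variant.position}, "
            f"found {sequence[idx]}"
        )
    return sequence[:idx] + variant.alt + sequence[idx + 1:]


def variant_fasta(accession: str, sequence: str, variants: list[Variant]) -> str:
    """Build FASTA text with one entry per variant.

    The header layout (accession_RefPosAlt) is an assumption made for
    this sketch, not the format used by AliceDB.
    """
    entries = []
    for v in variants:
        header = f">{accession}_{v.ref}{v.position}{v.alt}"
        entries.append(f"{header}\n{apply_variant(sequence, v)}\n")
    return "".join(entries)


if __name__ == "__main__":
    # Toy sequence and variants, invented for illustration only.
    canonical = "MKTAYIAKQR"
    toy_variants = [Variant(4, "A", "V"), Variant(9, "Q", "H")]
    print(variant_fasta("TOY1", canonical, toy_variants), end="")
```

Checking the reference residue before substituting guards against position mismatches between the variant source and the sequence version, which is a common failure mode when combining records from different resources.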
The performance of the developed tool was tested on two datasets obtained from the PRIDE repository (PXD048837 and PXD024347). Preliminary results indicate that including natural variants in the processing of peptidomic/proteomic data increases the number of peptide identifications by 2-11% (depending on the sample type). A higher number of identified peptides leads to a higher number of protein identifications (an increase of 2-6%). As a result, the statistical significance of amino acid sequence identifications also improves. AliceDB has significant potential in the analysis of data of medical importance. At present, information about possible natural variants comes mainly from DNA sequencing data (genomics) or mRNA sequencing data (transcriptomics). Until now, identification of natural variants in the protein products of genes, especially those of marker or diagnostic importance, has in practice been limited to targeted analyses focused on a few predefined variants. AliceDB enables the identification of all natural variants, greatly expanding the application of mass spectrometry in studies related to medical diagnostics.