Anthony Wu (Boston, MA / US), Klas Karis (Boston, MA / US), Lindsay Pino (Seattle, WA / US), Benjamin M. Gyori (Boston, MA / US), Olga Vitek (Boston, MA / US)
Interpreting results from MS-based proteomic experiments, particularly analyses of differential protein abundance, often relies on statistical outputs such as lists of p-values that are devoid of biological context. Contextualizing these outputs with respect to existing biomolecular knowledge remains a challenge. Manually curated biological pathway databases offer valuable insights, but they capture only a fraction of the available literature, leaving many results unexplained. In early-stage discovery experiments especially, where few differentially abundant proteins are expected, interpretation tools become critical for deriving meaningful hypotheses for future experiments.
To address this gap, we introduce a software integration between MSstats, a family of open-source packages for detecting differentially abundant proteins, and INDRA, an automated system that uses natural language processing to extract biomolecular network information from biomedical literature at scale. INDRA also incorporates data from existing protein interaction databases, such as Phospho.ELM and Reactome, and provides a database client that lets users query for relations programmatically. We implemented a new algorithm within MSstats that sends a list of differentially abundant proteins to INDRA's REST API and retrieves a list of relevant biological networks. From these networks, the algorithm constructs a network visualization annotated with both MSstats and INDRA results, enabling further exploratory analysis. With this integration, users can explore proteins with significant p-values in the context of surrounding mechanisms, linked directly to context-specific literature evidence.
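The workflow described above (select differentially abundant proteins, query for relations among them, then annotate the resulting network) can be illustrated with a minimal Python sketch. It is not the MSstats implementation, which is written in R, and it makes no network calls. The payload shape, the function names, and the relation tuples are all hypothetical placeholders; they are not INDRA's actual REST API schema, and the values are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProteinResult:
    """One row of a differential abundance table (hypothetical structure)."""
    gene: str
    log2_fc: float
    adj_pvalue: float


def select_significant(results, alpha=0.05):
    """Keep proteins whose adjusted p-value falls below the cutoff."""
    return [r for r in results if r.adj_pvalue < alpha]


def build_query(proteins):
    """Build a request payload. The 'proteins' key is an assumed shape,
    not the real INDRA schema."""
    return {"proteins": sorted({p.gene for p in proteins})}


def annotate_network(relations, proteins):
    """Combine relations with statistical results.

    `relations` is a list of (source, target, relation_type, evidence_count)
    tuples standing in for a service response. Only relations whose two
    endpoints are both in `proteins` are kept.
    """
    stats = {p.gene: p for p in proteins}
    nodes = {
        gene: {"log2_fc": p.log2_fc, "adj_pvalue": p.adj_pvalue}
        for gene, p in stats.items()
    }
    edges = [
        {"source": s, "target": t, "type": kind, "evidence_count": n}
        for s, t, kind, n in relations
        if s in stats and t in stats
    ]
    return {"nodes": nodes, "edges": edges}


# Placeholder inputs: invented values, not experimental results.
results = [
    ProteinResult("GENE_A", 1.8, 0.01),
    ProteinResult("GENE_B", -0.2, 0.60),
    ProteinResult("GENE_C", -1.1, 0.03),
]
significant = select_significant(results)
payload = build_query(significant)

# Stand-in for what a relation service might return for `payload`.
relations = [
    ("GENE_A", "GENE_C", "Phosphorylation", 4),
    ("GENE_A", "GENE_X", "Activation", 2),
]
network = annotate_network(relations, significant)
```

In this sketch each node carries its fold change and adjusted p-value, and each edge carries an evidence count, so a plotting layer could style the network using both sources of annotation.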
We demonstrate this integration on a data set measuring the effect of small-molecule compounds on the activity of transcription factors. Intact, live cells from the THP-1 cell line were treated with one of eight compounds, and chromatin-bound proteins were then measured using data-independent acquisition (DIA), with identification and quantification performed in DIA-NN. The data set comprised 270 MS runs: 132 runs for the DMSO group and 16-18 runs for each compound. We processed the DIA results through an MSstats pipeline, incorporating an additional step to retrieve data from INDRA and visualize the results. Through this case study, we show that the new integration between MSstats and INDRA can uncover plausible biological explanations seamlessly, even when statistical results contain few differentially abundant proteins.
Our work highlights an essential step toward bridging the gap between statistical analysis and biological interpretation, allowing researchers to derive meaningful insights from proteomics experiments.