Devon Kohler (Boston, MA / US), Karen Sachs (Washington, DC / US; Palo Alto, CA / US; San Diego, CA / US), Jeremy Zucker (Richland, WA / US), Benjamin M. Gyori (Boston, MA / US), Lindsay Pino (Seattle, WA / US), Olga Vitek (Boston, MA / US)
Understanding the proteome's response to perturbations is an important step towards a full understanding of its function. Traditionally, this is done experimentally, by perturbing the system with external agents such as drugs and observing the resulting effects on protein abundance. Alternatively, machine learning methods such as neural networks can be used to estimate the effect of perturbations by considering the joint coregulation of many proteins. These approaches can be successful in some circumstances; however, they do not incorporate prior knowledge about the mechanisms that link proteins. In contrast, causal inference methods make explicit use of prior knowledge and have been shown to perform well in estimating the effects of perturbations, for example, using transcriptomics data. These methods take as input abundances from an observational experiment (i.e., an experiment that does not implement a perturbation) and a graph of known causal relationships, and estimate the impact of perturbations on the proteins in the graph. Here we describe a novel approach to translating causal inference methods to mass spectrometry (MS)-based proteomics, moving beyond the characterization of associations.
Implementing causal inference methods for MS-based proteomics poses a number of computational challenges that must be overcome for the methods to be effective. First, MS proteomics data are inherently noisy, are subject to batch effects and missing values, and quantify peptide fragments rather than proteins. Second, creating an accurate graph of causal relationships between proteins is non-trivial. We leverage the INDRA database of biological mechanisms, which combines the content of pathway knowledgebases with literature mining, to extract causal relations between the proteins quantified in the experiment. We employ a variety of filtering strategies to create a context-specific network with edges that support the data-generating process. Finally, we develop a Bayesian model that uses the network to model the relationships between proteins and is fit to the observational data, enabling model-based inference.
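The general idea of fitting a graph-structured model to observational data and then querying it under an intervention can be sketched as follows. This is only an illustration under strong assumptions and not the model described here: the three-protein graph, the simulated abundances, and every function name are hypothetical, and ordinary least squares on a linear structural equation model stands in for the Bayesian model.

```python
import numpy as np

# Hypothetical causal graph (an assumption for illustration): A -> B -> C and A -> C.
# Each protein maps to the list of its parents.
GRAPH = {"A": [], "B": ["A"], "C": ["A", "B"]}
ORDER = ["A", "B", "C"]  # a topological order of the graph


def simulate(n, rng):
    """Simulate observational log-abundances from a known linear model."""
    a = rng.normal(10.0, 1.0, n)
    b = 2.0 + 0.8 * a + rng.normal(0.0, 0.3, n)
    c = 1.0 + 0.5 * a - 0.6 * b + rng.normal(0.0, 0.3, n)
    return {"A": a, "B": b, "C": c}


def fit(data, graph):
    """Regress each protein on its parents (intercept first) by least squares."""
    params = {}
    for node, parents in graph.items():
        columns = [np.ones(len(data[node]))] + [data[p] for p in parents]
        design = np.column_stack(columns)
        coef, *_ = np.linalg.lstsq(design, data[node], rcond=None)
        params[node] = coef
    return params


def predict_do(params, graph, order, interventions):
    """Expected abundance of every protein under do(interventions).

    An intervened protein is fixed to its assigned value, which cuts it off
    from its parents; all other proteins follow their fitted equations.
    """
    mean = {}
    for node in order:
        if node in interventions:
            mean[node] = interventions[node]
        else:
            coef = params[node]
            mean[node] = coef[0] + sum(
                w * mean[p] for w, p in zip(coef[1:], graph[node])
            )
    return mean


rng = np.random.default_rng(0)
data = simulate(5000, rng)               # observational data only
params = fit(data, GRAPH)
pred = predict_do(params, GRAPH, ORDER, {"B": 9.0})
print({k: round(float(v), 2) for k, v in pred.items()})
```

Because the simulated ground truth is known, the predicted mean of C under do(B = 9) can be compared with the value implied by the simulation equations, 1 + 0.5 × 10 − 0.6 × 9 = 0.6. A Bayesian treatment would additionally yield a posterior distribution over such predictions rather than a single point estimate.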
We validate this method on synthetic data, then demonstrate it on real biological experiments. We highlight areas where machine learning methods such as neural networks fail to accurately predict the effect of interventions and show how causal inference can overcome this. We use computer simulations to investigate the circumstances in which the causal inference methods are likely to succeed, such as favorable network topology and number of biological replicates. Finally, we demonstrate the accuracy of the method using an experiment that investigates the effect of drug compounds on the chromatin-binding activity of transcription factors. In this setting, the causal model is trained on the observational DMSO control data and evaluated by comparing the model's predictions of the effect of drug interventions against experimental measurements.