Mateusz Staniak (Wrocław / PL; Hasselt / BE), Ting Huang (Boston, MA / US), Amanda Figueroa-Navedo (Boston, MA / US), Devon Kohler (Boston, MA / US), Meena Choi (San Francisco, CA / US), Trent Hinkle (San Francisco, CA / US), Tracy Kleinheinz (San Francisco, CA / US), Robert Blake (San Francisco, CA / US), Christopher Rose (San Francisco, CA / US), Małgorzata Bogdan (Wrocław / PL; Lund / SE), Olga Vitek (Boston, MA / US)
Bottom-up mass spectrometry-based proteomics studies changes in protein abundance and structure across conditions. Since the currency of these experiments is peptides, i.e., subsequences of proteins that carry the quantitative information, conclusions at other levels, e.g., at the level of proteins or of post-translational modifications, must be computationally inferred. The inference is particularly challenging when peptides are shared by multiple proteins or post-translational modifications. While many approaches infer the underlying abundances from unique peptides only, there is a need to distinguish the quantitative patterns when peptides are shared.
From a statistical perspective, including shared peptides in the estimation of abundances of proteins or post-translational modifications induces a data structure in which observations (peptide intensities) may belong to multiple groups defined by proteins or modification sites. Typically, shared peptides are removed from the analysis of MS data, which leads to loss of information; in particular, some proteins or post-translational modifications may be lost from the analysis. Alternatively, proteins that share peptides are grouped together, which eliminates the possibility of estimating their distinct quantitative patterns.
We introduce a statistical approach for estimating protein abundances, as well as site occupancies of post-translational modifications, based on quantitative information that includes shared peptides. This approach extends the existing MSstatsTMT framework for summarization and differential analysis of labeled MS data. It treats the quantitative patterns of shared peptides as convex combinations of the abundances of individual proteins or modification sites, and estimates the abundance of each source in a sample together with the weights of the combination. Abundances estimated using this method can serve as input to statistical models that estimate differences between experimental conditions. We demonstrate the utility of this new summarization method using computer simulations and examples based on data from experiments with diverse biological objectives, including protein degradation and changes in protein post-translational modifications.
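As a rough illustration of the convex-combination idea, the relationship for a single shared peptide could be written as follows. The notation is introduced here for illustration only and is not taken from the abstract: y_{ps} denotes the observed intensity of peptide p in sample s, S(p) the set of proteins or modification sites that peptide p maps to, x_{ks} the abundance of source k in sample s, w_{pk} the weight of source k for peptide p, and \varepsilon_{ps} an error term. The scale of the intensities (raw or log-transformed) and the form of the error term are assumptions left unspecified.

\[
y_{ps} = \sum_{k \in S(p)} w_{pk}\, x_{ks} + \varepsilon_{ps},
\qquad w_{pk} \ge 0,
\qquad \sum_{k \in S(p)} w_{pk} = 1 .
\]

Under this sketch, a unique peptide is the special case in which S(p) contains a single source with weight 1, so the intensity reflects that source's abundance alone.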