Yuliya Burankova (Hamburg / DE; Freising / DE), Miriam Abele (Freising / DE), Mohammad Bakhtiari (Hamburg / DE), Fabian Gruhn (Munich / DE), Christine von Toerne (Munich / DE), Teresa Barth (Planegg / DE), Lisa Schweizer (Martinsried / DE), Pieter Giesbertz (Goettingen / DE), Johannes Schmidt (Leipzig / DE), Stefan Kalkhof (Leipzig / DE; Coburg / DE), Janina Müller-Deile (Erlangen / DE), Yassene Mohammed (Leiden / NL), Elke Hammer (Greifswald / DE), Lis Arend (Hamburg / DE; Freising / DE), Klaudia Adamowicz (Hamburg / DE), Tanja Laske (Hamburg / DE), Anne Hartebrodt (Erlangen / DE), Tobias Frisch (Odense / DK), Chen Meng (Freising / DE), Julian Matschinske (Hamburg / DE), Julian Späth (Hamburg / DE), Richard Röttger (Odense / DK), Veit Schwämmle (Odense / DK), Stefanie M. Hauck (Munich / DE), Stefan Lichtenthaler (Goettingen / DE), Axel Imhof (Planegg / DE), Matthias Mann (Martinsried / DE), Christina Ludwig (Freising / DE), Bernhard Küster (Freising / DE), Jan Baumbach (Hamburg / DE), Olga Zolotareva (Hamburg / DE; Freising / DE)
Expanding proteomics data availability unlocks potential for large-scale biomedical research, but integrating patient-derived mass spectrometry data can be problematic due to privacy concerns. Protein expression profiles, just like transcriptomics data, can be subject to genotype reconstruction attacks and must be treated as confidential information [1]. To enable privacy-preserving, robust analysis of distributed proteomic data despite data heterogeneity, we designed FedProt, the first tool for federated differential protein abundance analysis.
FedProt is mathematically equivalent to DEqMS, a state-of-the-art modification of the limma method [2], and combines federated learning (FL) [3] with additive secret sharing [4]. In FL, multiple parties holding sensitive data jointly participate in a computation without revealing their data. The workflow is split into steps that each party performs locally and aggregation steps that a trusted server performs globally. In FedProt, this approach yields the same result as a centralized analysis of the pooled data, but without violating privacy.
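The combination of local computation and secret-shared aggregation can be illustrated with a minimal sketch. It is not FedProt's actual protocol, whose details are not given here. It assumes a single protein, a two-condition design without batch covariates, and an ordinary least-squares fit; the names `PRIME`, `SCALE`, `local_statistics`, `secure_sum`, and `fit_line` are hypothetical, and all parties are simulated in one process. Each cohort computes only sums over its own samples, splits every sum into random additive shares, and the shares are recombined into the global totals from which the model is fitted.

```python
import secrets

PRIME = 2**127 - 1   # modulus for share arithmetic (assumed choice)
SCALE = 10**6        # fixed-point factor; precision is limited to 1 / SCALE


def encode(value):
    """Map a real number to a fixed-point residue modulo PRIME."""
    return round(value * SCALE) % PRIME


def decode(residue):
    """Invert encode(); residues above PRIME // 2 represent negative numbers."""
    if residue > PRIME // 2:
        residue -= PRIME
    return residue / SCALE


def make_shares(value, n_parties):
    """Split a value into n_parties random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((encode(value) - sum(shares)) % PRIME)
    return shares


def local_statistics(x, y):
    """Sufficient statistics of y ~ a + b * x, computed inside one cohort."""
    return [
        len(x),
        sum(x),
        sum(xi * xi for xi in x),
        sum(y),
        sum(xi * yi for xi, yi in zip(x, y)),
    ]


def secure_sum(per_party_stats):
    """Sum each statistic across parties using additive shares only."""
    n_parties = len(per_party_stats)
    totals = []
    for j in range(len(per_party_stats[0])):
        # shares[i][k] is the share that party i would send to party k
        shares = [make_shares(stats[j], n_parties) for stats in per_party_stats]
        # party k adds the shares it received and publishes only that partial sum
        partial = [
            sum(shares[i][k] for i in range(n_parties)) % PRIME
            for k in range(n_parties)
        ]
        totals.append(decode(sum(partial) % PRIME))
    return totals


def fit_line(n, sx, sxx, sy, sxy):
    """Solve the 2x2 normal equations for intercept and slope."""
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Toy data (made up for illustration): x is the condition label, y a log2 intensity.
cohorts = [
    ([0, 0, 1, 1], [20.1, 19.9, 21.2, 21.0]),
    ([0, 1, 1], [20.4, 21.5, 21.3]),
    ([0, 0, 0, 1], [19.8, 20.0, 20.2, 21.1]),
]

federated = fit_line(*secure_sum([local_statistics(x, y) for x, y in cohorts]))

pooled_x = [xi for x, _ in cohorts for xi in x]
pooled_y = [yi for _, y in cohorts for yi in y]
pooled = fit_line(*local_statistics(pooled_x, pooled_y))

print("federated:", federated)
print("pooled:   ", pooled)
```

Because the normal equations depend on the data only through these sums, the federated fit agrees with the pooled fit up to the fixed-point and floating-point rounding introduced by the encoding. No single share or partial sum reveals an individual cohort's statistics.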
We evaluated FedProt using two multi-center datasets: a TMT human plasma dataset of 60 samples (2 conditions, 3 cohorts) and an LFQ dataset of 118 Escherichia coli samples (2 conditions, 5 cohorts). All MS data were uniformly preprocessed, quantified, and submitted to FedProt, which identified differentially abundant protein groups. We conducted a central DEqMS analysis as a baseline and compared the results of FedProt against it. Additionally, we studied class label and cohort size imbalance using simulated data.
In all tests, FedProt effectively handled batch effects and produced results that closely matched the DEqMS baseline; fold-changes and negative log-transformed adjusted p-values were almost identical (maximal differences no greater than 1 × 10⁻¹¹). FedProt can analyze proteins absent in some cohorts and matches centralized analysis more precisely than typical meta-analysis approaches such as Fisher's or Stouffer's methods, RankProd, or the random effects model. To date, it is the only privacy-preserving approach to differential protein abundance analysis.
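For orientation, two of the meta-analysis baselines named above can be sketched in a few lines. This is a generic illustration of Fisher's and Stouffer's methods, not the configuration used in the evaluation. The function names are hypothetical, the example p-values are made up, and Stouffer's method is shown for one-sided p-values with optional weights.

```python
import math
from statistics import NormalDist


def fisher_combine(p_values):
    """Fisher's method: -2 * sum(ln p) is chi-square with 2k degrees of freedom."""
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # test statistic divided by 2
    # Closed-form chi-square survival function for an even number (2k) of
    # degrees of freedom: exp(-h) * sum_{i<k} h**i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def stouffer_combine(p_values, weights=None):
    """Stouffer's method: weighted sum of z-scores from one-sided p-values."""
    norm = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)
    z = sum(w * norm.inv_cdf(1.0 - p) for w, p in zip(weights, p_values))
    z /= math.sqrt(sum(w * w for w in weights))
    return 1.0 - norm.cdf(z)


# Hypothetical per-cohort p-values for one protein.
cohort_p = [0.01, 0.02, 0.03]
print("Fisher:  ", fisher_combine(cohort_p))
print("Stouffer:", stouffer_combine(cohort_p))
```

Both methods see only one p-value per cohort, so effect sizes and variances are not pooled across cohorts. This is one plausible reason why such approaches can deviate from a centralized analysis.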
Using FL and additive secret sharing, FedProt effectively manages proteomic data complexity, including missing values and batch effects. It increases sample sizes and statistical power while minimizing privacy risks, without compromising accuracy. Accessible as a user-friendly FeatureCloud App (https://featurecloud.ai/app/fedprot), FedProt could catalyze larger collaborations and advance proteomics research, improving the robustness of differential protein abundance analysis results.
[1] Geyer, P.E., et al. Mol Cell Proteomics. 20, 2021, 100035.
[2] Zhu, Y., et al. Mol Cell Proteomics. 19, 2020, 1047–1057.
[3] McMahan, B., et al. PMLR, AI and Statistics, 2017, pp. 1273–1282.
[4] Cramer, R., et al. SMPC. Cambridge University Press, 2015.