Chengxin Dai (Beijing / CN), Julianus Pfeuffer (Berlin / DE), Hong Wang (Chongqing / CN), Ping Zheng (Chongqing / CN), Lukas Käll (Stockholm / SE), Timo Sachsenberg (Tuebingen / DE), Vadim Demichev (Berlin / DE), Mingze Bai (Chongqing / CN), Oliver Kohlbacher (Tuebingen / DE), Yasset Perez Riverol (Cambridgeshire / GB)
Background
The increasing volume of public proteomics data presents significant computational challenges for large-scale reanalysis. To address this, we introduce quantms (https://quantms.org/), an open-source, cloud-based pipeline designed for massively parallel proteomics data analysis. The quantms.org resource includes results for 16,599 proteins, of which 16,270 were quantified in normal tissues, 11,374 in cell lines, and 4,993 in human plasma.
Methods
The workflow begins by parsing the input files and then performs peptide identification with multiple search engines, including SAGE, Comet, and MSGF+. It integrates tools such as ConsensusID for scoring and protein inference and supports post-translational modifications with LuciPHOr2. For label-free quantification (LFQ), we developed proteomicsLFQ, which employs both spectral counting and intensity-based methods. For isobaric labelling, quantms uses IsobaricAnalyzer for reporter ion normalisation, and the workflow was benchmarked against gold-standard datasets. Data-independent acquisition (DIA) analysis is facilitated by parallelising the DIA-NN tool across compute nodes, ensuring robust identification and quantification even at varying protein concentrations. The downstream analysis integrates MSstats for differential expression and pmultiqc for quality control. All tools are available as BioConda packages and BioContainers, which ensures compatibility with various computational infrastructures and seamless submission to PRIDE and ProteomeXchange.
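The difference between spectral counting and intensity-based label-free quantification can be illustrated with a minimal Python sketch. It is not the proteomicsLFQ implementation: the peptide-spectrum matches, the function names, and the rule of keeping the highest intensity per peptide before summing per protein are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical peptide-spectrum matches: (protein, peptide, intensity).
# These values are invented for illustration only.
psms = [
    ("P1", "PEPTIDEA", 1.0e6),
    ("P1", "PEPTIDEA", 3.0e6),
    ("P1", "PEPTIDEB", 2.0e6),
    ("P2", "PEPTIDEC", 5.0e5),
]


def spectral_counts(psms):
    """Spectral counting: count the matched spectra per protein."""
    counts = defaultdict(int)
    for protein, _peptide, _intensity in psms:
        counts[protein] += 1
    return dict(counts)


def summed_intensities(psms):
    """Intensity-based: keep the highest intensity seen for each peptide
    (an assumed rule), then sum the peptide intensities per protein."""
    best = {}
    for protein, peptide, intensity in psms:
        key = (protein, peptide)
        best[key] = max(best.get(key, 0.0), intensity)
    totals = defaultdict(float)
    for (protein, _peptide), intensity in best.items():
        totals[protein] += intensity
    return dict(totals)


counts = spectral_counts(psms)
intensities = summed_intensities(psms)
```

In this toy input, P1 is supported by three spectra but only two distinct peptides, so the two approaches weight the repeated PEPTIDEA observation differently.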
Results
Using quantms, we reanalyzed 83 public ProteomeXchange datasets (Figure), which included 29,354 instrument files from 13,132 human samples. The pipeline quantified 16,599 proteins based on 1.03 million unique peptides. quantms demonstrated superior performance and scalability compared to traditional tools like MaxQuant. For datasets exceeding 1,000 instrument files, quantms performed up to 40 times faster. Benchmarking against MaxQuant showed that quantms could quantify more proteins with similar accuracy, although it underestimated fold changes at low concentrations. In specific reanalyses, quantms identified more differentially expressed proteins and unique peptides across various tissues and conditions. Additionally, it successfully processed 118 human datasets, highlighting its robustness and efficiency.
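To make the notion of an underestimated fold change concrete, the following Python sketch compares an expected log2 fold change with an observed one. The abundance values and the helper name are hypothetical and are not taken from the benchmark.

```python
import math


def log2_fold_change(abundance_a, abundance_b):
    """Return the log2 ratio of two positive abundance values."""
    return math.log2(abundance_a / abundance_b)


# Hypothetical spike-in design: the protein is 4x more abundant in
# condition A than in condition B.
expected = log2_fold_change(4.0, 1.0)

# Hypothetical measured abundances for the same protein; the measured
# ratio is only 2x, so the fold change is underestimated.
observed = log2_fold_change(3.0, 1.5)

# Negative bias means the observed fold change is smaller than expected.
bias = observed - expected
```

A negative bias that grows as the spiked-in amount decreases would correspond to the behaviour described above for low concentrations.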
Conclusions
quantms addresses major bottlenecks in the field of proteomics by enabling automated, large-scale quantitative analysis in cloud and high-performance computing environments. Its modular, open-source nature allows for continuous updates and integration of new tools and workflows. quantms improves the reproducibility and portability of proteomics data analysis, making it a valuable resource for the proteomics community. Future developments will focus on expanding the repository of reanalyzed datasets and integrating protein expression profiles with other omics data to facilitate advanced biological insights.