Quentin Giai Gianetto (Paris / FR), Louka Chenal (Paris / FR), Mariette Matondo (Paris / FR)
Highlighting the proteins/peptides exhibiting significant quantitative differences between two biological conditions is the cornerstone of many quantitative proteomic experiments. Typically, this involves applying a statistical strategy to the quantitative dataset to extract the relevant proteins/peptides. Basic approaches include fold-change calculations and standard Student's t-tests. Various statistical tests have since emerged to refine such analyses and improve statistical power. Some adjust the denominator of the Student's t-statistic (e.g., Welch, SAM, LIMMA, BayesT, VarMixT, ROTS), while others are rank-based (e.g., Wilcoxon, LPE, RankProduct). More recently, methods from explainable artificial intelligence have also been developed to provide p-values (e.g., PIMP, NTA). Since the reproducibility crisis of the 2000s, skepticism about p-values has grown, owing to their reliance on hypothetical null distributions and the large number of replicates required for them to be reliable. This issue is particularly acute in proteomics, where experiments typically include few replicates. Alternatives within the Bayesian framework, using measures such as the region of practical equivalence (ROPE) or the probability of direction (PD), have recently emerged.
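To make the contrast between the two families of approaches concrete, the sketch below compares a frequentist Welch t-test with a bootstrap estimate of the Bayesian-style probability of direction (PD) for a single simulated protein. This is an illustrative sketch only, not the implementation used in the study: the intensity values, sample sizes, and bootstrap scheme are all invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated log2 intensities for one protein in two conditions
# (4 replicates each); values are invented purely for illustration.
cond_a = rng.normal(loc=20.0, scale=1.0, size=4)
cond_b = rng.normal(loc=25.0, scale=1.0, size=4)  # clear upward shift

# Frequentist route: Welch's t-test (unequal-variance t-statistic).
t_stat, p_value = stats.ttest_ind(cond_b, cond_a, equal_var=False)

# Bayesian-flavoured route: bootstrap the mean difference and report the
# probability of direction (PD), i.e. the share of resampled differences
# sharing the dominant sign.
boot_diffs = np.array([
    rng.choice(cond_b, size=4, replace=True).mean()
    - rng.choice(cond_a, size=4, replace=True).mean()
    for _ in range(5000)
])
pd_estimate = max((boot_diffs > 0).mean(), (boot_diffs < 0).mean())

print(f"Welch p-value: {p_value:.4g}, probability of direction: {pd_estimate:.3f}")
```

With only four replicates per condition, the p-value depends heavily on the assumed null distribution, whereas PD simply summarizes how consistently the resampled effect points in one direction, which is why such measures are attractive in low-replicate settings.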
Herein, we undertook a comprehensive analysis comparing many statistical tests and strategies on various simulated and real datasets to assess their similarities and differences. Additionally, we explored ways of combining strategies to maximize statistical power (or, equivalently, minimize the false non-discovery rate) while controlling the false discovery rate (FDR), using methods such as p-value fusion (e.g., Fisher's or Stouffer's methods) or data-driven strategies (machine-learning classification). To this end, we investigated FDR estimation when combining statistical strategies by estimating joint empirical cumulative distribution functions.
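As a minimal sketch of the p-value fusion step, the example below combines per-protein p-values from two hypothetical tests using SciPy's implementations of Fisher's and Stouffer's methods, then applies a hand-written Benjamini-Hochberg adjustment to control the FDR. The p-values are made up, and this is not the joint-ECDF estimator investigated in the study, only a baseline illustration of fusion followed by FDR control.

```python
import numpy as np
from scipy.stats import combine_pvalues

# Invented per-protein p-values from two different statistical tests.
pvals_test1 = np.array([0.001, 0.010, 0.040, 0.500])
pvals_test2 = np.array([0.004, 0.020, 0.300, 0.700])

# Fuse the two p-values of each protein with Fisher's and Stouffer's methods.
fused_fisher = np.array([
    combine_pvalues([p1, p2], method="fisher")[1]
    for p1, p2 in zip(pvals_test1, pvals_test2)
])
fused_stouffer = np.array([
    combine_pvalues([p1, p2], method="stouffer")[1]
    for p1, p2 in zip(pvals_test1, pvals_test2)
])

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(ranked, 0, 1)
    return adjusted

qvals = benjamini_hochberg(fused_fisher)
print("Fisher-fused q-values:", qvals.round(4))
```

Note that naively applying standard FDR control after fusion assumes the combined p-values are well calibrated; when the underlying tests are correlated, this assumption breaks down, which motivates estimating the FDR from joint empirical cumulative distribution functions instead.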
Our study proposes exploiting a variety of tests to derive the best possible strategy in difficult contexts, such as detecting small variations in abundance between conditions (small effect sizes). We present the most successful statistical strategies and the combinations providing the best statistical power while keeping the false discovery rate below a fixed threshold. Our comprehensive approach yields valuable insights applicable to any quantitative dataset characterized by a "large p, small n" framework, commonly encountered in proteomics but also in other quantitative omics data, such as transcriptomics or metabolomics.