Francesco Greco (Pisa / IT), Aldo Pastore (Pisa / IT), Michele Emdin (Pisa / IT), Liam A. McDonnell (Pisa / IT)
Background: Sample variability in proteomics arises from both biological and technical sources, and its impact can be reduced by increasing the sample size. Estimating the sample size required for (pre-)clinical experiments is essential to ensure their success. Although each protein detected in a proteomics experiment is measured with its own coefficient of variation (CV), sample size estimations are too frequently based on CV values from single proteins or on an average CV value. Here we calculate the number of samples needed to observe statistically significant differences for each protein in a dataset, in order to estimate the sample size adequate to statistically access most of the quantified proteins.
Methods: We performed sample size estimation on publicly available datasets by 1) using the power analysis formula with the Bonferroni correction; 2) generating an artificial dataset of true-positive differences and repeating the t-test after sample size reduction; and 3) reducing the sample size in empirical two-group statistical comparisons. Proteins accessible with smaller sample sizes were subjected to enrichment analysis to investigate the bias that a small sample size introduces into gene ontology analysis.
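As an illustration of approach 1, the following is a minimal sketch of a per-protein sample size calculation. It assumes the common normal-approximation formula for a two-sided, two-group comparison, n = 2 (z_(1-alpha/2) + z_(power))^2 sd^2 / delta^2, with alpha divided by the number of proteins tested (Bonferroni). The abstract does not state which exact formula was used, so this may differ from the authors' implementation. The function names, the default alpha of 0.05 and the default power of 0.8 are hypothetical choices made for the example; `sd` is taken to be the per-protein standard deviation on the log2 scale.

```python
import math
from statistics import NormalDist


def samples_per_group(sd, effect_size=1.0, alpha=0.05, power=0.8, n_tests=1):
    """Samples per group needed to detect `effect_size` for one protein.

    Uses the normal approximation for a two-sided, two-sample comparison.
    `sd` and `effect_size` are assumed to be on the same (log2) scale.
    """
    alpha_adj = alpha / n_tests               # Bonferroni-corrected significance level
    z = NormalDist()                          # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha_adj / 2)    # two-sided critical value
    z_power = z.inv_cdf(power)                # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) * sd / effect_size) ** 2)


def accessible_fraction(sds, n_per_group, **kwargs):
    """Fraction of proteins whose required sample size is <= n_per_group.

    `sds` holds one standard deviation per protein; the Bonferroni
    correction uses the number of proteins as the number of tests.
    """
    needed = [samples_per_group(sd, n_tests=len(sds), **kwargs) for sd in sds]
    return sum(n <= n_per_group for n in needed) / len(sds)
```

For example, `accessible_fraction(sds, 10)` would give the fraction of proteins in a dataset for which a 10 vs 10 design is sufficient under these assumptions, where `sds` is a list of per-protein standard deviations estimated from the data.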
Results: The fraction of statistically accessible proteins varied with the characteristic variance of each dataset. In general, 10 replicates per group (10 vs 10) was the minimum needed to detect an effect size of 1 (log2-transformed data) in 67-93% of the proteins in the datasets. Several KEGG pathways were highlighted simply as a consequence of choosing a small sample size, including Proteasome, Protein processing in endoplasmic reticulum, Prion disease, Parkinson disease, Alzheimer disease, Endocytosis, and Amyotrophic lateral sclerosis.
Conclusions: Choosing larger sample sizes is crucial because it makes larger fractions of the quantified proteins statistically accessible. An insufficient sample size may result in a lack of power, and may also introduce bias into the enrichment analysis.