Most research aiming to understand the molecular foundations of life and disease has focused on a limited set of increasingly well-known proteins. By contrast, thousands of human proteins remain "understudied": their biological function is poorly understood and annotation of their molecular properties is scarce. This annotation inequality hinders biomedical progress because mechanistic investigations of gene–disease associations typically focus on proteins that are already well known, a phenomenon also known as the street-light effect (Kustatscher et al, Nat Methods 2022, PMID 35534633).
We have previously shown that protein covariation (coexpression) analysis is a powerful proteomics approach for linking uncharacterised proteins to known cellular processes (Messner et al, Cell 2023, PMID 37080200; Kustatscher et al, Nat Biotechnol 2019, PMID 31690884). Here, we present a new covariation map of the human proteome. To assemble it, we developed a proteomics data processing pipeline that is fast and scalable and that includes false discovery rate (FDR) control appropriate for very large datasets. Built on the FragPipe platform, this pipeline enabled us to re-process 23,000 previously published MS runs in less than two weeks. The resulting dataset, ProteomeHD.2, covers the abundance changes of 16,000 proteins in response to 2,500 biological perturbations, quantified using SILAC labelling. It includes many microproteins for which proteomic evidence had so far been lacking (Kourtis et al, in preparation).
To determine which proteins have similar covariation patterns across ProteomeHD.2, and might thus be functionally related, we developed a machine-learning-based strategy. This strategy considers not only whether two proteins have correlated expression but also additional information from the proteomics data that reflects the robustness and reliability of the underlying protein quantitations. The resulting proteome covariation map reveals functional associations for about 10,000 human proteins. We show that it captures functional associations as effectively as more established techniques, such as affinity-purification MS experiments, allowing us to predict potential biological functions for many previously unidentified and understudied human proteins.
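To illustrate the general idea of pairing a correlation score with a reliability indicator, the following is a minimal sketch and not the method used for ProteomeHD.2. It assumes each protein is represented by a list of log2 SILAC ratios with `None` for missing values, and it uses the number of experiments in which both proteins were quantified as a stand-in reliability feature. The function name `pairwise_features`, the `min_shared` threshold and the toy data are all hypothetical. In a real setting, features of this kind would be passed to a classifier trained on known functional associations.

```python
from itertools import combinations
from math import sqrt


def pairwise_features(ratios, min_shared=3):
    """Compute illustrative per-pair features from protein ratio profiles.

    ratios: dict mapping protein ID -> list of log2 ratios (None = missing).
    Returns a dict mapping (protein_a, protein_b) -> feature dict.
    The feature choice is an assumption made for illustration only.
    """
    features = {}
    for a, b in combinations(sorted(ratios), 2):
        # Keep only experiments in which both proteins were quantified.
        shared = [(x, y) for x, y in zip(ratios[a], ratios[b])
                  if x is not None and y is not None]
        n = len(shared)
        if n < min_shared:
            continue  # too few shared observations to score the pair
        xs, ys = zip(*shared)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        syy = sum((y - mean_y) ** 2 for y in ys)
        if sxx == 0 or syy == 0:
            continue  # a constant profile has no defined correlation
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in shared)
        features[(a, b)] = {
            "pearson": sxy / sqrt(sxx * syy),  # covariation strength
            "n_shared": n,                     # crude reliability proxy
        }
    return features


# Toy profiles for three hypothetical proteins across five perturbations.
ratios = {
    "P1": [1.0, 2.0, 3.0, None, 4.0],
    "P2": [2.0, 4.0, 6.0, 1.0, 8.0],
    "P3": [3.0, None, 1.0, 0.5, None],
}
features = pairwise_features(ratios)
for pair, feats in features.items():
    print(pair, feats)
```

In this toy example the pair P1 and P3 is skipped because the two proteins share only two quantified experiments. This shows why a correlation value alone can mislead when data are sparse: the same coefficient deserves less trust when it rests on fewer shared measurements.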