Michał Burdukiewicz (Białystok / PL; Barcelona / ES), Carlos Pintado-Grima (Barcelona / ES), Oriol Bárcenas (Barcelona / ES), Valentín Iglesias (Barcelona / ES), Eva Arribas (Barcelona / ES), Salvador Ventura (Barcelona / ES)
Background
Proteins self-organize in dynamic cellular contexts by assembling into reversible biomolecular condensates through liquid-liquid phase separation (LLPS). These condensates can be composed of single or multiple proteins with different roles in the ensemble's structural and functional integrity. Driver proteins form condensates autonomously, without requiring any additional partner, whereas client proteins are recruited into them later and are not essential for maintaining the condensate architecture. Although several databases have been developed to catalog proteins undergoing LLPS, they often contain divergent data, reflecting distinct conceptions of the LLPS phenomenon. This inconsistency impedes interoperability between these resources and hinders the integrative use of their contents. Moreover, there is no apparent consensus on how to select proteins without any explicit experimental association with condensates (negative data). These two issues have prevented the generation of reliable predictive models and fair benchmarks.
Results
To alleviate these constraints, we carefully explored protein information from all relevant LLPS databases and generated high-confidence datasets of client and driver proteins through an integrated biocuration protocol. In addition, we introduce what are, to the best of our knowledge, the first standardized negative datasets, which include both globular and disordered proteins. To validate our datasets, we investigated physicochemical properties related to LLPS across the different sets of sequences and observed that specific features distinguish drivers from clients, and both from negative proteins (non-participants in condensates).
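As an illustration of the kind of per-dataset comparison described above, the following minimal Python sketch computes a few simple sequence-derived properties and averages them per label. It is not the protocol used in this work: the choice of features (charged fraction, net charge per residue, aromatic fraction, mean Kyte-Doolittle hydropathy), the function names `sequence_features` and `summarize`, and the toy sequences are all assumptions made for illustration only.

```python
from statistics import mean

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Return simple composition-based features for one protein sequence.

    Non-standard residue codes (e.g. X, B, Z) are ignored.
    """
    residues = [aa for aa in seq.upper() if aa in KD]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    positive = sum(aa in "KR" for aa in residues)
    negative = sum(aa in "DE" for aa in residues)
    aromatic = sum(aa in "FWY" for aa in residues)
    return {
        "fraction_charged": (positive + negative) / n,
        "net_charge_per_residue": (positive - negative) / n,
        "fraction_aromatic": aromatic / n,
        "mean_hydropathy": mean(KD[aa] for aa in residues),
    }


def summarize(datasets):
    """Average each feature over all sequences of each labeled dataset."""
    summary = {}
    for label, sequences in datasets.items():
        feats = [sequence_features(s) for s in sequences]
        summary[label] = {key: mean(f[key] for f in feats) for key in feats[0]}
    return summary


# Hypothetical toy sequences, NOT taken from the curated datasets.
toy_datasets = {
    "driver": ["GRGGYSGGRGGFGGSYGGRG", "SYSGYSQSTDTSGYGQSSYS"],
    "client": ["MKELEDKLRKAEEELKKLQE", "MSDEEKRKAIEELLKKGDYS"],
    "negative": ["MVLSPADKTNVKAAWGKVGA", "MKVLAAGIVGLCLAVSALAQ"],
}

summary = summarize(toy_datasets)
for label, feats in summary.items():
    print(label, {k: round(v, 3) for k, v in feats.items()})
```

In practice, the per-label feature distributions (rather than only their means) would be compared with an appropriate statistical test, and the input would be the full curated driver, client, and negative sets instead of toy sequences.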
Conclusions
The datasets presented here provide a reliable means of assessing the specific roles of proteins in LLPS and of identifying key differences in the physicochemical properties underlying phase separation. These high-confidence datasets can be used to train a new generation of multilabel models, build more standardized benchmarks, and mitigate sequence biases associated with the presence of intrinsically disordered regions (IDRs).