Tine Claeys (Ghent / BE), Yasset Perez Riverol (Cambridge / GB), Kris Gevaert (Ghent / BE), Juan Vizcaino (Cambridge / GB), Lennart Martens (Ghent / BE)
Large-scale data analysis in omics is becoming increasingly prevalent, driven by advances in both technical and data analysis methods. These advances produce vast datasets with significant potential for AI applications and multi-omics pipelines that can uncover novel biological knowledge. One essential factor in these endeavours is comprehensive metadata, which makes the data interpretable. Such metadata are, however, notably lacking in proteomics. The Sample and Data Relationship Format (SDRF) for Proteomics, introduced in 2020, maps data files to sample characteristics using ontology terms. Despite its potential, its complexity and the absence of easy-to-use annotation tools have limited its use. To address this issue, we have developed lesSDRF, a user-friendly, web-accessible application that facilitates the SDRF annotation process.
lesSDRF is designed for ease of use: it requires no installation and is available online via Streamlit. It includes all necessary ontologies for SDRF-Proteomics, eliminating the need to navigate the Ontology Lookup Service and ensuring adherence to standardized terminology. The application guides users through five steps: species selection, metadata file upload, labeling information, input of required columns, and input of additional columns, the latter two using ontology terms. This structured approach ensures SDRF-Proteomics compliance and user convenience. Importantly, lesSDRF relies heavily on community input, contributing to the development of new SDRF-Proteomics standards for emerging subfields within proteomics, including single-cell proteomics, metaproteomics, and metabolomics. The application is available at https://lessdrf.streamlit.app/
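As a rough illustration of what an SDRF-Proteomics annotation looks like, the sketch below writes a tab-separated table that maps a data file to sample characteristics and checks it for a handful of column headers. It is not the lesSDRF implementation: the column list is an illustrative subset assumed here rather than the full set required by the standard, and the helper names `to_sdrf` and `missing_columns`, as well as the `not available` placeholder for unknown values, are assumptions made for this example.

```python
import csv
import io
from typing import Dict, List

# Illustrative subset of SDRF-Proteomics headers (assumption: NOT the full
# set of columns required by the standard).
REQUIRED_COLUMNS = [
    "source name",
    "characteristics[organism]",
    "assay name",
    "comment[label]",
    "comment[data file]",
]


def to_sdrf(rows: List[Dict[str, str]], columns: List[str]) -> str:
    """Serialize row dictionaries as a tab-separated, SDRF-like table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        # Assumption: unknown values are written as "not available".
        writer.writerow([row.get(column, "not available") for column in columns])
    return buffer.getvalue()


def missing_columns(sdrf_text: str) -> List[str]:
    """Return the illustrative required headers absent from the table."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    header = [name.strip().lower() for name in (reader.fieldnames or [])]
    return [column for column in REQUIRED_COLUMNS if column not in header]


if __name__ == "__main__":
    # Hypothetical sample and file names, for illustration only.
    example_rows = [
        {
            "source name": "sample 1",
            "characteristics[organism]": "homo sapiens",
            "assay name": "run 1",
            "comment[label]": "label free sample",
            "comment[data file]": "run1.raw",
        }
    ]
    table = to_sdrf(example_rows, REQUIRED_COLUMNS)
    print(table)
    print("Missing columns:", missing_columns(table))
```

Each row describes one data file, so a check of this kind can flag an incomplete annotation before submission.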
Although it was only published in Nature Communications in October 2023, lesSDRF has already demonstrated a clear impact, with numerous datasets featuring lesSDRF-generated metadata annotation submitted to PRIDE in May 2024. Current developments include integration into the PRIDE submission pipeline, creation of a desktop application, and incorporation into automated proteomics reanalysis pipelines such as quantms, OpenMS, and WOMBAT-P. To minimize user effort, the developers of search engines and related tools such as FragPipe, PeptideShaker, ionbot, and MSAmanda are integrating the export of a Technical SDRF file into their tools. This feature provides a partially filled SDRF-Proteomics file with technical details in the desired format. Ongoing discussions aim to expand this effort, while MSstats and MSqRob support SDRF-Proteomics compatibility for experimental design files.
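To make the idea of a partially filled Technical SDRF file concrete, the following sketch combines technical rows, such as a search tool might export, with sample annotation supplied by the user, matching the two on the data file column. The function `merge_technical_sdrf`, the choice of `comment[data file]` as the join key, and the rule that technical values take precedence are all assumptions made for this example and do not describe how any of the tools named above behave.

```python
from typing import Dict, List


def merge_technical_sdrf(
    technical_rows: List[Dict[str, str]],
    sample_rows: List[Dict[str, str]],
    key: str = "comment[data file]",
) -> List[Dict[str, str]]:
    """Combine technical and sample annotation rows that share a data file.

    Assumption: each data file appears at most once in sample_rows, and
    technical values override sample values when both define a column.
    """
    samples_by_file = {row[key]: row for row in sample_rows}
    merged = []
    for technical in technical_rows:
        sample = samples_by_file.get(technical[key], {})
        # Later entries win, so technical details take precedence.
        merged.append({**sample, **technical})
    return merged


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    technical = [{"comment[data file]": "run1.raw", "comment[instrument]": "Q Exactive"}]
    samples = [{"comment[data file]": "run1.raw", "source name": "sample 1"}]
    for row in merge_technical_sdrf(technical, samples):
        print(row)
```

Rows without matching sample annotation keep only their technical columns, which leaves the remaining fields to be completed by the user.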
This metadata initiative extends beyond proteomics. Realizing the potential of multi-omics analysis requires a comprehensive metadata standard that spans fields. In collaboration with the Research Data Alliance, we are working to establish a multi-omics metadata standard that includes proteomics, genomics, metabolomics, phenomics, and beyond.