Mostafa Kalhor (Freising / DE), Falk Schimweg (Berlin / DE), Mario Picciani (Freising / DE), Lutz Fischer (Berlin / DE), Juri Rappsilber (Berlin / DE), Mathias Wilhelm (Freising / DE)
Introduction: Chemical cross-linking (XL) mass spectrometry (MS) is an effective tool for analyzing protein structure and protein-protein interactions. Here, we extend Prosit to predict fragment ion intensities for cleavable MS2 (CMS2), cleavable MS3 (CMS3), and non-cleavable MS2 (NMS2) cross-linked peptides, covering the best-known cross-linkers (DSSO, DSBU, and DSS/BS3). Integrating XL-Prosit into Oktoberfest, a data-driven rescoring pipeline, and rescoring with Percolator increased the number of confidently identified crosslink-spectrum matches (CSMs), unique cross-linked peptides (UXLs), and protein-protein (PP) interactions by up to 450%, 233%, and 265%, respectively, compared to xiSEARCH. XL-Prosit is available via the public prediction service Koina, and our rescoring pipeline is readily available, allowing immediate use by the community.
Methods: We systematically evaluated methods to fine-tune a pre-trained Prosit model, using transfer learning, so that it can make predictions for modifications unknown to the model. In addition, we improved the accuracy of the prediction model by calibrating collision energy across datasets and by augmenting the training data: because the model focuses on predicting the intensity pattern of one of the two cross-linked peptides, we swapped the positions of the two peptides. XL-Prosit was integrated into Oktoberfest (github.com/wilhelm-lab/oktoberfest) via Koina (koina.wilhelmlab.org) to extract novel intensity-based spectral similarity features for each unfiltered CSM reported by a search engine (Figure A). Subsequently, Percolator was applied to efficiently discriminate correct from incorrect CSMs. To this end, Percolator was trained at the peptide-spectrum match (PSM) level (alpha and beta peptides separately) rather than at the CSM level, and CSM-, UXL-, and PP-interaction-level scores were then generated for the final false discovery rate (FDR) calculation.
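The peptide-swapping augmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the XL-Prosit implementation: the `CrosslinkExample` record, its field names, and the `augment_by_swapping` helper are hypothetical, and the real training data carry more information (for example charge, collision energy, and cross-link position).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CrosslinkExample:
    """Hypothetical training record: the model predicts intensities for
    `target_seq`, with `partner_seq` supplied only as context."""
    target_seq: str
    partner_seq: str
    target_intensities: Tuple[float, ...]
    partner_intensities: Tuple[float, ...]


def swap_peptides(example: CrosslinkExample) -> CrosslinkExample:
    """Return the same cross-link with the two peptides exchanged."""
    return CrosslinkExample(
        target_seq=example.partner_seq,
        partner_seq=example.target_seq,
        target_intensities=example.partner_intensities,
        partner_intensities=example.target_intensities,
    )


def augment_by_swapping(examples: List[CrosslinkExample]) -> List[CrosslinkExample]:
    """Keep every original example and add its swapped counterpart."""
    augmented: List[CrosslinkExample] = []
    for example in examples:
        augmented.append(example)
        augmented.append(swap_peptides(example))
    return augmented


# Illustrative sequences and intensities only, not real data.
original = CrosslinkExample("PEPTIDEK", "ANOTHERK", (0.1, 0.9), (0.5, 0.2))
training_set = augment_by_swapping([original])
```

Under this assumed layout, each cross-linked pair contributes two training examples, one per peptide in the target role.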
Results: Raw files were downloaded from PRIDE and analyzed using pLink 2 and XlinkX. To date, 124k MS2 and 32k MS3 spectra of cleavable and 36k MS2 spectra of non-cleavable cross-linked peptides have been collected. Evaluated on a test dataset, our model reaches a prediction accuracy, measured as the Pearson correlation coefficient (PCC), exceeding 0.96, 0.94, and 0.93 for CMS3, CMS2, and NMS2, respectively (Figure B). To verify that our FDR estimate is well calibrated, we applied XL-Prosit to a well-controlled XL-MS standard dataset comprising hundreds of recombinant proteins that are systematically mixed for cross-linking. Our pipeline achieves lower experimentally validated FDRs (0.67%, 0.88%, and 1.52%) than Scout alone (1.18%, 1.13%, and 2.3%) while improving identification rates by 66%, 63%, and 30% for CSMs, UXLs, and PPs, respectively, at an estimated 1% FDR. Finally, we applied XL-Prosit to another large-scale dataset, which yielded increases of 450% in CSMs, 233% in UXLs, and 265% in PPs compared to xiSEARCH alone (Figure C), all below 1% FDR.
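For readers unfamiliar with the metric, the PCC reported above can be computed per spectrum as sketched below. This is a plain-Python illustration, assuming that the observed and predicted fragment intensities are already aligned ion by ion. The function name `pearson_correlation` and the example values are hypothetical and not taken from the study.

```python
import math
from typing import Sequence


def pearson_correlation(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation coefficient between two aligned intensity vectors."""
    if len(observed) != len(predicted):
        raise ValueError("intensity vectors must have the same length")
    if len(observed) < 2:
        raise ValueError("at least two fragment ions are required")

    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    covariance = sum((o - mean_obs) * (p - mean_pred)
                     for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)

    # A constant vector has no defined correlation.
    if var_obs == 0 or var_pred == 0:
        return float("nan")
    return covariance / math.sqrt(var_obs * var_pred)


# Illustrative intensities only, not real data.
observed = [0.10, 0.80, 0.30, 1.00]
predicted = [0.12, 0.75, 0.35, 0.95]
pcc = pearson_correlation(observed, predicted)
```

A value close to 1 indicates that the predicted intensity pattern closely follows the measured one.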