Alireza Nameni (Ghent / BE), Robbin Bouwmeester (Ghent / BE), Lennart Martens (Ghent / BE), Sven Degroeve (Ghent / BE)
In MS-based proteomics, search engines are used to match peptides to observed tandem MS spectra. Liquid chromatography coupled with mass spectrometry (LC-MS) generates complex data requiring specialized tools for interpretation. These tools calculate the peptide-spectrum-match (PSM) score, estimating the likelihood of a peptide being the source of an observed spectrum. Machine learning rescoring methods combine several scoring metrics into a single PSM score to differentiate true target PSMs from decoys (false matches). While more complex models like gradient tree boosting (XGBoost) increase PSM identification rates, they also risk overfitting the target proteome database, resulting in overly optimistic false discovery rate (FDR) statistics.
To investigate the potential overfitting of complex models, we used an entrapment sequence database approach. This method expands the target database with peptides from nine randomly shuffled versions of the target proteome to attract the majority of random (incorrect) target PSMs. This ensures that the majority of PSMs aligning with the true target database are indeed correct. We also implemented a feature subset selection algorithm to assess individual feature contributions to the overfitting of the rescoring model.
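The construction of such an entrapment database can be illustrated with a minimal sketch. It assumes that each shuffled version is obtained by randomly permuting the residues of every target protein; the abstract does not specify the shuffling scheme. The function name `build_entrapment_db`, the `ENTRAPMENT{k}_` accession prefix, and the two toy sequences are hypothetical and chosen only for illustration.

```python
import random


def build_entrapment_db(target, n_shuffles=9, seed=0):
    """Return the target proteins plus n_shuffles shuffled copies of each.

    target: dict mapping accession -> protein sequence.
    Assumption: shuffling is a random permutation of whole-protein residues.
    """
    rng = random.Random(seed)  # fixed seed so the database is reproducible
    combined = dict(target)    # keep the original target entries unchanged
    for k in range(1, n_shuffles + 1):
        for accession, sequence in target.items():
            residues = list(sequence)
            rng.shuffle(residues)
            # Hypothetical naming scheme that marks entrapment entries
            combined[f"ENTRAPMENT{k}_{accession}"] = "".join(residues)
    return combined


# Toy input: two made-up sequences standing in for a proteome
target = {"P1": "MKTAYIAKQRQISFVK", "P2": "MSHHWGYGKHNGPEHWHK"}
db = build_entrapment_db(target)
```

With nine shuffled versions, the combined database holds ten entries per target protein, so a random match is roughly nine times more likely to land on an entrapment sequence than on a true target sequence, assuming the shuffled sequences yield a comparable number of candidate peptides.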
The models were evaluated using spectrum files from eleven LC-MS experiments of varying sizes. We performed searches against the human UniProt database with four different search engines: Andromeda, Comet, MSGF, and MSGF. Rescoring was conducted using five algorithms of increasing complexity in fitting the rescoring function: logistic regression, multi-layer perceptron, linear support vector machine (LSVM), XGBoost, and random forest.
Increasing model complexity enhances PSM rescoring sensitivity, resulting in an average identification increase of about 11%. However, this sensitivity comes at the cost of higher entrapment FDR, with an average increase of about 500%. This suggests that complex models introduce a bias towards the target proteome database, leading to overly optimistic FDR estimates.
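One way an entrapment FDR could be estimated from accepted PSMs is sketched below. The abstract does not state the exact estimator, so the formula here is an assumption: with an entrapment portion `ratio` times the size of the target proteome, the number of incorrect matches hitting the true target portion is approximated as the entrapment count divided by `ratio`. The function name `entrapment_fdr` and the label strings are hypothetical.

```python
def entrapment_fdr(labels, ratio=9):
    """Estimate the fraction of incorrect PSMs among accepted target-portion PSMs.

    labels: iterable of "target" or "entrapment", one per accepted PSM.
    ratio:  size of the entrapment portion relative to the target proteome.
    Assumption: random matches spread over the database in proportion to size,
    so about n_entrapment / ratio incorrect matches hit the true target portion.
    """
    labels = list(labels)
    n_target = sum(1 for label in labels if label == "target")
    n_entrapment = sum(1 for label in labels if label == "entrapment")
    if n_target == 0:
        return 0.0  # nothing accepted from the target portion
    return (n_entrapment / ratio) / n_target


# Made-up counts for illustration only: 990 target and 90 entrapment PSMs
accepted = ["target"] * 990 + ["entrapment"] * 90
estimate = entrapment_fdr(accepted)
```

Comparing this estimate with the FDR reported by the rescoring tool at the same threshold indicates how optimistic the reported value is; the counts above are invented and do not come from the study.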
While complex models improve PSM identification, they also increase bias towards PSMs (including random incorrect matches) from the target proteome database, leading to overly optimistic FDR estimates. Our study underscores the need for careful evaluation of model complexity and feature selection to mitigate overfitting in PSM rescoring.