Jesus Castano (Saint-Hyacinthe / CA; Montreal / CA), Francis Beaudry (Saint-Hyacinthe / CA; Montreal / CA)
Accurate protein identification and comprehensive proteome coverage are fundamental to proteomics research. They strongly affect the detection of proteins in complex biological samples, which is critical for research in areas such as medicine, biofuels, and agriculture. Bottom-up proteomics using mass spectrometry relies on an accurate comparison, performed by a search engine, between experimental spectra and in silico spectra derived from proteome databases. Over the last decade, machine learning (ML) and deep learning (DL) approaches have revolutionized this task: traditional comparisons against theoretical spectra with unit-intensity peaks have given way to comparisons against spectra whose peak intensities are predicted by ML and DL models. Furthermore, ML approaches integrate multiple matching features that were underutilized by search engines, such as missed cleavages, retention time, and mass accuracy, into a single score, in a process known as rescoring. Multiple rescoring functions have been developed in recent years, showing their potential to increase peptide identifications. In this study, we compared several rescoring functions, offering insights into their performance and discussing caveats that might guide their selection for a specific proteomic experiment. We analyzed HeLa digest samples in a top-20 data-dependent acquisition (DDA) experiment using a 50-cm separation column. The results were analyzed with MaxQuant at 100% FDR, and the outputs were rescored using three rescoring platforms: Oktoberfest, MS2Rescore, and inSPIRE. The number of identified peptides increased substantially with all platforms (40-53%), with inSPIRE yielding the highest increase. Most of the peptides originally identified by MaxQuant were retained after rescoring.
However, a small percentage (3.5-5%) was lost, mainly because these peptides carried post-translational modifications that the rescoring functions did not take into account. This could be an important limitation depending on the user's needs. Another drawback is that rescoring increases processing time by up to 80%, not including the time spent on manual manipulations. In addition, the need for command-line software at this step can discourage less experienced users. Our results indicated that the gain in peptide identifications was related to the ability of the rescoring functions to detect a higher proportion of highly charged precursors and peptides with more missed cleavages, an improvement that mainly leveraged features from peak intensity prediction. In summary, rescoring functions are an important tool for significantly boosting peptide identifications, which makes their implementation essential in proteomic pipelines. Their wide application, however, is still limited by the small number of amino acid modifications supported, the increased analysis time, and the lack of integrated user-friendly platforms.
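As an illustration of why the search output is exported at 100% FDR before rescoring, the following minimal Python sketch applies a generic target-decoy q-value filter to a list of scored peptide-spectrum matches. It is not the procedure implemented by MaxQuant, Oktoberfest, MS2Rescore, or inSPIRE. The function names, the toy scores, and the 1% threshold are assumptions made for this example only. The idea is that all candidate matches, including decoys, are kept so that a rescoring function can reorder them before the final FDR cutoff is applied.

```python
def q_values(psms):
    """Estimate q-values by target-decoy competition.

    psms: list of (score, is_decoy) tuples, where a higher score is better.
    Returns a list of (score, is_decoy, q_value) sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate at each rank: decoys / targets seen so far.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the lowest FDR at which this match would still be accepted,
    # i.e. the minimum FDR at this rank or any lower-scoring rank.
    qs = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qs)]


def count_targets(psms, threshold=0.01):
    """Count target matches accepted at the given q-value threshold."""
    return sum(
        1 for _, is_decoy, q in q_values(psms) if not is_decoy and q <= threshold
    )


# Toy data (hypothetical scores): one decoy ranked third out of five matches.
toy_psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, False)]
accepted = count_targets(toy_psms, threshold=0.01)
```

In this toy case only the two targets ranked above the decoy pass the 1% threshold. A rescoring function that pushed the decoy below the remaining targets would let all four targets pass, which is the kind of gain the comparison above measures.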