Shanji Zhang (Seoul / KR), Seunghyuk Choi (Houston, TX / US), Seungjin Na (Cheongju / KR), Eunok Paek (Seoul / KR)
De novo peptide sequencing via tandem mass spectrometry can discover novel, previously unknown peptides, yet its practical application is hampered by a lack of statistical validation. We introduce NovoCert, a method that combines semi-supervised learning with statistical techniques to validate peptide-spectrum matches (PSMs) inferred by de novo sequencing. NovoCert validates high-confidence peptides independently, reducing reliance on the scores reported by de novo peptide sequencing tools. First, PSMs were obtained by de novo sequencing of tandem mass spectra. The 'exact' group was determined by aligning the de novo results with the reference protein sequences. For each PSM in the exact group, we generated a corresponding decoy PSM by de novo sequencing a 'reverse-shifted (RS)' spectrum: (1) a decoy peptide is produced by reversing the target peptide; (2) the target spectrum is converted into a decoy spectrum by moving its fragment ions to the m/z values computed for the decoy peptide. PSMs in the exact group were then filtered at a 1% false discovery rate (FDR) estimated using Percolator. To justify the decoy-generation method, we used the ProteomeTools synthetic peptide dataset (PXD004732), with Comet for database search and PEAKS for de novo peptide sequencing. We compared the distributions of XCorr, precursor mass error, IonFrac, and deltaCn between two sets of results: target spectra searched against DecoyDB (reversed human protein sequences), and decoy spectra generated by the RS and PS (Precursor Swap) methods searched against TargetDB (reference sequences of Homo sapiens). We adopted the RS method because its distributions were more similar than those of the PS method to the distributions obtained from target spectra searched against DecoyDB. Using PEAKS results, NovoCert identified 70% of PSMs (92.2% at the peptide level) within the exact group (2,438,270 total PSMs) at 1% FDR.
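The two-step construction of a reverse-shifted spectrum can be illustrated with a minimal Python sketch. It is not the NovoCert implementation; several details are assumptions made only for illustration: the whole peptide is reversed (no residue is held fixed), only singly charged b- and y-ions of unmodified residues are considered, a peak counts as a fragment ion if it lies within a hypothetical tolerance `tol` of a theoretical m/z, and a moved peak keeps its original mass error. The function names `fragment_mz` and `reverse_shift` are likewise hypothetical.

```python
# Monoisotopic residue masses in Da (unmodified amino acids only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_mz(peptide):
    """Return singly charged b/y ion m/z values keyed by (ion type, index)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {}
    for i in range(1, len(peptide)):
        ions[("b", i)] = sum(masses[:i]) + PROTON
        ions[("y", i)] = sum(masses[-i:]) + WATER + PROTON
    return ions


def reverse_shift(peptide, peaks, tol=0.02):
    """Build a decoy peptide and spectrum from a target PSM.

    peaks is a list of (m/z, intensity) pairs. Peaks that match a target
    fragment ion within tol (assumed value, in Da) are moved to the m/z of
    the same ion of the reversed peptide; all other peaks are left in place.
    """
    decoy = peptide[::-1]  # step (1): reverse the target peptide
    target_ions = fragment_mz(peptide)
    decoy_ions = fragment_mz(decoy)
    shifted = []
    for mz, intensity in peaks:
        if target_ions:
            # Nearest theoretical target ion to this observed peak.
            key, theo = min(target_ions.items(), key=lambda kv: abs(kv[1] - mz))
            if abs(theo - mz) <= tol:
                # Step (2): move the peak, preserving its mass error.
                mz = decoy_ions[key] + (mz - theo)
        shifted.append((mz, intensity))
    return decoy, sorted(shifted)


if __name__ == "__main__":
    decoy, spectrum = reverse_shift(
        "PEK", [(98.060036, 10.0), (147.112801, 20.0), (300.0, 5.0)]
    )
    print(decoy, spectrum)
```

Because reversal leaves the amino acid composition unchanged, the decoy peptide has the same precursor mass as the target, so only fragment ion positions need to move.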
We compared these results with an existing FDR estimation method for de novo sequencing, which we call the ReverseDB method. The ReverseDB method estimates FDR by matching de novo results against a reversed protein sequence database to obtain decoy PSMs. In this comparison, NovoCert was more conservative and more reliable than the ReverseDB method: its identifications were better in terms of spectral angle, delta retention time, number of annotated peaks, and precursor mass error.
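For readers unfamiliar with decoy-based FDR estimation, the following minimal Python sketch shows how a score cutoff at a given FDR can be derived from a ranked list of target and decoy PSMs. It uses plain decoy counting (estimated FDR = decoys / targets above the cutoff), which is an assumption for illustration only; it does not reproduce Percolator's semi-supervised rescoring or the exact estimator used by NovoCert or the ReverseDB method. The function name `fdr_threshold` and the input layout are hypothetical.

```python
def fdr_threshold(psms, fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is at most fdr.

    psms is a list of (score, is_decoy) pairs, where a higher score is better.
    The FDR at a cutoff is estimated as (#decoys / #targets) among PSMs
    scoring at or above it. Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr:
            cutoff = score  # keep the most permissive cutoff seen so far
    return cutoff


if __name__ == "__main__":
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(fdr_threshold(example, fdr=0.01))
```

A method is more conservative in this sense when, at the same nominal FDR, its decoys score closer to the targets and therefore push the cutoff higher.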