Tingpeng Yang (Shenzhen / CN), Tianze Ling (Beijing / CN), Boyan Sun (Beijing / CN), Zhendong Liang (Shenzhen / CN), Fan Xu (Shenzhen / CN), Xiansong Huang (Shenzhen / CN), Linhai Xie (Beijing / CN), Yonghong He (Shenzhen / CN), Leyuan Li (Beijing / CN), Fuchu He (Beijing / CN), Yu Wang (Shenzhen / CN), Cheng Chang (Beijing / CN)
De novo peptide sequencing is the mainstream technique for identifying novel peptides, which makes improving the performance of state-of-the-art models important. The gap between amino acid recall and peptide recall is usually large because a single incorrectly parsed amino acid makes the entire peptide incorrect, and this presents a significant challenge in de novo peptide sequencing. Therefore, improving de novo peptide sequencing models requires not only a higher amino acid recall but also a smaller gap between amino acid recall and peptide recall. Here, inspired by the idiom "Many hands make light work", we introduce the bidirectional parsing system (generating peptides in two directions: from C-terminal to N-terminal and from N-terminal to C-terminal) to narrow this gap and enhance the peptide recall of de novo peptide sequencing models. We first developed two candidate bidirectional parsing systems, namely the encoder-independent (EI) and encoder-shared (ES) systems, and demonstrated through a series of ablation experiments that they enhance the performance of de novo peptide sequencing models (e.g., Casanovo and π-HelixNovoV2). Meanwhile, we compared the performance of the EI and ES systems and determined the details of the bidirectional parsing system. Then, we applied the system to our previous model, π-HelixNovoV1, and upgraded it to π-HelixNovoV2. Afterward, we demonstrated through a series of comparative experiments that π-HelixNovoV2 outperforms other state-of-the-art models (i.e., PEAKS, pNovo, DeepNovo, PointNovo, Casanovo, π-HelixNovoV1, NovoB, PepNet and GraphNovo). Moreover, we trained a more powerful π-HelixNovoV2 on a larger training dataset, and, as expected, it achieves unprecedented performance, even for peptide-spectrum matches with never-before-seen peptide sequences and for mass spectra affected by the missing fragmentation problem.
We also used the powerful π-HelixNovoV2 to identify antibody peptides, multi-enzyme cleavage peptides, and non-enzymatic peptides; it demonstrates robust and superior performance in these applications, with notable further improvements achieved through fine-tuning. Finally, we proposed a quality control strategy for de novo peptide sequencing, under which π-HelixNovoV2 significantly improves the sensitivity of peptide identification. We also applied π-HelixNovoV2 with the quality control strategy to sequence gut metaproteome peptides de novo, and the results show that π-HelixNovoV2 increases the identification coverage and accuracy of the gut metaproteome and enhances its taxonomic resolution. Our results demonstrate the effectiveness of π-HelixNovoV2 and represent a significant step forward in de novo peptide sequencing.
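
To illustrate why amino acid recall and peptide recall can diverge so sharply, the following minimal sketch computes both metrics on a toy example. It is an illustration under stated assumptions, not the evaluation procedure used in this work: amino acids are compared by exact position-wise string matching (published benchmarks typically match residues by mass within a tolerance), the function name `recall_metrics` is hypothetical, and the peptide sequences are invented for demonstration.

```python
from typing import List, Tuple


def recall_metrics(truth: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Return (amino_acid_recall, peptide_recall).

    Assumption: simplified exact position-wise matching of one-letter
    residue codes; real evaluations usually use mass-tolerance matching.
    """
    if not truth or len(truth) != len(predicted):
        raise ValueError("truth and predicted must be non-empty and equally long")

    total_residues = 0
    matched_residues = 0
    matched_peptides = 0
    for true_seq, pred_seq in zip(truth, predicted):
        total_residues += len(true_seq)
        # Count residues that agree at the same position.
        matched_residues += sum(1 for a, b in zip(true_seq, pred_seq) if a == b)
        # A peptide counts only if every residue is correct.
        if true_seq == pred_seq:
            matched_peptides += 1

    if total_residues == 0:
        raise ValueError("truth sequences must contain at least one residue")
    return matched_residues / total_residues, matched_peptides / len(truth)


# Toy data (invented): three of four predictions contain one wrong residue.
truth = ["PEPTIDEK", "ACDEFGHK", "LMNPQRSK", "TVWYACDK"]
predicted = ["PEPTLDEK", "ACDEFGHK", "LMNPQRTK", "TVWYACDR"]

aa_recall, peptide_recall = recall_metrics(truth, predicted)
print(f"amino acid recall: {aa_recall:.3f}")   # 29 of 32 residues -> 0.906
print(f"peptide recall:    {peptide_recall:.3f}")  # 1 of 4 peptides -> 0.250
```

In this toy case a single wrong residue in each of three peptides leaves amino acid recall above 90% while peptide recall falls to 25%, which is the kind of gap the bidirectional parsing system is intended to narrow.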