Metaproteomics is a powerful approach for characterizing how microbiota function by analyzing their protein content with tandem mass spectrometry. Given the complexity of these samples, accurately assessing their taxonomic composition without prior information, based solely on peptide sequences, remains a challenge. Here, we present LineageFilter, a new Python-based AI software tool for refined proteotyping of complex samples using raw metaproteomics data and machine learning. Given a tentative list of taxa, their abundances, and the scores associated with their identified peptides, LineageFilter computes a comprehensive set of features for each identified taxon at all taxonomic ranks. Its machine-learning model then assesses the likelihood of each taxon's presence based on these features, enabling efficient filtering of false-positive taxa.
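The general idea of feature-based taxon filtering can be illustrated with a minimal Python sketch. It is not LineageFilter's implementation: the `PeptideHit` record, the three features, the logistic weights in `toy_presence_probability`, and the 0.5 threshold are all hypothetical choices made for illustration, and the sketch works at a single taxonomic rank rather than across all ranks. A trained model would replace the hand-written scoring function.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import exp
from statistics import mean


@dataclass
class PeptideHit:
    """Hypothetical record for one identified peptide (not LineageFilter's format)."""
    taxon: str       # candidate taxon the peptide was assigned to
    sequence: str    # peptide sequence
    score: float     # identification score, higher is better


def taxon_features(hits):
    """Aggregate peptide hits into a small feature dict per candidate taxon."""
    by_taxon = defaultdict(list)
    owners = defaultdict(set)  # peptide sequence -> taxa sharing that peptide
    for hit in hits:
        by_taxon[hit.taxon].append(hit)
        owners[hit.sequence].add(hit.taxon)

    features = {}
    for taxon, taxon_hits in by_taxon.items():
        sequences = {h.sequence for h in taxon_hits}
        # A peptide is taxon-specific if no other candidate taxon shares it.
        specific = [s for s in sequences if len(owners[s]) == 1]
        features[taxon] = {
            "n_peptides": len(sequences),
            "n_specific_peptides": len(specific),
            "mean_score": mean(h.score for h in taxon_hits),
        }
    return features


def toy_presence_probability(f):
    """Stand-in for a trained model: logistic function with made-up weights."""
    z = (-3.0
         + 0.5 * f["n_peptides"]
         + 1.0 * f["n_specific_peptides"]
         + 0.02 * f["mean_score"])
    return 1.0 / (1.0 + exp(-z))


def filter_taxa(hits, model=toy_presence_probability, threshold=0.5):
    """Keep taxa whose estimated presence probability reaches the threshold."""
    kept = {}
    for taxon, f in taxon_features(hits).items():
        probability = model(f)
        if probability >= threshold:
            kept[taxon] = probability
    return kept


# Invented example data: one peptide is shared between the two candidates.
example_hits = [
    PeptideHit("Escherichia coli", "AAAK", 60.0),
    PeptideHit("Escherichia coli", "LLGR", 55.0),
    PeptideHit("Escherichia coli", "VVTK", 70.0),
    PeptideHit("Bacillus subtilis", "AAAK", 60.0),
]
kept_taxa = filter_taxa(example_hits)
```

In this invented example, the taxon supported only by a shared peptide falls below the threshold and is discarded, while the taxon with taxon-specific peptides is retained. Passing a different callable as `model` is how a classifier trained on labelled datasets would be plugged in.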
LineageFilter (LF) is a robust and powerful tool for improving the identification of the taxonomic composition of complex multispecies samples, paving the way for better protein and functional annotation in metaproteomics. The versatility of LF is demonstrated by its effectiveness in handling noisy data, which allows a higher number of proteins to be validated at a 1% false discovery rate (FDR). As presented, the LF method could easily be integrated into external tools and pipelines that support large-scale data analysis and currently rely on Unipept, to improve proteotyping sensitivity and recall. Furthermore, although default settings are recommended, LF can easily be customized with other parameters, as long as the model is trained on datasets filtered at the required PSM confidence.
In conclusion, using LF prior to further metaproteomics analysis enhances the efficiency, and particularly the precision, of standard metaproteomics tools. When applied to the proteotyping of metaproteomics datasets, it allows a better description of how microbiota function by enlarging the list of identified proteins and confirming their taxonomic origins.