Abstract
Small proteins of fewer than 100 amino acids, and particularly those of fewer than 50, remain largely unexplored. Nonetheless, they are an essential and often ignored part of the bacterial genetic repertoire. In recent years, the development of ribosome profiling protocols has led to the discovery of an increasing number of previously unknown small proteins. Despite this, they are still frequently overlooked by automated genome annotation pipelines, and often no functional description can be assigned because known homologs are lacking. To understand and overcome these limitations, the current abundance of small proteins in existing databases was evaluated, and a new dedicated database for small proteins and their potential functions was created.
To this end, small proteins were extracted from annotated bacterial genomes in the GenBank database. They were then quality-filtered, compared, and complemented with proteins from the Swiss-Prot, UniProt, and SmProt databases to ensure reliable identification and characterization of small proteins. Families of similar small proteins were created using bidirectional best BLAST hits followed by Markov clustering.
Analysis of small proteins in public databases revealed that their number remains limited owing to historical and technical limitations. Additionally, functional descriptions were often missing despite the presence of potential homologs. As expected, the data showed a taxonomic bias, with clinically relevant bacteria overrepresented. The new, comprehensive database is accessible via a feature-rich website that provides specialized search features for high-quality sORFs and small proteins. It also includes Hidden Markov Models for small protein families and information on taxonomic distribution and physicochemical properties.
In conclusion, the novel small protein database sORFdb is a specialized, taxonomy-independent database that improves the findability of sORFs, small proteins, and their functions in bacteria, thereby supporting their future detection and consistent annotation. All sORFdb data is freely accessible at https://sorfdb.computational.bio.
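As an illustration of the family-building step mentioned above, the following minimal sketch shows how bidirectional best hits could be derived from all-versus-all BLAST results. It assumes the hits are already parsed into `(query, subject, bitscore)` tuples; the function names and the toy data are hypothetical and not taken from the sORFdb pipeline, and the subsequent Markov clustering step is not shown.

```python
from typing import Dict, Iterable, Set, Tuple

Hit = Tuple[str, str, float]  # (query id, subject id, bit score); assumed input layout


def best_hits(hits: Iterable[Hit]) -> Dict[str, Tuple[str, float]]:
    """Return, for every query, its highest-scoring non-self subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, bitscore in hits:
        if query == subject:
            continue  # ignore self-hits from the all-versus-all search
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def bidirectional_best_hits(hits: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return unordered pairs (a, b) where a's best hit is b and b's best hit is a."""
    best = best_hits(hits)
    pairs: Set[Tuple[str, str]] = set()
    for query, (subject, _score) in best.items():
        if subject in best and best[subject][0] == query:
            # store each pair once, in sorted order
            pairs.add(tuple(sorted((query, subject))))
    return pairs


if __name__ == "__main__":
    # Toy data (hypothetical identifiers and scores)
    toy_hits = [
        ("p1", "p2", 50.0), ("p2", "p1", 48.0),
        ("p1", "p3", 30.0), ("p3", "p1", 30.0),
        ("p3", "p4", 40.0), ("p4", "p3", 41.0),
    ]
    print(sorted(bidirectional_best_hits(toy_hits)))
```

The resulting pairs could then serve as edges of a similarity graph passed to a Markov clustering tool; edge weighting and clustering parameters are outside the scope of this sketch.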