Krystyna Grzesiak (Wrocław / PL; Białystok / PL), Jakub Kała (Białystok / PL), Małgorzata Bogdan (Wrocław / PL), Michał Burdukiewicz (Białystok / PL)
Proteins and peptides function effectively because they adopt specific spatial conformations. However, experimentally determining these structures is both costly and time-consuming. To accelerate this process, computational methods are often employed, leveraging the readily available amino acid sequences of proteins. Training deep learning models on such data, however, necessitates a large volume of annotated sequences. For peptide-specific models, where experimentally labeled sequences are scarce, classical statistical methods are preferred. Unfortunately, current state-of-the-art methods for peptide property prediction often fail to fully utilize the information embedded in amino acid sequences due to inefficient feature representations.
Classical methods such as generalized linear models (GLMs) require structured data, so we focus on k-mer representations of proteins. These representations can be encoded as binary vectors indicating the presence or absence of each contiguous k-mer, or as integer vectors of k-mer counts. The protein properties we aim to predict span several data types: binary (e.g., presence of disordered regions), categorical (e.g., subcellular location), and continuous (e.g., minimum inhibitory concentration of antimicrobial peptides).
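As a minimal sketch of the two encodings described above, the following Python snippet turns sequences into fixed-length k-mer vectors, either as counts or as presence/absence indicators. The function names `kmer_counts` and `kmer_matrix`, the 20-letter alphabet, and the example sequences are illustrative assumptions, not part of the presented work.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k):
    """Count every contiguous (overlapping) k-mer in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_matrix(sequences, k, binary=False):
    """Return (vocabulary, rows); each row is one sequence's feature vector."""
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    rows = []
    for seq in sequences:
        counts = kmer_counts(seq, k)
        row = [counts.get(kmer, 0) for kmer in vocabulary]
        if binary:
            # Presence/absence encoding instead of raw counts.
            row = [int(c > 0) for c in row]
        rows.append(row)
    return vocabulary, rows


# Hypothetical toy sequences, used only to illustrate the encoding.
toy = ["ACACA", "KLVKL"]
vocab, count_rows = kmer_matrix(toy, k=2)
_, binary_rows = kmer_matrix(toy, k=2, binary=True)
```

The vocabulary grows as 20^k, which is why k-mer data quickly become high-dimensional and why feature selection matters for the models discussed here.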
Our objective is to clarify the relationship between these protein properties and their k-mer representations. To achieve this, we introduce a data simulation framework and methods for identifying motifs in k-mer data using information criteria such as mBIC2 and regularized models. We benchmarked a range of feature selection techniques, including the Fast Correlation-Based Filter (FCBF) and QuiPT, in combination with GLM-based regularization. The benchmarks identified the best-performing feature selection methods for high-dimensional k-mer data, providing a framework for predicting protein properties and detecting motifs.
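To illustrate the general idea of simulating k-mer data with a known motif and then checking whether a selection method recovers it, here is a minimal sketch. It is not the simulation framework or any of the methods named above: the ranking step is a deliberately simple stand-in (difference in k-mer presence frequency between classes), and the names `simulate` and `rank_kmers`, the motif `KLV`, and all parameter values are hypothetical.

```python
import random
from collections import Counter

# Assumption: the 20 standard amino acids form the alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def simulate(n_per_class, length, motif, seed=0):
    """Generate random sequences; positives get the motif planted once."""
    rng = random.Random(seed)
    sequences, labels = [], []
    for label in (1, 0):
        for _ in range(n_per_class):
            seq = [rng.choice(AMINO_ACIDS) for _ in range(length)]
            if label == 1:
                start = rng.randrange(length - len(motif) + 1)
                seq[start:start + len(motif)] = motif
            sequences.append("".join(seq))
            labels.append(label)
    return sequences, labels


def rank_kmers(sequences, labels, k):
    """Rank k-mers by presence frequency in positives minus negatives.

    A simple filter used as a stand-in; it is not mBIC2, FCBF, or QuiPT.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos, neg = Counter(), Counter()
    for seq, label in zip(sequences, labels):
        present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        (pos if label else neg).update(present)
    scores = {
        kmer: pos[kmer] / n_pos - neg[kmer] / n_neg
        for kmer in set(pos) | set(neg)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sequences, labels = simulate(n_per_class=100, length=20, motif="KLV")
ranking = rank_kmers(sequences, labels, k=3)
top_kmer, top_score = ranking[0]
```

Because the ground-truth motif is known by construction, the output of any selection method can be scored against it, which is the basic logic behind benchmarking on simulated data.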