Advances in mass spectrometry (MS) technologies have revolutionized the quantification of proteins, their modifications, and metabolites in biological samples, significantly accelerating biomedical research. However, MS-based proteomic and metabolomic datasets are often plagued by missing values that arise from a complex interplay of missing at random (MAR) and missing not at random (MNAR) mechanisms. If not properly addressed, such missing data can lead to both false-positive and false-negative results in downstream analyses and interpretations. Existing imputation methods typically target either MAR, using approaches such as Bayesian PCA, random forest, and k-nearest neighbors, or MNAR, using left-censored methods such as quantile regression imputation of left-censored data (QRILC), deterministic minimum (MinDet), and probabilistic minimum (MinProb). Few techniques effectively address both MAR and MNAR within a single framework, and those that do often require manual tuning of the mixture percentage between MNAR and MAR or rely on specific two-group experimental designs. These requirements are ill-suited to many biomedical studies, which involve large, heterogeneous patient samples and often require applying machine learning techniques to complete data matrices.
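To make the distinction between the two mechanisms concrete, the following minimal sketch simulates a mixture of MAR and MNAR missingness on a synthetic log-intensity matrix and then applies a simplified MinDet-style fill. It is an illustration only, not the msBayesImpute model and not the reference MinDet implementation: the function names (`simulate_missingness`, `impute_min_det`), the logistic dropout curve, and all parameter values are assumptions chosen for the example.

```python
import numpy as np


def simulate_missingness(x, mar_rate=0.05, mnar_slope=2.0, seed=0):
    """Return a copy of x with NaNs drawn from a MAR/MNAR mixture.

    Assumption: MNAR dropout follows a logistic curve centred on the
    20th percentile of x, so low intensities are more likely to vanish.
    """
    rng = np.random.default_rng(seed)
    midpoint = np.quantile(x, 0.2)
    # MNAR: dropout probability increases as the intensity decreases.
    p_mnar = 1.0 / (1.0 + np.exp(mnar_slope * (x - midpoint)))
    mnar_mask = rng.random(x.shape) < p_mnar
    # MAR: dropout is independent of the intensity itself.
    mar_mask = rng.random(x.shape) < mar_rate
    out = x.copy()
    out[mnar_mask | mar_mask] = np.nan
    return out, mnar_mask, mar_mask


def impute_min_det(x_missing, q=0.01):
    """Simplified MinDet-style fill: each missing entry receives the
    q-quantile of the observed values in its sample (column)."""
    filled = x_missing.copy()
    col_floor = np.nanquantile(x_missing, q, axis=0)
    rows, cols = np.where(np.isnan(x_missing))
    filled[rows, cols] = col_floor[cols]
    return filled


# Synthetic log2 intensities: 500 features (rows) x 12 samples (columns).
rng = np.random.default_rng(42)
x = rng.normal(loc=20.0, scale=2.0, size=(500, 12))

x_missing, mnar_mask, mar_mask = simulate_missingness(x)
x_filled = impute_min_det(x_missing)

print("fraction missing:", np.isnan(x_missing).mean())
print("mean true intensity, MNAR-dropped:", x[mnar_mask].mean())
print("mean true intensity, MAR-dropped: ", x[mar_mask].mean())
```

In this toy setting, the MNAR-dropped entries come predominantly from the low end of the intensity distribution, whereas the MAR-dropped entries are spread across the full range. A minimum-based fill is therefore a plausible guess for the former but biases the latter downward, which is the mismatch that motivates handling both mechanisms jointly.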
To improve the handling of missing values in MS-based proteomic and metabolomic data from biomedical samples, we developed msBayesImpute, a novel computational method that combines Bayesian factorization with a probabilistic dropout model. We benchmarked msBayesImpute against several popular imputation methods on both simulated and real-world datasets, where it showed superior performance in reconstructing missing values or identifying differentially expressed proteins/metabolites across various levels of missingness. The model employs stochastic variational inference, which offers considerable speed advantages over Bayesian PCA and missForest, especially on large datasets. Importantly, msBayesImpute does not require predefined experimental design information, such as group structures, making it highly adaptable to various study designs and compatible with any downstream analytical tool. This versatility makes msBayesImpute an effective and robust tool for enhancing the utility of MS datasets in biological research.