Rapid advances in mass spectrometry and related technologies have greatly extended the depth of coverage in proteomics studies. However, handling proteins with missing values, often reported as "NA" (not available), remains challenging, particularly as sample numbers grow rapidly. An "NA" can be interpreted as evidence of no expression, expression below the detection threshold, or a false negative caused by technical issues. Distinguishing among these possibilities will help us analyze complex proteome organization in various biological contexts.
In the current study, we first performed a comprehensive profiling of NA values in the proteomic datasets of over one thousand cell lines. We observed hundreds of proteins with no NA values, that is, proteins detected in all cell lines. These universal proteins are enriched in RNA processing pathways, suggesting essential roles in diverse cell types. Interestingly, a simple binary transformation of NA values performed robustly in stratifying cell lines by suspension versus adherent culture conditions, and it also effectively separated non-small cell lung cancer (NSCLC) cell lines from small cell lung cancer (SCLC) cell lines. This prompted us to develop NA deconvolution analysis (NADA), a machine learning algorithm that extracts and integrates the features of neighboring samples and proteins to classify each NA value as "biological" or "technical". NADA successfully identified characteristic proteins with binary expression patterns that classify cell lines by tissue of origin. For example, hematological and lymphoid cell lines, unlike cell lines that grow on solid support, specifically lack proteins involved in cell adhesion and cell junction assembly pathways, but are uniquely enriched in immune-related receptors, chemokines, and transcription factors. Moreover, NADA facilitated the development of a missing-value-weighted method that improved protein-protein interaction (PPI) analyses via co-expression profiling. Finally, we applied NADA to proteomics datasets of extracellular vesicles (EVs) and identified differential PPI networks in EVs derived from cancerous and normal tissues. In conclusion, we present NADA as a powerful tool to delineate NA values and extract biological insights from expanding proteomics datasets.
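
The binary transformation of NA values and the notion of universally detected proteins can be illustrated with a minimal Python sketch. This is not the NADA algorithm itself, which the text does not specify in detail. It only shows the presence/absence encoding, under the assumption that the data are held as a protein-by-cell-line table with `None` marking an NA. The names `binarize`, `universal_proteins`, and the toy protein identifiers and intensities are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed layout: protein name -> one intensity per cell line, None meaning NA.
Matrix = Dict[str, List[Optional[float]]]


def binarize(matrix: Matrix) -> Dict[str, List[int]]:
    """Encode each measurement as 1 (detected) or 0 (NA)."""
    return {
        protein: [0 if value is None else 1 for value in values]
        for protein, values in matrix.items()
    }


def universal_proteins(matrix: Matrix) -> List[str]:
    """Return proteins that have no NA in any cell line."""
    return [
        protein
        for protein, values in matrix.items()
        if all(value is not None for value in values)
    ]


# Hypothetical toy data: three proteins measured across four cell lines.
toy: Matrix = {
    "PROT_A": [21.3, 20.8, 22.1, 21.7],
    "PROT_B": [18.2, None, None, 17.9],
    "PROT_C": [None, 15.4, 16.0, None],
}

binary = binarize(toy)
universal = universal_proteins(toy)
```

In this toy table only `PROT_A` is detected in every cell line, while the 0/1 rows for `PROT_B` and `PROT_C` are the kind of binary profile that could be passed to a clustering or classification step.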