Patrick Jensen (Lyngby / DK; Luebeck / DE), Jan Lellmann (Luebeck / DE), Herbert Thiele (Luebeck / DE), Pia Hönscheid (Dresden / DE), Christian Sperling (Dresden / DE), Gustavo Baretton (Dresden / DE), Oliver Klein (Berlin / DE), Carsten Tschöpe (Berlin / DE), Karin Klingel (Tuebingen / DE)
Matrix-assisted laser desorption/ionization mass spectrometry imaging (MALDI MSI) allows us to determine the spatial distribution of tissue analytes such as peptide signatures via their mass-to-charge ratio. This highly sensitive description of the tissue makes MALDI MSI attractive for clinical use as a supplement to existing diagnostic procedures and automated tools. Methods based on deep learning are emerging as a powerful paradigm for the analysis of MALDI MSI data. However, as their application to MALDI MSI data is still nascent, unified methodologies that transfer from one laboratory to another have yet to emerge. This makes adoption by clinical practitioners challenging, as substantial domain expertise is required to apply deep learning methods successfully.
This work presents a common pipeline for deep learning-based MALDI MSI analysis of human tissue samples. We focus on the case of spectrum classification. Our pipeline is summarized in the figure. First, each spectrum is preprocessed by baseline removal, normalization (e.g., by total ion count), and downsampling. Note that, in contrast to most established approaches, we do not need peak picking or any other dimensionality reduction technique. Then, we train a deep learning model to predict the class of a given spectrum. Specifically, we use a 1-dimensional Vision Transformer and minimize the cross-entropy loss with the AdamW optimizer. To increase the robustness of the model, we augment the spectra with Gaussian noise, random intensity scaling, and mixup during training.
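The preprocessing and augmentation steps can be illustrated with a minimal sketch in plain Python. It is not taken from the published source code: all function names, the rolling-minimum baseline estimate, the bin-averaging downsampler, and every default parameter value (window, factor, noise_std, scale_range, alpha) are assumptions chosen for illustration only.

```python
import random


def remove_baseline(spectrum, window=5):
    """Subtract a rolling-minimum baseline estimate (assumed stand-in method)."""
    n = len(spectrum)
    half = window // 2
    baseline = [min(spectrum[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]
    return [x - b for x, b in zip(spectrum, baseline)]


def normalize_tic(spectrum):
    """Divide every intensity by the total ion count (sum of all intensities)."""
    total = sum(spectrum)
    if total <= 0:
        return list(spectrum)  # leave empty or all-zero spectra unchanged
    return [x / total for x in spectrum]


def downsample(spectrum, factor=2):
    """Average consecutive bins in groups of `factor`; drop an incomplete last group."""
    usable = len(spectrum) - len(spectrum) % factor
    return [sum(spectrum[i:i + factor]) / factor for i in range(0, usable, factor)]


def preprocess(spectrum, window=5, factor=2):
    """Baseline removal, then TIC normalization, then downsampling."""
    return downsample(normalize_tic(remove_baseline(spectrum, window)), factor)


def augment(spectrum, rng, noise_std=0.01, scale_range=(0.9, 1.1)):
    """Random intensity scaling followed by additive Gaussian noise."""
    scale = rng.uniform(*scale_range)
    return [scale * x + rng.gauss(0.0, noise_std) for x in spectrum]


def mixup(x1, y1, x2, y2, rng, alpha=0.2):
    """Blend two spectra and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

Because mixup produces soft labels, the cross-entropy loss has to be evaluated against the blended label vector rather than against a single class index.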
We demonstrate that our pipeline is generally applicable by testing it on three MALDI MSI datasets from different clinical sources. For all datasets, our results are based on 3-fold cross-validation, and we report the balanced accuracy (b.acc.) on the test sets, averaged over the cross-validation folds. We chose balanced accuracy as it is well-suited for imbalanced datasets. For the first dataset, the task is to distinguish between pancreatobiliary and intestinal type ampullary cancer (2 classes). Here, the pipeline achieves a b.acc. of 0.854. For the second dataset, the task is to distinguish between AL and ATTR amyloidosis in myocardial tissue (2 classes). Here, the pipeline achieves a b.acc. of 0.733. For the last dataset, the task is to distinguish between spectra from tissue containing pancreatic ductal adenocarcinoma and other pancreatic cancer types (4 classes). Here, the pipeline achieves a b.acc. of 0.676.
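To make the metric concrete: balanced accuracy is the unweighted mean of the per-class recalls, so a model cannot score well by predicting only the majority class. The following is a minimal sketch of that definition. The function names and the toy labels are illustrative assumptions, not part of the published pipeline.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def mean_over_folds(fold_scores):
    """Average the test-set scores of the individual cross-validation folds."""
    return sum(fold_scores) / len(fold_scores)


# Toy imbalanced example: 8 spectra of class 0 and 2 spectra of class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two minority spectra is misclassified
score = balanced_accuracy(y_true, y_pred)  # recalls are 1.0 and 0.5
```

In this toy example the plain accuracy is 0.9, whereas the balanced accuracy is 0.75, which reflects the weaker performance on the minority class.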
Importantly, all training hyperparameters remain the same for all datasets. While one may achieve better scores by tuning them to each dataset, our goal is specifically to show that a simple pipeline can achieve promising results without further user interaction. We hope that this pipeline may serve as a common baseline and, importantly, as a good starting point for further research. To this end, we have also made the source code available at: https://github.com/patmjen/maldi_dl