Anna Sophia Kujat (Rostock / DE), Conor Glackin (Rostock / DE), Christiane Hassenrück (Rostock / DE), Lukas Vogel (Rostock / DE), Erik Zschaubitz (Rostock / DE), Stefan Lüdtke (Rostock / DE), Matthias Labrenz (Rostock / DE), Theodor Sperlea (Rostock / DE)
Aquatic bacterial communities are integral components of Earth's ecosystems. Investigating their temporal and spatial patterns provides insights into the dynamics of the entire ecosystem. Next-generation sequencing methods provide the data needed to uncover such patterns. However, these data sets are usually compositional and extremely high-dimensional, and they exhibit non-linear relationships, which makes them difficult to interpret and to analyse using statistical methods.
To circumvent these problems and to identify interpretable patterns in the temporal and spatial dynamics of community composition, we applied Topic Modeling (TM), a method originally from computational linguistics, to the microbial communities of the Warnow estuary and the Baltic Sea coast. This environment is characterized by strongly fluctuating environmental conditions and a steep salinity gradient, and it was sampled at high spatio-temporal resolution between April 2022 and April 2023.
More specifically, we used a Random Forest-based machine learning approach to compare several TM algorithms and various data preprocessing methods for their ability to capture ecological relationships. This allowed us to verify how much information TM retained while reducing the dimensionality of the feature space as far as possible.
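The evaluation idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the TM algorithms, preprocessing steps, or target variables used, so Latent Dirichlet Allocation from scikit-learn, a synthetic count table, and a hypothetical salinity-like gradient are all assumptions made here for illustration. The sketch reduces a sample-by-taxon count table to a few topics and then checks, with a cross-validated Random Forest, how well the topic proportions still predict the gradient.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical toy data: 60 samples x 40 taxa, each sample a mixture of
# two synthetic subcommunities weighted by a salinity-like gradient.
n_samples, n_taxa = 60, 40
salinity = rng.uniform(0.0, 1.0, size=n_samples)
fresh = rng.dirichlet(np.full(n_taxa, 0.5))
brackish = rng.dirichlet(np.full(n_taxa, 0.5))
profiles = np.outer(1.0 - salinity, fresh) + np.outer(salinity, brackish)
counts = np.vstack([rng.multinomial(2000, p) for p in profiles])


def evaluate(n_topics):
    """Fit a topic model, then score how well its topics predict the gradient."""
    # Assumption: LDA stands in for whichever TM algorithms were compared.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    theta = lda.fit_transform(counts)  # per-sample topic proportions
    forest = RandomForestRegressor(n_estimators=100, random_state=0)
    scores = cross_val_score(forest, theta, salinity, cv=5, scoring="r2")
    return theta, scores.mean()


# Compare feature spaces of different dimensionality.
results = {k: evaluate(k) for k in (2, 4, 8)}
for k, (theta, r2) in results.items():
    print(f"{k} topics: mean cross-validated R^2 = {r2:.2f}")
```

In such a setup, the smallest number of topics whose predictive score stays close to that of the full taxon table would indicate how far the feature space can be reduced without losing the ecological signal.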
Using this approach, we identified subcommunities (corresponding to topics) specific to freshwater and brackish environments and to particular seasonal phases. These subcommunities can be linked across primer sets and, therefore, across domains of life. In the estuary, we were able to link the retention time of specific subcommunities to weather and hydrodynamic conditions.
Based on these results, we conclude that TM achieves a dimensionality reduction of microbiome datasets that significantly increases their comprehensibility. By taking advantage of structural similarities between linguistic and microbiome datasets, TM proves to be a powerful tool, especially for the analysis of microbial communities in dynamic habitats.