Poster

  • WS1.P001

Data processing workflows with Nextflow and a ptychography use case

Presented in

Poster session WS 1: Data management

Authors

Pablo Fernandez Robledo (Berlin / DE), Thomas C. Pekin (Berlin / DE), Christoph T. Koch (Berlin / DE)

Abstract

As the analysis of electron microscopy experiments has become more complex, involving multiple steps (each of which requires its own analysis code) and large amounts of data, there is a need to organize and scale up this process beyond the personal computer. Ensuring reproducibility from raw data to end results is paramount. Due to the large amounts of data involved, there is a growing trend of moving from personal computers to a server or a cluster of interconnected machines. Traditionally, this meant writing specialized code for a specific cluster, i.e. code that could not easily be run on a personal computer for testing, or on a different cluster [1]. With the advent of workflow engines, some processing workflows can be implemented in a more hardware-independent form [2], with greater ease and usability. This also facilitates the sharing of workflows.

To demonstrate this software paradigm, we implement a ptychographic [3] reconstruction workflow with the workflow engine Nextflow [4], which is infrastructure independent. Within the workflow, processes are described and their input/output relations mapped; opportunities for asynchronous execution are thus identified and exploited automatically. Nextflow processes are also executed in parallel automatically when multiple inputs traverse the workflow steps independently, and parallelization within a single dataset can be expressed via explicit Nextflow constructs. One key advantage of Nextflow is the ability to mix different tools: for example, the output of a Python script can be passed on for processing by a compiled executable.
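
For illustration, a minimal Nextflow DSL2 sketch of this pattern (not the actual code of our workflow) could look as follows; the script names estimate_background.py and ptycho_solver as well as the parameter params.raw_data are placeholders:

    nextflow.enable.dsl = 2

    // Each raw file emitted by the channel becomes an independent task,
    // so multiple datasets are processed in parallel automatically.
    process PREPROCESS {
        input:
        path raw

        output:
        path 'preprocessed.npy'

        script:          // a Python step
        """
        estimate_background.py ${raw} preprocessed.npy
        """
    }

    process RECONSTRUCT {
        input:
        path prep

        output:
        path 'object.npy'

        script:          // a compiled executable
        """
        ptycho_solver --input ${prep} --output object.npy
        """
    }

    workflow {
        Channel.fromPath(params.raw_data) | PREPROCESS | RECONSTRUCT
    }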

Our workflow takes as input the raw data files from the microscope as well as a YAML configuration file that specifies the settings to be applied at each step. This hierarchical file also controls which processing steps are executed. As Nextflow does not use a deep hierarchy in its YAML configuration file, but this was desired for our workflow, custom functions were written to extend Nextflow accordingly.
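
As an illustration only (the function name getSetting and the keys shown are hypothetical, not our actual implementation), such a helper can resolve a dotted key path against the nested map obtained from the YAML file:

    // Look up a dotted key path, e.g. 'reconstruction.probe_diameter',
    // in a nested map loaded from the YAML configuration file,
    // falling back to a default value if the key is absent.
    def getSetting(Map cfg, String path, def fallback = null) {
        def node = cfg
        for( key in path.tokenize('.') ) {
            if( !(node instanceof Map) || !node.containsKey(key) )
                return fallback
            node = node[key]
        }
        return node
    }

    // e.g. def binning = getSetting(params, 'preprocessing.binning', 1)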

The workflow itself has the task of preprocessing the raw microscope data and then performing a ptychographic reconstruction. Additional postprocessing steps within the workflow can then follow.
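
Assuming processes like those sketched above (plus an analogous POSTPROCESS process), the stages could be chained in the workflow block, with the optional postprocessing switched by a hypothetical configuration flag:

    workflow {
        raw   = Channel.fromPath(params.raw_data)
        prep  = PREPROCESS(raw)            // preprocessing of the raw microscope data
        recon = RECONSTRUCT(prep)          // ptychographic reconstruction
        if( params.run_postprocessing )    // optional postprocessing steps
            POSTPROCESS(recon)
    }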

The resulting workflow was developed and executed on a laptop and later run on a local cluster consisting of one head node and two execution nodes managed by Slurm [5]. A filesystem shared between the three machines was also required by Nextflow.
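
Such a setup can, for example, be captured as profiles in nextflow.config so that the same script runs unchanged in both environments; the queue name and work directory below are placeholders:

    profiles {
        standard {
            process.executor = 'local'     // development runs on the laptop
        }
        slurm {
            process.executor = 'slurm'     // tasks are submitted as Slurm jobs
            process.queue    = 'main'
            workDir          = '/shared/nextflow-work'   // located on the shared filesystem
        }
    }

The cluster run is then launched with, e.g., nextflow run main.nf -profile slurm.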

Here, we successfully demonstrate the use of workflows for electron microscopy data. Workflows can ease the transition of code from a personal computer to a cluster, while also promoting reproducibility and shareability and helping to manage the complexity of multistep data processing.

[1] C Schiefer et al. (2020), https://doi.org/10.48550/arXiv.2006.03104

[2] F Lehmann (2021), Proceedings of the CIKM 2021 Workshops, http://ceur-ws.org/Vol-3052/short12.pdf

[3] MJ Humphry et al., Nat. Commun. 3 (2012), p. 1, https://doi.org/10.1038/ncomms1733

[4] P Di Tommaso et al., Nat. Biotechnol. 35 (2017), p. 316, https://doi.org/10.1038/nbt.3820

[5] M Jette and M Grondona, "SLURM: Simple Linux Utility for Resource Management", Proceedings of ClusterWorld Conference and Expo (2003)
