Aaron Maurais (Seattle, WA / US), Michael Riffle (Seattle, WA / US), Nathan J. Edwards (Washington, DC / US), Gennifer E. Merrihew (Seattle, WA / US), Brian Connolly (Seattle, WA / US), Matthew Chambers (Seattle, WA / US), Brendan MacLean (Seattle, WA / US), Julia E. Robbins (Seattle, WA / US), Ratna Thangudu (Reston, VA / US), Brian Searle (Columbus, OH / US), Christine Wu (Seattle, WA / US), Paul Rudnick (Bainbridge Island, WA / US), Michael J MacCoss (Seattle, WA / US)
We've developed a fast and scalable Nextflow pipeline for the analysis of large data-independent acquisition (DIA) datasets. The pipeline starts from vendor raw files, which can be stored on PanoramaWeb, in the NCI Proteomic Data Commons (PDC), or on a local workstation. The raw files are downloaded if necessary and converted to mzML using MSConvert. Each file is searched with EncyclopeDIA in parallel, and the per-file results are merged into a single library. The merged library and mzML files are then imported into a Skyline document for quantification and visualization. Peptide-level and protein- or gene-level matrices are exported from the Skyline document and used to generate a quality control (QC) report that enables an assessment of data quality. To run the pipeline, the only software that must be installed is the Nextflow engine; the Docker containers needed for each step are downloaded on the fly as the workflow runs. The pipeline can be run either in the cloud with AWS Batch or on a local workstation.
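The parallel per-file search followed by a library merge is a fan-out/fan-in pattern. The sketch below illustrates that pattern in plain Python; `search_file` and `merge_libraries` are hypothetical stand-ins for the containerized EncyclopeDIA search and library-merge steps, which the actual pipeline dispatches as Nextflow processes rather than local threads.

```python
from concurrent.futures import ThreadPoolExecutor

def search_file(mzml_path):
    # Hypothetical stand-in for a per-file EncyclopeDIA search:
    # returns a small "library" of identifications for one run.
    return {f"peptides_from_{mzml_path}"}

def merge_libraries(libraries):
    # Hypothetical stand-in for combining per-file results
    # into a single merged library.
    merged = set()
    for lib in libraries:
        merged |= lib
    return merged

def run_pipeline(mzml_files):
    # Fan out: each file is searched independently, so the work
    # parallelizes across workers (in Nextflow, across compute nodes).
    with ThreadPoolExecutor() as pool:
        libraries = list(pool.map(search_file, mzml_files))
    # Fan in: merge the per-file results into one library.
    return merge_libraries(libraries)
```

Because each per-file search is independent, the workflow scales with the number of available workers, whether those are local cores or AWS Batch jobs.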
We illustrate the utility of the pipeline with three use cases. In the first use case, the pipeline is used to download and analyze raw files and metadata from several DIA studies in the PDC. To facilitate harmonization with existing DDA studies, gene-level reports are automatically generated in the same format used by the DDA Common Data Analysis Pipeline (CDAP). In the second use case, lysates from the NCI-7 panel of reference cancer cell lines were sent to 16 groups in the International Cancer Proteogenome Consortium (ICPC), and the pipeline was used to analyze the 11 of 21 resulting datasets that were acquired by DIA. After normalization and batch correction, each cell type clustered together, regardless of the lab in which the data were acquired. In the third use case, the pipeline was used to analyze 745 samples from a challenge organized by the Intelligence Advanced Research Projects Activity (IARPA).
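The normalization and batch-correction step in the second use case can be sketched minimally as median normalization of each run on the log scale, followed by per-batch mean centering of each feature. Both function names and the simple mean-centering approach are illustrative assumptions for this sketch, not the pipeline's actual method.

```python
from statistics import mean, median

def median_normalize(runs):
    # Shift each run (a list of log-scale abundances) so all runs
    # share a common median, removing per-run loading differences.
    # ASSUMPTION: simple median shifting stands in for whatever
    # normalization the pipeline actually applies.
    grand = median(v for run in runs for v in run)
    return [[v - median(run) + grand for v in run] for run in runs]

def center_batches(runs, batches):
    # Subtract each batch's per-feature mean so runs acquired in
    # different labs/batches become comparable. ASSUMPTION: simple
    # mean centering stands in for a real batch-correction method.
    n_features = len(runs[0])
    corrected = [run[:] for run in runs]
    for b in set(batches):
        idx = [i for i, lab in enumerate(batches) if lab == b]
        for j in range(n_features):
            m = mean(runs[i][j] for i in idx)
            for i in idx:
                corrected[i][j] = runs[i][j] - m
    return corrected
```

After such a correction, distances between samples reflect biology (cell type) rather than acquisition site, which is why the NCI-7 cell lines cluster by cell type rather than by lab.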