1 Introduction

systemPipeRdata provides data analysis workflow templates compatible with the systemPipeR software package (H Backman and Girke 2016). The latter is a Workflow Management System (WMS) for designing and running end-to-end analysis workflows with automated report generation for a wide range of data analysis applications. Support for running external software is provided by a command-line interface (CLI) that adopts the Common Workflow Language (CWL). The use of systemPipeR is explained in its main vignette. The workflow templates provided by systemPipeRdata come equipped with sample data and the parameter files required to run a selected workflow. This setup simplifies learning systemPipeR, facilitates testing of workflows, and serves as a foundation for designing new workflows. The standardized directory structure (Figure 1) used by the workflow templates and their sample data is outlined in the Directory Structure section of systemPipeR's main vignette.

Figure 1: Directory structure of systemPipeR's workflows. For details, see the Directory Structure section of systemPipeR's main vignette.

2 Getting started

2.1 Installation

The systemPipeRdata package is available at Bioconductor and can be installed from within R as follows.

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("systemPipeRdata")

2.2 Loading package and documentation

library("systemPipeRdata")  # Loads the package
library(help = "systemPipeRdata")  # Lists package info
vignette("systemPipeRdata")  # Opens vignette

3 Overview of workflow templates

An overview table of the workflow templates included in systemPipeRdata can be returned as shown below. By clicking the URLs in the last column of the workflow list, users can view the Rmd source file of a workflow, as well as the final HTML report generated after running a workflow on the provided test data. A list of the default data analysis steps included in each workflow is given here. Additional workflow templates are available on this project’s GitHub organization (for details, see below). To create an empty workflow template without any test data, users can choose the new template, which includes only the required directory structure and parameter files.

availableWF()
Name        Description                               URL
new         Generic Workflow Template                 Rmd, HTML
rnaseq      RNA-Seq Workflow Template                 Rmd, HTML
riboseq     RIBO-Seq Workflow Template                Rmd, HTML
chipseq     ChIP-Seq Workflow Template                Rmd, HTML
varseq      VAR-Seq Workflow Template                 Rmd, HTML
SPblast     BLAST Workflow Template                   Rmd, HTML
SPcheminfo  Cheminformatics Drug Similarity Template  Rmd, HTML
SPscrna     Basic Single-Cell Workflow Template       Rmd, HTML

Table 1: Workflow templates

4 Use workflow templates

4.1 Load a workflow

The example below uses the genWorkenvir function from the systemPipeRdata package to create an RNA-Seq workflow environment (selected with workflow="rnaseq") that is fully populated with a small test data set, including FASTQ files, a reference genome and annotation data. The name of the resulting workflow directory can be specified with the mydirname argument. The default, NULL, uses the name of the chosen workflow. An error is issued if a directory of the same name and path already exists. Next, the user’s R session needs to be directed into the resulting rnaseq directory (here with setwd). The other workflow templates from the above table can be loaded the same way.

library(systemPipeRdata)
genWorkenvir(workflow = "rnaseq")
setwd("rnaseq")

On Linux and macOS systems the same can be achieved from the command line of a terminal with the following commands.

$ Rscript -e "systemPipeRdata::genWorkenvir(workflow='rnaseq', mydirname='rnaseq')"
$ cd rnaseq
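Because genWorkenvir issues an error when the target directory already exists, a script that may be rerun can choose an unused directory name first. The following is a minimal sketch of that idea, not part of systemPipeRdata: next_free_dir is a hypothetical helper, and the "_1", "_2", ... suffix scheme is an assumption made for illustration.

```r
# Hypothetical helper (not part of systemPipeRdata): return `base` if no
# directory of that name exists under `parent`, otherwise base_1, base_2, ...
next_free_dir <- function(base, parent = ".") {
  candidate <- base
  i <- 0
  while (dir.exists(file.path(parent, candidate))) {
    i <- i + 1
    candidate <- paste0(base, "_", i)
  }
  candidate
}

# Create the workflow environment only when systemPipeRdata is installed
if (requireNamespace("systemPipeRdata", quietly = TRUE)) {
  wf_dir <- next_free_dir("rnaseq")
  systemPipeRdata::genWorkenvir(workflow = "rnaseq", mydirname = wf_dir)
  setwd(wf_dir)
}
```

The helper only inspects the file system, so it can be reused for any of the other workflow templates by changing the base name and the workflow argument.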

4.2 Run and visualize workflow

For details on running and working with systemPipeR workflows, users should consult systemPipeR’s main vignette. The following gives only a very brief preview of how to run workflows and create scientific and technical reports.

After a workflow environment (directory) has been created and the corresponding R session directed into the resulting directory (here rnaseq), the workflow can be loaded from the included R Markdown file (Rmd, here systemPipeRNAseq.Rmd). This template provides common data analysis steps that are typical for RNA-Seq workflows. Users can add, remove, or modify workflow steps either by applying these changes to the sal workflow management container directly, or by updating the Rmd file first and then updating sal accordingly.

library(systemPipeR)
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd", verbose = FALSE)

The default analysis steps of the imported RNA-Seq workflow are listed below. Users can modify the existing steps, add new ones or remove steps as needed.

Default analysis steps in RNA-Seq Workflow

  1. Read preprocessing
    • Quality filtering (trimming)
    • FASTQ quality report
  2. Alignments: HISAT2 (or any other RNA-Seq aligner)
  3. Alignment stats
  4. Read counting
  5. Sample-wise correlation analysis
  6. Analysis of differentially expressed genes (DEGs)
  7. GO term enrichment analysis
  8. Gene-wise clustering

Once the workflow has been loaded into sal, it can be executed from start to finish (or partially) with the runWF command. However, running the workflow is only possible if all required command-line (CL) software is installed on the user’s system. The names of these tools and their availability on a system can be listed with listCmdTools(sal, check_path=TRUE).

sal <- runWF(sal)
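For a quick check outside of systemPipeR, base R's Sys.which can test whether given executables are on the PATH. The sketch below assumes the tool names hisat2 and samtools purely for illustration; the authoritative list for a loaded workflow comes from listCmdTools. The commented lines show how a partial run might then be requested via the steps argument of runWF.

```r
# Assumed tool names for illustration only; use listCmdTools(sal,
# check_path = TRUE) to obtain the tools a loaded workflow really needs.
tools <- c("hisat2", "samtools")

# Sys.which() returns an empty string for executables not found on the PATH
found <- Sys.which(tools) != ""
print(data.frame(tool = tools, on_path = unname(found)))

# Usage, assuming `sal` holds the imported workflow:
# if (all(found)) sal <- runWF(sal)  # run all steps
# sal <- runWF(sal, steps = 1:3)     # run only the first three steps
```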

Workflows can be visualized as topology graphs using the plotWF function.

plotWF(sal)