1 Introduction

1.1 Overview

This workflow template is for analyzing RNA-Seq data. It is provided by systemPipeRdata, a companion package to systemPipeR (H Backman and Girke 2016). Similar to other systemPipeR workflow templates, a single command generates the necessary working environment. This includes the expected directory structure for executing systemPipeR workflows and parameter files for running command-line (CL) software utilized in specific analysis steps. For learning and testing purposes, a small sample (toy) data set is also included (mainly FASTQ and reference genome files). This enables users to seamlessly run the numerous analysis steps of this workflow from start to finish without the requirement of providing custom data. After testing the workflow, users have the flexibility to employ the template as is with their own data or modify it to suit their specific needs. For more comprehensive information on designing and executing workflows, users want to refer to the main vignettes of systemPipeR and systemPipeRdata.

The Rmd file (systemPipeRNAseq.Rmd) associated with this vignette serves a dual purpose. It acts both as a template for executing the workflow and as a template for generating a reproducible scientific analysis report. Thus, users want to customize the text (and/or code) of this vignette to describe their experimental design and analysis results. This typically involves deleting the instructions how to work with this workflow, and customizing the text describing experimental designs, other metadata and analysis results.

1.2 Experimental design

Typically, the user wants to describe here the sources and versions of the reference genome sequence along with the corresponding annotations. The standard directory structure of systemPipeR (see here), expects the input data in a subdirectory named data and all results will be written to a separate results directory. The Rmd source file for executing the workflow and rendering its report (here systemPipeRNAseq.Rmd) is expected to be located in the parent directory.

The test (toy) data set used by this template (SRP010938) contains 18 paired-end (PE) read sets from Arabidposis thaliana (Howard et al. 2013). To minimize processing time during testing, each FASTQ file has been reduced to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the A. thaliana genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files have been reduced to the same genome regions. This way the entire test sample data set is less than 200MB in storage space. A PE read set has been chosen here for flexibility, because it can be used for testing both types of analysis routines requiring either SE (single end) reads or PE reads.

To use their own RNA-Seq and reference genome data, users want to move or link the data to the designated data directory and execute the workflow from the parent directory using their customized Rmd file. Beginning with this template, users should delete the provided test data and move or link their custom data to the designated locations. Alternatively, users can create an environment skeleton (named new here) or build one from scratch. To perform an RNA-Seq analysis with new FASTQ files from the same reference genome, users only need to provide the FASTQ files and an experimental design file called ‘targets’ file that outlines the experimental design. The structure and utility of targets files is described in systemPipeR's main vignette here.

1.3 Workflow steps

The default analysis steps included in this RNA-Seq workflow template are listed below. Users can modify the existing steps, add new ones or remove steps as needed.

Default analysis steps

  1. Read preprocessing
    • Quality filtering (trimming)
    • FASTQ quality report
  2. Alignments: HISAT2 (or any other RNA-Seq aligner)
  3. Alignment stats
  4. Read counting
  5. Sample-wise correlation analysis
  6. Analysis of differentially expressed genes (DEGs)
  7. GO term enrichment analysis
  8. Gene-wise clustering

1.4 Load workflow environment

The environment for this RNA-Seq workflow is auto-generated below with the genWorkenvir function (selected under workflow="rnaseq"). It is fully populated with a small test data set, including FASTQ files, reference genome and annotation data. The name of the resulting workflow directory can be specified under the mydirname argument. The default NULL uses the name of the chosen workflow. An error is issued if a directory of the same name and path exists already. After this, the user’s R session needs to be directed into the resulting rnaseq directory (here with setwd).

library(systemPipeRdata)
genWorkenvir(workflow = "rnaseq", mydirname = "rnaseq")
setwd("rnaseq")

1.4.1 Input data: targets file

The targets file defines the input files (e.g. FASTQ or BAM) and sample comparisons used in a data analysis workflow. It can also store any number of additional descriptive information for each sample. The following shows the first four lines of the targets file used in this workflow template.

targetspath <- system.file("extdata", "targetsPE.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")
targets[1:4, -c(5, 6)]
##                     FileName1                   FileName2
## 1 ./data/SRR446027_1.fastq.gz ./data/SRR446027_2.fastq.gz
## 2 ./data/SRR446028_1.fastq.gz ./data/SRR446028_2.fastq.gz
## 3 ./data/SRR446029_1.fastq.gz ./data/SRR446029_2.fastq.gz
## 4 ./data/SRR446030_1.fastq.gz ./data/SRR446030_2.fastq.gz
##   SampleName Factor        Date
## 1        M1A     M1 23-Mar-2012
## 2        M1B     M1 23-Mar-2012
## 3        A1A     A1 23-Mar-2012
## 4        A1B     A1 23-Mar-2012

To work with custom data, users need to generate a targets file containing the paths to their own FASTQ files. Here is a detailed description of the structure and utility of targets files.

2 Quick start

After a workflow environment has been created with the above genWorkenvir function call and the corresponding R session directed into the resulting directory (here rnaseq), the SPRproject function is used to initialize a new workflow project instance. The latter creates an empty SAL workflow container (below sal) and at the same time a linked project log directory (default name .SPRproject) that acts as a flat-file database of a workflow. Additional details about this process and the SAL workflow control class are provided in systemPipeR's main vignette here and here.

Next, the importWF function imports all the workflow steps outlined in the source Rmd file of this vignette (here systemPipeRNAseq.Rmd) into the SAL workflow container. An overview of the workflow steps and their status information can be returned at any stage of the loading or run process by typing sal.

library(systemPipeR)
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd", verbose = FALSE)
sal

After loading the workflow into sal, it can be executed from start to finish (or partially) with the runWF command. Running the workflow will only be possible if all dependent CL software is installed on a user’s system. Their names and availability on a system can be listed with listCmdTools(sal, check_path=TRUE). For more information about the runWF command, refer to the help file and the corresponding section in the main vignette here.

Running workflows in parallel mode on computer clusters is a straightforward process in systemPipeR. Users can simply append the resource parameters (such as the number of CPUs) for a cluster run to the sal object after importing the workflow steps with importWF using the addResources function. More information about parallelization can be found in the corresponding section at the end of this vignette here and in the main vignette here.

sal <- runWF(sal)

Workflows can be visualized as topology graphs using the plotWF function.

plotWF(sal)