systemPipeR 2.13.0
This workflow template is for analyzing RNA-Seq data. It is provided by
systemPipeRdata,
a companion package to systemPipeR (H Backman and Girke 2016).
Similar to other systemPipeR
workflow templates, a single command generates
the necessary working environment. This includes the expected directory
structure for executing systemPipeR
workflows and parameter files for running
command-line (CL) software utilized in specific analysis steps. For learning
and testing purposes, a small sample (toy) data set is also included (mainly
FASTQ and reference genome files). This enables users to seamlessly run the
numerous analysis steps of this workflow from start to finish without the
requirement of providing custom data. After testing the workflow, users have
the flexibility to employ the template as is with their own data or modify it
to suit their specific needs. For more comprehensive information on designing
and executing workflows, users want to refer to the main vignettes of
systemPipeR
and
systemPipeRdata.
The Rmd
file (systemPipeRNAseq.Rmd
) associated with this vignette serves a dual purpose. It acts
both as a template for executing the workflow and as a template for generating
a reproducible scientific analysis report. Thus, users want to customize the text
(and/or code) of this vignette to describe their experimental design and
analysis results. This typically involves deleting the instructions how to work
with this workflow, and customizing the text describing experimental designs,
other metadata and analysis results.
Typically, the user wants to describe here the sources and versions of the
reference genome sequence along with the corresponding annotations. The standard
directory structure of systemPipeR
(see here),
expects the input data in a subdirectory named data
and all results will be written to a separate results
directory. The Rmd source file
for executing the workflow and rendering its report (here systemPipeRNAseq.Rmd
) is
expected to be located in the parent directory.
The test (toy) data set used by this template (SRP010938) contains 18 paired-end (PE) read sets from Arabidposis thaliana (Howard et al. 2013). To minimize processing time during testing, each FASTQ file has been reduced to 90,000-100,000 randomly sampled PE reads that map to the first 100,000 nucleotides of each chromosome of the A. thaliana genome. The corresponding reference genome sequence (FASTA) and its GFF annotation files have been reduced to the same genome regions. This way the entire test sample data set is less than 200MB in storage space. A PE read set has been chosen here for flexibility, because it can be used for testing both types of analysis routines requiring either SE (single end) reads or PE reads.
To use their own RNA-Seq and reference genome data, users want to move or link the
data to the designated data
directory and execute the workflow from the parent directory
using their customized Rmd
file. Beginning with this template, users should delete the provided test
data and move or link their custom data to the designated locations.
Alternatively, users can create an environment skeleton (named new
here) or
build one from scratch. To perform an RNA-Seq analysis with new FASTQ files
from the same reference genome, users only need to provide the FASTQ files and
an experimental design file called ‘targets’ file that outlines the experimental
design. The structure and utility of targets files is described in systemPipeR's
main vignette here.
The default analysis steps included in this RNA-Seq workflow template are listed below. Users can modify the existing steps, add new ones or remove steps as needed.
Default analysis steps
HISAT2
(or any other RNA-Seq aligner)The environment for this RNA-Seq workflow is auto-generated below with the
genWorkenvir
function (selected under workflow="rnaseq"
). It is fully populated
with a small test data set, including FASTQ files, reference genome and annotation data. The name of the
resulting workflow directory can be specified under the mydirname
argument.
The default NULL
uses the name of the chosen workflow. An error is issued if
a directory of the same name and path exists already. After this, the user’s R
session needs to be directed into the resulting rnaseq
directory (here with
setwd
).
library(systemPipeRdata)
genWorkenvir(workflow = "rnaseq", mydirname = "rnaseq")
setwd("rnaseq")
targets
fileThe targets
file defines the input files (e.g. FASTQ or BAM) and sample
comparisons used in a data analysis workflow. It can also store any number of
additional descriptive information for each sample. The following shows the first
four lines of the targets
file used in this workflow template.
targetspath <- system.file("extdata", "targetsPE.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")
targets[1:4, -c(5, 6)]
## FileName1 FileName2
## 1 ./data/SRR446027_1.fastq.gz ./data/SRR446027_2.fastq.gz
## 2 ./data/SRR446028_1.fastq.gz ./data/SRR446028_2.fastq.gz
## 3 ./data/SRR446029_1.fastq.gz ./data/SRR446029_2.fastq.gz
## 4 ./data/SRR446030_1.fastq.gz ./data/SRR446030_2.fastq.gz
## SampleName Factor Date
## 1 M1A M1 23-Mar-2012
## 2 M1B M1 23-Mar-2012
## 3 A1A A1 23-Mar-2012
## 4 A1B A1 23-Mar-2012
To work with custom data, users need to generate a targets
file containing
the paths to their own FASTQ files. Here is a detailed description of the structure and
utility of targets
files.
After a workflow environment has been created with the above genWorkenvir
function call and the corresponding R session directed into the resulting directory (here rnaseq
),
the SPRproject
function is used to initialize a new workflow project instance. The latter
creates an empty SAL
workflow container (below sal
) and at the same time a
linked project log directory (default name .SPRproject
) that acts as a
flat-file database of a workflow. Additional details about this process and
the SAL workflow control class are provided in systemPipeR's
main vignette
here
and here.
Next, the importWF
function imports all the workflow steps outlined in the
source Rmd file of this vignette (here systemPipeRNAseq.Rmd
) into the SAL
workflow container.
An overview of the workflow steps and their status information can be returned
at any stage of the loading or run process by typing sal
.
library(systemPipeR)
sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd", verbose = FALSE)
sal
After loading the workflow into sal
, it can be executed from start to finish
(or partially) with the runWF
command. Running the workflow will only be
possible if all dependent CL software is installed on a user’s system. Their
names and availability on a system can be listed with listCmdTools(sal, check_path=TRUE)
. For more information about the runWF
command, refer to the
help file and the corresponding section in the main vignette
here.
Running workflows in parallel mode on computer clusters is a straightforward
process in systemPipeR
. Users can simply append the resource parameters (such
as the number of CPUs) for a cluster run to the sal
object after importing
the workflow steps with importWF
using the addResources
function. More
information about parallelization can be found in the corresponding section at
the end of this vignette here and in the main vignette
here.
sal <- runWF(sal)
Workflows can be visualized as topology graphs using the plotWF
function.
plotWF(sal)