Note: the most recent version of this tutorial can be found here.

Note: if you use systemPipeR in published research, please cite: Backman, T.W.H and Girke, T. (2016). systemPipeR: NGS Workflow and Report Generation Environment. BMC Bioinformatics, 17: 388. 10.1186/s12859-016-1241-0.

1 Workflow templates

The intended way of running systemPipeR workflows is via *.Rmd files, which can be executed either line-wise in interactive mode or with a single command from R or the command-line. This way comprehensive and reproducible analysis reports can be generated in PDF or HTML format in a fully automated manner by making use of the highly functional reporting utilities available for R.
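As a minimal sketch of the single-command route, the snippet below writes a tiny stand-in report to a temporary directory and renders it with rmarkdown::render(). The file name demo_report.Rmd and its contents are illustrative assumptions and not part of systemPipeR; in a real project the input would be the workflow's own *.Rmd file. The render call is guarded so that the snippet still runs where rmarkdown or pandoc is not installed.

```r
# Write a minimal stand-in R Markdown file (hypothetical name and content;
# a real project would use the workflow template's own *.Rmd file).
fence <- strrep("`", 3)  # code-chunk delimiter used by R Markdown
rmd_file <- file.path(tempdir(), "demo_report.Rmd")
writeLines(c(
  "---",
  "title: \"Demo report\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(c(1, 2, 3))",
  fence
), rmd_file)

# Render the report with a single command, but only if the tooling is present.
rendered <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rendered <- rmarkdown::render(rmd_file,
                                output_format = "html_document",
                                quiet = TRUE)
}

# Rough command-line equivalent, run from the directory holding the file:
#   Rscript -e "rmarkdown::render('demo_report.Rmd')"
```

When rendering succeeds, rendered holds the path of the generated HTML file; otherwise it stays NULL.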

Templates for setting up custom project reports are provided as *.Rmd files by the helper package systemPipeRdata and in the vignettes subdirectory of systemPipeR. The corresponding HTML versions of these report templates are available here: systemPipeRNAseq, systemPipeRIBOseq, systemPipeChIPseq and systemPipeVARseq. To work with *.Rmd files efficiently, basic knowledge of knitr and LaTeX or R Markdown v2 is required.

1.1 Directory Structure

Figure 1: systemPipeR’s preconfigured directory structure

The working environment of the sample data loaded in the previous step contains the following preconfigured directory structure. Directory names are indicated in green. Users can change this structure as needed, but must then adjust the code in their workflows accordingly.

  • workflow/ (e.g. rnaseq/)
    • This is the root directory of the R session running the workflow.
    • Run script (*.Rmd) and sample annotation (targets.txt) files are located here.
    • Note, this directory can have any name (e.g. rnaseq, varseq). Changing its name does not require any modifications in the run script(s).
    • Important subdirectories:
      • param/
        • Stores non-CWL parameter files such as: *.param, *.tmpl and *.run.sh. These files are only required for backwards compatibility to run old workflows using the previous custom command-line interface.
        • param/cwl/: This subdirectory stores all the CWL parameter files. To organize workflows, each one can have its own subdirectory; all CWL param and input.yml files belonging to a workflow need to be in the same subdirectory.
      • data/
        • FASTQ files
        • FASTA file of reference (e.g. reference genome)
        • Annotation files
        • etc.
      • results/
        • Analysis results are usually written to this directory, including: alignment, variant and peak files (BAM, VCF, BED); tabular result files; and image/plot files
        • Note, the user has the option to organize results files for a given sample and analysis step in a separate subdirectory.
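The layout above can be reproduced by hand, which is mainly useful for checking that a customized project still contains the directories a workflow expects. The following is a minimal base R sketch that builds the skeleton in a temporary location; the root name rnaseq is only an example, and in practice the template loader creates these directories.

```r
# Build the directory skeleton described above inside a temporary location.
# "rnaseq" is only an example name for the workflow root directory.
wf_root <- file.path(tempdir(), "rnaseq")
subdirs <- c("param", file.path("param", "cwl"), "data", "results")

for (d in file.path(wf_root, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Confirm that every expected subdirectory is present.
present <- dir.exists(file.path(wf_root, subdirs))
names(present) <- subdirs
print(present)
```

The same dir.exists() check can be pointed at an existing project root to find out which of the expected subdirectories are missing after the structure has been changed.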

The following parameter files are included in each workflow template:

  1. targets.txt: initial one provided by user; downstream targets_*.txt files are generated automatically
  2. *.param/cwl: defines parameters for input/output file operations, e.g.:
    • hisat2-se/hisat2-mapping-se.cwl
    • hisat2-se/hisat2-mapping-se.yml
  3. *_run.sh: optional bash scripts
  4. Configuration files for computer cluster environments (skip on single machines):
    • .batchtools.conf.R: defines the type of scheduler for batchtools and points to the template file of the cluster; it is located in the user’s home directory
    • *.tmpl: specifies parameters of scheduler used by a system, e.g. Torque, SGE, Slurm, etc.
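To make the role of the targets file more concrete, here is a small base R sketch that writes a toy targets file and reads it back. It assumes a tab-delimited layout with FileName, SampleName and Factor columns and comment lines starting with "#". The file paths and sample names are made up for illustration, and real targets files may carry additional columns.

```r
# Write a toy targets file (all values are invented for illustration).
targets_file <- file.path(tempdir(), "targets.txt")
writeLines(c(
  "# Toy example; header comments like this one are skipped on import",
  "FileName\tSampleName\tFactor",
  "./data/sampleA.fastq.gz\tA1\tA",
  "./data/sampleB.fastq.gz\tB1\tB"
), targets_file)

# Read the tab-delimited table, ignoring lines that start with "#".
targets <- read.delim(targets_file, comment.char = "#",
                      stringsAsFactors = FALSE)
print(targets)
```

Each row describes one sample: where its input file lives, a short sample name, and the experimental factor it belongs to.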

2 RNA-Seq Workflow

This workflow demonstrates how to use various utilities for building and running automated end-to-end analysis workflows for RNA-Seq data.

The full workflow can be found here: HTML, .Rmd, and .R.

2.1 Loading package and workflow template

Load the RNA-Seq sample workflow into your current working directory.

library(systemPipeRdata)
genWorkenvir(workflow = "rnaseq")
setwd("rnaseq")

2.2 Create the workflow

This template provides some common steps for an RNA-Seq workflow. One can add, remove, or modify workflow steps by operating on the sal object.

sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRNAseq.Rmd", verbose = FALSE)

The workflow includes the following steps:

  1. Read preprocessing
    • Quality filtering (trimming)
    • FASTQ quality report
  2. Alignments: HISAT2 (or any other RNA-Seq aligner)
  3. Alignment stats
  4. Read counting
  5. Sample-wise correlation analysis
  6. Analysis of differentially expressed genes (DEGs)
  7. GO term enrichment analysis
  8. Gene-wise clustering

2.3 Run workflow

sal <- runWF(sal)

2.4 Workflow visualization

plotWF(sal)

2.5 Report generation

sal <- renderReport(sal)
sal <- renderLogs(sal)

3 ChIP-Seq Workflow

This workflow demonstrates how to use various utilities for building and running automated end-to-end analysis workflows for ChIP-Seq data.

The full workflow can be found here: HTML, .Rmd, and .R.

3.1 Loading package and workflow template

Load the ChIP-Seq sample workflow into your current working directory.

library(systemPipeRdata)
genWorkenvir(workflow = "chipseq")
setwd("chipseq")

The workflow includes the following steps:

  1. Read preprocessing
    • Quality filtering (trimming)
    • FASTQ quality report
  2. Alignments: Bowtie2 or rsubread
  3. Alignment stats
  4. Peak calling: MACS2
  5. Peak annotation with genomic context
  6. Differential binding analysis
  7. GO term enrichment analysis
  8. Motif analysis

3.2 Create the workflow

This template provides some common steps for a ChIP-Seq workflow. One can add, remove, or modify workflow steps by operating on the sal object.

sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeChIPseq.Rmd", verbose = FALSE)

3.3 Run workflow

sal <- runWF(sal)

3.4 Workflow visualization

plotWF(sal)

3.5 Report generation

sal <- renderReport(sal)
sal <- renderLogs(sal)

4 VAR-Seq Workflow

This workflow demonstrates how to use various utilities for building and running automated end-to-end analysis workflows for VAR-Seq data.

The full workflow can be found here: HTML, .Rmd, and .R.

4.1 Loading package and workflow template

Load the VAR-Seq sample workflow into your current working directory.

library(systemPipeRdata)
genWorkenvir(workflow = "varseq")
setwd("varseq")

The workflow includes the following steps:

  1. Read preprocessing
    • Quality filtering (trimming)
    • FASTQ quality report
  2. Alignments: gsnap, bwa
  3. Variant calling: VariantTools, GATK, BCFtools
  4. Variant filtering: VariantTools and VariantAnnotation
  5. Variant annotation: VariantAnnotation
  6. Combine results from many samples
  7. Summary statistics of samples

4.2 Create the workflow

This template provides some common steps for a VAR-Seq workflow. One can add, remove, or modify workflow steps by operating on the sal object.

sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeVARseq.Rmd", verbose = FALSE)

4.3 Run workflow

sal <- runWF(sal)

4.4 Workflow visualization

plotWF(sal)

4.5 Report generation

sal <- renderReport(sal)
sal <- renderLogs(sal)

5 Ribo-Seq Workflow

This workflow demonstrates how to use various utilities for building and running automated end-to-end analysis workflows for Ribo-Seq data.

The full workflow can be found here: HTML, .Rmd, and .R.

5.1 Loading package and workflow template

Load the Ribo-Seq sample workflow into your current working directory.

library(systemPipeRdata)
genWorkenvir(workflow = "riboseq")
setwd("riboseq")

The workflow includes the following steps:

  1. Read preprocessing
    • Adaptor trimming and quality filtering
    • FASTQ quality report
  2. Alignments: HISAT2 (or any other RNA-Seq aligner)
  3. Alignment stats
  4. Compute read distribution across genomic features
  5. Adding custom features to workflow (e.g. uORFs)
  6. Genomic read coverage along transcripts
  7. Read counting
  8. Sample-wise correlation analysis
  9. Analysis of differentially expressed genes (DEGs)
  10. GO term enrichment analysis
  11. Gene-wise clustering
  12. Differential ribosome binding (translational efficiency)

This template provides some common steps for a Ribo-Seq workflow. One can add, remove, or modify workflow steps by operating on the sal object.

sal <- SPRproject()
sal <- importWF(sal, file_path = "systemPipeRIBOseq.Rmd", verbose = FALSE)

5.2 Run workflow

sal <- runWF(sal)

5.3 Workflow visualization

plotWF(sal)

5.4 Report generation

sal <- renderReport(sal)
sal <- renderLogs(sal)

6 Version information

sessionInfo()
## R Under development (unstable) (2024-03-18 r86148)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] magrittr_2.0.3              systemPipeR_2.9.1          
##  [3] ShortRead_1.61.2            GenomicAlignments_1.39.4   
##  [5] SummarizedExperiment_1.33.3 Biobase_2.63.0             
##  [7] MatrixGenerics_1.15.0       matrixStats_1.2.0          
##  [9] BiocParallel_1.37.1         Rsamtools_2.19.4           
## [11] Biostrings_2.71.4           XVector_0.43.1             
## [13] GenomicRanges_1.55.4        GenomeInfoDb_1.39.9        
## [15] IRanges_2.37.1              S4Vectors_0.41.5           
## [17] BiocGenerics_0.49.1         BiocStyle_2.31.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        viridisLite_0.4.2       dplyr_1.1.4            
##  [4] farver_2.1.1            bitops_1.0-7            fastmap_1.1.1          
##  [7] digest_0.6.35           lifecycle_1.0.4         ellipsis_0.3.2         
## [10] compiler_4.4.0          rlang_1.1.3             sass_0.4.9             
## [13] tools_4.4.0             utf8_1.2.4              yaml_2.3.8             
## [16] systemPipeRdata_2.7.0   knitr_1.45              S4Arrays_1.3.6         
## [19] labeling_0.4.3          htmlwidgets_1.6.4       interp_1.1-6           
## [22] DelayedArray_0.29.9     xml2_1.3.6              RColorBrewer_1.1-3     
## [25] abind_1.4-5             withr_3.0.0             hwriter_1.3.2.1        
## [28] grid_4.4.0              fansi_1.0.6             latticeExtra_0.6-30    
## [31] colorspace_2.1-0        ggplot2_3.5.0           scales_1.3.0           
## [34] cli_3.6.2               rmarkdown_2.26          crayon_1.5.2           
## [37] generics_0.1.3          remotes_2.5.0           rstudioapi_0.15.0      
## [40] cachem_1.0.8            stringr_1.5.1           zlibbioc_1.49.3        
## [43] parallel_4.4.0          formatR_1.14            BiocManager_1.30.22    
## [46] vctrs_0.6.5             Matrix_1.6-5            jsonlite_1.8.8         
## [49] bookdown_0.38           systemfonts_1.0.6       jpeg_0.1-10            
## [52] magick_2.8.3            crosstalk_1.2.1         jquerylib_0.1.4        
## [55] glue_1.7.0              codetools_0.2-19        DT_0.32                
## [58] stringi_1.8.3           gtable_0.3.4            deldir_2.0-4           
## [61] munsell_0.5.0           tibble_3.2.1            pillar_1.9.0           
## [64] htmltools_0.5.7         GenomeInfoDbData_1.2.11 R6_2.5.1               
## [67] evaluate_0.23           kableExtra_1.4.0        lattice_0.22-6         
## [70] highr_0.10              png_0.1-8               bslib_0.6.1            
## [73] Rcpp_1.0.12             svglite_2.1.3           SparseArray_1.3.4      
## [76] xfun_0.42               pkgconfig_2.0.3

7 Funding

This project is funded by NSF award ABI-1661152.

8 References