Contents

1 Introduction

RNA-Seq is a revolutionary approach for discovering and investigating the transcriptome using next-generation sequencing technologies(Wang et al. 2009). Typically, this transcriptome analysis aims to identify genes differentially expressed across different conditions or tissues, resulting in an understanding of the important pathways that are associated with the conditions(Wang et al. 2009).

RNASeqR is an user-friendly R-based tool for running an RNA-Seq analysis pipeline including quality assessment, read alignment and quantification, differential expression analysis, and functional analysis. The main features of this package are an automated workflow, comprehensive reports with data visualization, and an extendable file structure. In this package, the new tuxedo pipeline published in Nature Protocols in 2016 can be fully implemented in the R environment, including extra functions such as reads quality assessment and functional analysis.

The following are the main tools that are used in this package: ‘HISAT2’ for read alignment(Kim et al. 2015); ‘StringTie’ for alignment assembly and transcripts quantification(Pertea et al. 2015); ‘Rsamtools’ for converting SAM files to BAM files(Morgan et al. 2018); ‘Gffcompare’ for comparing merged GTF files with reference GTF files; ‘systemPipeR’ for quality assessment(Backman et al. 2016); ‘ballgown’ (Fu et al. 2018), ‘DESeq2’ (Love et al. 2014) and ‘edgeR’ (Robinson et al. 2010;McCarthy et al. 2012) for finding potential differentially expressed genes; and ‘clusterProfiler’ (Yu et al. 2012) for Gene Ontology(GO) functional analysis and Kyoto Encyclopedia of Genes and Genomes(KEGG) pathway analysis.

The central concept behind this package is that each step involved in RNA-Seq data analysis is a function call in R. At the beginning, users will create a RNASeqRParam S4 object by running the RNASeqRParam() constructor function for all variable checking. After the creation of RNASeqRParam, it will be used as input to the following analysis functions.

  1. RNASeqEnvironmentSet_CMD() or RNASeqEnvironmentSet(): to setup the RNA-Seq environment.

  2. RNASeqQualityAssessment_CMD() or RNASeqQualityAssessment(): (optional) to run a quality assessment step.

  3. RNASeqReadProcess_CMD() or RNASeqReadProcess(): to run reads alignment and quantification.

  4. RNASeqDifferentialAnalysis_CMD() or RNASeqDifferentialAnalysis(): to run differential analysis via different R packages.

  5. RNASeqGoKegg_CMD() or RNASeqGoKegg(): to conduct GO and KEGG analysis.

Functions with the CMD suffix create an R script and run nohup R CMD BATCH script.R in the background. Functions with no CMD suffix process in the R shell. After running the above functions, the whole RNA-Seq analysis is done and the files generated in each step are stored in an organized file directory. The RNASeqR package makes two-group comparative RNA-Seq analysis more efficient and easier for users.

Functions with CMD suffix will create an R script and run nohup R CMD BATCH script.R in background while functions with no CMD suffix will be processed in R shell. Files generated in each step will be kept in proper directory. Once the workflow is completed, a comprehensive RNA-Seq analysis is done. Additionally, this package is mainly designed for a two-group comparison setting, i.e. differential expression profile between two conditions.

2 Sample Definition

Sample data used in this vignette can be downloaded from the RNASeqRData experiment package. It originated from NCBI’s Sequence Read Archive for the entries SRR3396381, SRR3396382, SRR3396384, SRR3396385, SRR3396386, and SRR3396387. These samples are from Saccharomyces cerevisiae. Suitable reference genome and gene annotation files for this species can be further downloaded from iGenomes, Ensembl, R64-1-1. To create a mini dataset for demonstration purposes, reads aligned to the region from 0 to 100000 on chromosome XV were extracted. The analysis of this mini dataset will be shown in this vignette. The experimental data package is located here.

For more real case-control data and RNA-Seq analysis results from this package, please go to this website: https://github.com/HowardChao/RNASeqR_analysis_result.

3 System Requirements

Necessary:

  1. R version >= 3.5.0

  2. Operating System: Linux and macOS are supported in the RNASeqR package. Windows is not supported. (because StringTie and HISAT2 are not available for Windows).

  3. Third-party software used in this package includes HISAT2, StringTie and Gffcompare. The availability of these commands will be checked by system2() through the R shell at the end of the ‘Environment Setup’ step. The environment must successfully built before running the following RNA-Seq analysis. By default, binaries will be installed based on the operating system of the workstation; therefore there is no additional compiling required. Alternatively, users can still decide to skip certain software binary installation. For more details, please refer to the ‘Environment Setup’ chapter.

Recommended:

  1. Python: Python2 or Python3.

  2. 2to3: If the Python version on the workstation is 3, this command will be used. Generally, 2to3 is available if Python3 is available.
    • Python and 2to3 are used for creating raw read counts for DESeq2 and edgeR.

    • The following are two conditions that will create a raw read count from StringTie output.
      1. Python2
      2. Python3 with 2to3 command available.
    • If one of these conditions is met, the raw read count will be created and DESeq2 and edgeR will be run automatically in the ‘Gene-level Differential Analyses’ step. If not, DESeq2 and edgeR will be skipped during ‘Gene-level Differential Analysis’ step. Checking the Python version and 2to3 command on the workstation beforehand is highly recommended but not necessary.

  3. HISAT2 indexex: Users are advised to provide an ‘indices/’ directory in ‘inputfiles/’. HISAT2 requires at least 160 GB of RAM and several hours to index the entire human genome.

4 Installation

if (!require("BiocManager")) {
    install.packages("BiocManager")
}
BiocManager::install("RNASeqRData")

5 “RNASeqRParam” S4 Object Creation

This is the first step of the RNA-Seq analysis workflow in this package. Prior to conducting RNA-Seq analysis, it is necessary to implement a constructor function, called RNASeqRParam() and to create a RNASeqRParam S4 object which stores parameters not only for pre-checking but also for utilizing as input the parameters in the subsequent steps.

5.1 RNASeqRParam Slots Explanation

There are 11 slots in RNASeqRParam:

  1. os.type : The operating system type. Value is linux or osx. This package only supports ‘Linux’ and ‘macOS’ (not ‘Windows’). If another operating system is detected, ERROR will be reported.

  2. python.variable : A Python-related variable. The value is a list of whether Python is available and the Python version (TRUE or FALSE, 2 or 3).

  3. python.2to3 : Availability of the 2to3 command. THe value is TRUE or FALSE.

  4. path.prefix : Path prefix of the ‘gene_data/’, ‘RNASeq_bin/’, ‘RNASeq_results/’, ‘Rscript/’ and ‘Rscript_out/’ directories. It is recommended that you create a new, empty directory in which all the subsequent RNA-Seq results can be saved.

  1. input.path.prefix : Path prefix of the ‘input_files/’ directory. Users should prepare an ‘input_file/’ directory with the following rules:
    • genome.name.fa: Reference genome in FASTA file formation.

    • genome.name.gtf: Gene annotation in GTF file formation.

    • raw_fastq.gz/: Directory storing FASTQ files.

      • Supports paired-end read files only.

      • Names of paired-end FASTQ files : ’sample.pattern_1.fastq.gz’ and ’sample.pattern_2.fastq.gz’. sample.pattern must be distinct for each sample.

    • phenodata.csv: Information about the RNA-Seq experiment design.

      • First column : Distinct ids for each sample. The value of each sample of this column must match sample.pattern in FASTQ files in ‘raw_fastq.gz/’. Column names must be ids.

      • Second column : An independent variable for the RNA-Seq experiment. Values of each sample of this column can only be parameter case.group or control.group. The column name is the parameter independent.variable.

    • indices/ : The directory storing HT2 index files for the HISAT2 alignment tool.

      • This directory is optional. HT2 index files corresponding to the reference genome can be installed at HISAT2 official website. Providing HT2 files can accelerate the subsequent analysis steps. It is highly advised that you install HT2 files.

      • If HT2 index files are not provided, the ‘input_files/indices/’ directory should be deleted.

  1. genome.name : The genome name defined in this RNA-Seq workflow (ex. ‘genome.name.fa’, ‘genome.name.gtf’)

  2. sample.pattern : A regular expression of paired-end fastq.gz files under ‘input_files/raw_fastq.gz/’. IMPORTANT!! The expression shouldn’t have _[1,2].fastq.gz at the end.

  3. independent.variable: Independent variable for the biological experiment design of a two-group RNA-Seq analysis workflow.

  4. case.group : Name of the case group.

  5. control.group : Name of the control group.

  6. indices.optional : A logical value indicating whether ‘input_files/indices/’ exists. Value is TRUE or FALSE

5.2 RNASeqRParam Constructor Checking

  1. Create a new directory for the RNA-Seq analysis. It is highly recommended to create a new, completely empty directory. The parameter path.prefix of RNASeqRParam() constructor should be the absolute path of this new directory. All the RNA-SeqR-related files are generated in the subsequent steps will be stored inside of this directory.

  2. Create a valid ‘input_files/’ directory. You should create a file directory named ‘input_files/’ with the neccessary files inside. It should follow the rules mentioned above.

  3. Run the constructor of the RNASeqRParam S4 object. This constructor will check the validity of the input parameters before creating the S4 objects.
    • Operating system

    • Python version

    • 2to3 command

    • Structure, content, and rules of ‘inputfiles/’

    • Validity of input parameters

5.3 Example

library(RNASeqR)
library(RNASeqRData)
input_files.path <- system.file("extdata/", package = "RNASeqRData")
rnaseq_result.path <- tempdir(check = TRUE)
list.files(input_files.path, recursive = TRUE)
##  [1] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.fa" 
##  [2] "input_files/Saccharomyces_cerevisiae_XV_Ensembl.gtf"
##  [3] "input_files/phenodata.csv"                          
##  [4] "input_files/raw_fastq.gz/SRR3396381_XV_1.fastq.gz"  
##  [5] "input_files/raw_fastq.gz/SRR3396381_XV_2.fastq.gz"  
##  [6] "input_files/raw_fastq.gz/SRR3396382_XV_1.fastq.gz"  
##  [7] "input_files/raw_fastq.gz/SRR3396382_XV_2.fastq.gz"  
##  [8] "input_files/raw_fastq.gz/SRR3396384_XV_1.fastq.gz"  
##  [9] "input_files/raw_fastq.gz/SRR3396384_XV_2.fastq.gz"  
## [10] "input_files/raw_fastq.gz/SRR3396385_XV_1.fastq.gz"  
## [11] "input_files/raw_fastq.gz/SRR3396385_XV_2.fastq.gz"  
## [12] "input_files/raw_fastq.gz/SRR3396386_XV_1.fastq.gz"  
## [13] "input_files/raw_fastq.gz/SRR3396386_XV_2.fastq.gz"  
## [14] "input_files/raw_fastq.gz/SRR3396387_XV_1.fastq.gz"  
## [15] "input_files/raw_fastq.gz/SRR3396387_XV_2.fastq.gz"

Check the files in ‘inputfiles/’ directory.

exp <- RNASeqRParam(path.prefix = rnaseq_result.path, 
                    input.path.prefix = input_files.path, 
                    genome.name = "Saccharomyces_cerevisiae_XV_Ensembl", 
                    sample.pattern = "SRR[0-9]*_XV",
                    independent.variable = "state", 
                    case.group = "60mins_ID20_amphotericin_B", 
                    control.group = "60mins_ID20_control")
show(exp)
## RNASeqRParam S4 object
##               os.type : linux 
##       python.variable : (Availability: TRUE , Version: 2 )
##           python.2to3 : FALSE 
##           path.prefix : /tmp/RtmpbYbYQc/ 
##     input.path.prefix : /home/biocbuild/bbs-3.10-bioc/R/library/RNASeqRData/extdata/ 
##           genome.name : Saccharomyces_cerevisiae_XV_Ensembl 
##        sample.pattern : SRR[0-9]*_XV 
##  independent.variable : state 
##            case.group : 60mins_ID20_amphotericin_B 
##         control.group : 60mins_ID20_control 
##      indices.optional : FALSE 
##  independent.variable : state

In this example, the RNASeqRParam S4 object is stored in exp for subsequent RNA-Seq analysis steps. Any ERROR occurring in the checking steps will terminate the program.

6 Environment Setup

This is the second step of the RNA-Seq analysis workflow in this package. To set up the environment, run RNASeqEnvironmentSet_CMD() to execute the process in the background, or run RNASeqEnvironmentSet() to execute the process in the R shell.

6.1 File Setup

  1. Create Base Directories ‘gene_data/’, ‘RNASeq_bin/’, ‘RNASeq_results’, ‘Rscript’, and ‘Rscript_out’ will be created in the path.prefix directory. Here is the usage of these five main directories:
    • ‘gene_data/’: Symbolic links of ‘input_files/’ and files that are created in each step of RNA-Seq analysis will be stored in this directory.

    • ‘RNASeq_bin/’: The binaries of necessary tools, HISAT2, SAMtools, StringTie and Gffcompare, are installed in this directory.

    • ‘RNASeq_results’: The RNA-Seq results, for example, alignment results, quality assessment results, differential analysis results etc., will be stored in this directory.

    • ‘Rscript’: If your run the XXX_CMD() function, the corresponding R script(XXX.R) for certain steps will be created in this directory.

    • ‘Rscript_out’: The corresponding output report for R scripts (XXX.Rout) will be stored in this directory.

  2. Symbolic links will be created from files in ‘input_files/’ to the path.prefix directory.

6.2 Installation of Necessary Tools

The operating system of your workstation will be detected. If the operating system is not Linux or macOS, ERROR will be reported. Users can decide whether the installation of essential programs(HISAT2, StringTie and Gffcompare) is going be automatically processed.

Third-party softwares used in this package includes HISAT2, StringTie and Gffcompare. Binaries are available for these three programs, and by default, they will be installed automatically based on the operating system of the workstation. Zipped binaries will be unpacked and exported to the R environment PATH. No compilation is needed.

To specify, there are three parameters(install.hisat2, install.stringtie and install.gffcompare) in both the RNASeqEnvironmentSet_CMD() and RNASeqEnvironmentSet() functions for users to determine which software is going to be installed automatically or skipped. The default settings of these parameters are TRUE so that these three programs will be installed directly. Otherwise, users can skip certain software installation processes by turning the values to FALSE. Please make sure to check that the skipped programs are available using system2() through the R shell. Unavailability of any of the programs will cause failure in the ‘Environment Setup’ step.

Here is the version information of each software binary.

  • HISAT2
    • Based on your operating system, hisat2-2.1.0-Linux_x86_64.zip or hisat2-2.1.0-OSX_x86_64.zip zipped file will be installed.
    • The downloaded file will be unzipped and all binary files will be copied to and run from ‘RNASeq_bin/’
  • StringTie
    • Based on your operating system, stringtie-1.3.4d.Linux_x86_64.tar.gz or stringtie-1.3.4d.Linux_x86_64 will be installed.
    • The downloaded file will be unzipped and all binary files will be copied to and run from ‘RNASeq_bin/’.
  • Gffcompare
    • Based on your operating system, gffcompare-0.10.4.Linux_x86_64.tar.gz or gffcompare-0.10.4.Linux_x86_64.tar.gz will be installed.
    • The downloaded file will be unzipped and all binary files will be copied to and run from ‘RNASeq_bin/’.

6.3 Export Path

‘RNASeq_bin/’ will be added to the R environment PATH so that these binaries can be found in the R environment in the R shell through system2(). In the last step of environment setup, the hisat2 --version,stringtie --version,gffcompare --version, and samtools --version commands will be checked in order to make sure the environment is correctly constructed. The environment must be set up successfully before the subsequent analyses.

6.4 Example

Run RNASeqEnvironmentSet_CMD() or RNASeqEnvironmentSet().

  1. For runs in the background, the results will be reported in ‘Rscript_out/Environment_Set.Rout’. Make sure the environment is successfully set up before running the subsequent steps.
RNASeqEnvironmentSet_CMD(exp)
  1. For runs in the R shell, the results will be reported in the R shell. Make sure the environment is successfully setup before running the subsequent steps.
RNASeqEnvironmentSet(exp)

7 Quality Assessment of FASTQ sequence data

This is the third step of the RNA-Seq analysis workflow in this package. Different from other necessary steps, it is optional and can be run several times with each result stored separately. Although this step can be skipped, it is strongly recommended that it be performed before processing the alignment step. To evaluate the quality of the raw reads in the FASTQ files, run RNASeqQualityAssessment_CMD() in the background or run RNASeqQualityAssessment() to execute the process in the R shell.

7.1 “systemPipeR” Quality Assessment

In this step, the systemPipeR package is used for evaluating sequencing reads and the details are as follows:

  1. Check the number of times that the user has run the quality assessment process and create the corresponding files ‘RNASeq_results/QA_results/QA_{times}’.

  2. RNA-Seq environment set up. The ‘rnaseq/’ directory will be created by systemPipeR package.

  3. Create the ‘data.list.txt’ file.

  4. Reading FASTQ files and create ‘fastqReport.pdf’ containing the results of the quality assessment.

  5. Remove the ‘rnaseq/’ directory.

This quality assessment result (example below) is generated by systemPipeR package. It will be stored as a PDF.