1 Introduction to consensusDE

consensusDE aims to make differential expression (DE) analysis, with reporting of significance scores from multiple methods easy. It implements wrappers for Voom, DEseq2 and EdgeR and reports differential expression results seperately for each algorithm, as well as merging the results into a single table for determining consensus. The results of the merged table, are ordered by the summed ranks of the p-values for each algorithm and the intersect at minimum p-value threshols accross all methods is provided as the p_max (see below).

consensusDE is simplified into two functions:

  • buildSummarized() generate a summarized experiment that counts reads mapped (from bam files or htseq count files) against a transcriptome
  • multi_de_pairs() perform DE analysis (all possible pairwise comparisons)

Below, we demonstrate the core functionality of consensusDE as well as how to plot results using the diag_plots function.

2 consensusDE examples

Begin by first installing and then loading the consensusDE library. To illustrate functionality of consensusDE, we will utilise data from the airway and annotation libraries as follows. Begin by installing and attaching data from these libraries as follows:

3 Building a summarized experiment

A summarized experiment is an object that stores all relevant information for performing differential expression analysis. buildSummarized() allows users to build a summarized experiment object by simply providing 1) a table of bam/files (more below on format), 2) a directory of where to locate the bam/htseq files and 3) a transcript database to map the reads to (either a gtf file or txdb). Below we will use bam files attached to this package (from GenomicAlignments) as an example:

The minimum information is now ready to build a summarized experiment:

This will output a summarized object that has mapped the reads for the bam files that are listed in sample_table, located in bam_dir, against the transcript database provided: TxDb.Dmelanogaster.UCSC.dm3.ensGene. Bam file format, whether “paired” or “single” end (the type of sequencing technology used) must be specified using the read_format parameter. gtf formatted transcript databases can also be used instead of a txdb, by providing the full path to the gtf file using the gtf parameter. To save the summarized experiment externally, for future use, specify the path to save the summarized experiment using output_log To see details of all parameters see ?buildSummarized.

Overview of the summarized experiment:

## class: RangedSummarizedExperiment 
## dim: 15682 2 
## metadata(1): gene_coords
## assays(1): counts
## rownames(15682): FBgn0000003 FBgn0000008 ... FBgn0264726
##   FBgn0264727
## rowData names(0):
## colnames(2): sm_treated1.bam sm_untreated1.bam
## colData names(2): file group

3.1 Filtering low count data

buildSummarized() also allows users to filter out low read counts. This can be done when building the summarized experiment, or re-running with the summarized experiment output using buildSummarized(). See “Performing Differential Expresssion” below with filter example.

3.2 Building a tx_db object first

Sometimes it will be convenient to first build a txdb object and then pass this txdb object to buildSummarized using the tx_db parameter. This can be done as follows:

txdb <- makeTxDbFromGFF("/path/to/my.gtf", format="gtf", circ_seqs=character())

4 Performing Differential Expresssion

For differential expression (DE) analysis we will use the airway data for demonstration. See ?airway for more details for this experiment. NOTE: the summarized meta-data must include the columns “group” and “file” to build the correct models. For illustration, we sample 1000 genes from this dataset.

Running multi_de_pairs() will perform DE analysis on all possible pairs of “groups” and save these results as a simple list of “merged” - being the merged results of “deseq”, “voom” and “edger” into one table, as well as the latter three as objects independently. The data frame is sorted by the rank_sum. The following columns are now included:

  • ID - Identifier
  • AveExpr - Average Expression (taken from EdgeR)
  • LogFC - Log2 Fold-Change, also known as a log-ratio (taken from EdgeR)
  • edger_adj_p - EdgeR p-value adjusted for multiple hypotheses
  • deseq_adj_p - DESeq2 p-value adjusted for multiple hypotheses
  • voom_adj_p - Limma/voom p-value adjusted for multiple hypotheses
  • edger_rank - rank of the p-value obtained by EdgeR
  • deseq_rank - rank of the p-value obtained by DESeq2
  • voom_rank - rank of the p-value obtained by Limma/voom
  • rank_sum - sum of the ranks from edger_rank, voom_rank, rank_sum
  • p_max - the largest p-value observed from all methods tested.
    • This represents the intersect when a threshold is set on the p_max column

To access the merged results:

## [1] "untrt-trt"
##                ID  AveExpr     LogFC  edger_adj_p  deseq_adj_p
## 1 ENSG00000120129 10.90371 -2.939919 5.944412e-35 5.033205e-44
## 2 ENSG00000116584 10.88937  1.026538 4.135363e-16 1.448676e-40
## 3 ENSG00000077684 10.33310 -1.197176 8.043754e-16 3.646054e-26
## 4 ENSG00000177666 10.46821 -1.094580 3.967416e-14 5.220645e-25
## 5 ENSG00000103196 10.82936 -2.691813 1.187751e-17 1.448045e-21
## 6 ENSG00000139289 10.70412  1.003049 2.108502e-13 3.109897e-24
##     voom_adj_p edger_rank deseq_rank voom_rank rank_sum        p_max
## 1 0.0001129928          1          1       1.0      3.0 0.0001129928
## 2 0.0001195485          3          2       2.0      7.0 0.0001195485
## 3 0.0002112779          4          3       4.0     11.0 0.0002112779
## 4 0.0002112779          5          4       4.0     13.0 0.0002112779
## 5 0.0004213650          2          6       7.5     15.5 0.0004213650
## 6 0.0002112779          9          5       4.0     18.0 0.0002112779

4.1 Annotating DE results

Currently only ENSEMBL annotations are supported.

It is often useful to add additional annotated information to the output tables. This can be achieved by providing a database for annotations via ensembl_annotate. Annotations needs to be a Genome Wide Annotation object, e.g. org.Mm.eg.db for mouse or org.Hs.eg.db for human from BioConductor. For example, to install the database for the mouse annotation, go to http://bioconductor.org/packages/org.Mm.eg.db and follow the instructions. Ensure that after installing the database package that the library is loaded using library(org.Mm.eg.db). When running, “‘select()’ returned 1:many mapping between keys and columns” will appear on the command line. This is the result of multiple mapped transcript ID to Annotations. Only the first annotation is reported. See ?multi_de_pairs for additional documentation.

An example of annotating the above filtered airway data is provided below:

## Warning in buildSummarized(summarized = airway_filter, tx_db = EnsDb.Hsapiens.v86, : No output directory provided. The se file and sample_table will not
##           be saved
##                  ID  AveExpr     LogFC  edger_adj_p  deseq_adj_p
## 275 ENSG00000120129 10.90371 -2.939919 5.944412e-35 5.033205e-44
## 245 ENSG00000116584 10.88937  1.026538 4.135363e-16 1.448676e-40
## 90  ENSG00000077684 10.33310 -1.197176 8.043754e-16 3.646054e-26
## 656 ENSG00000177666 10.46821 -1.094580 3.967416e-14 5.220645e-25
## 149 ENSG00000103196 10.82936 -2.691813 1.187751e-17 1.448045e-21
## 395 ENSG00000139289 10.70412  1.003049 2.108502e-13 3.109897e-24
##       voom_adj_p edger_rank deseq_rank voom_rank rank_sum        p_max
## 275 0.0001129928          1          1       1.0      3.0 0.0001129928
## 245 0.0001195485          3          2       2.0      7.0 0.0001195485
## 90  0.0002112779          4          3       4.0     11.0 0.0002112779
## 656 0.0002112779          5          4       4.0     13.0 0.0002112779
## 149 0.0004213650          2          6       7.5     15.5 0.0004213650
## 395 0.0002112779          9          5       4.0     18.0 0.0002112779
##                                                     genename   symbol
## 275                           dual specificity phosphatase 1    DUSP1
## 245             Rho/Rac guanine nucleotide exchange factor 2  ARHGEF2
## 90                                  jade family PHD finger 1    JADE1
## 656           patatin like phospholipase domain containing 2   PNPLA2
## 149 cysteine rich secretory protein LCCL domain containing 2 CRISPLD2
## 395        pleckstrin homology like domain family A member 1   PHLDA1
##      kegg                   coords strand  width
## 275 04010 chr5:172768090-172771195      -   3106
## 245 05130 chr1:155946851-156007070      -  60220
## 90   <NA> chr4:128809623-128875224      +  65602
## 656  <NA>      chr11:818902-825573      +   6672
## 149  <NA>  chr16:84819984-84920768      + 100785
## 395  <NA>  chr12:76025447-76033932      -   8486

The following additional columns will now be present:

  • genename - extend gene names (e.g. alpha-L-fucosidase 2)
  • symbol - gene symbol (e.g. FUCA2)
  • kegg - kegg pathway identifier (e.g. 00511)

If metadata for the transcript database used to build the summarized experiment was included, the following annotations will also be included:

  • coords - chromosomal coordinates (e.g. chr6:143494811-143511690)
  • strand - strand transcript is on (i.e. + or -)
  • width - transcript width in base pairs (bp) (transcript start to end) (e.g. 16880 bp)

4.2 Writing tables to an output directory

multi_de_pairs provides options to automatically write all results to output directories when a full path is provided. Which results are output depends on which directories are provided. Full paths provided to the parameters of output_voom, output_edger, output_deseq and output_combined will output Voom, EdgeR, DEseq and the merged results to the directories provided, respectively.

5 Removing unwanted variation (RUV)

consensusDE also provides the option to remove batch effects through RUVseq functionality. consensusDE currently implements RUVr which models a first pass generalised linear model (GLM) using EdgeR and obtaining residuals for incorporation into the SummarizedExperiment object for inclusion in the models for DE analysis. The following example, uses RUV to identify these residuals. To view the residuals in the model see the resisuals section below in the plotting functions. Note, that if ruv_correct = TRUE and a path to a plot_dir is provided, diagnostic plots before and after RUV correction will be produced. The residuals can also be accessed in the summarizedExperiment as below. These are present in the “W_1” column. At present only one factor of variation is determined.

##            SampleName    cell   dex albut        Run avgLength Experiment
## SRR1039508 GSM1275862  N61311 untrt untrt SRR1039508       126  SRX384345
## SRR1039509 GSM1275863  N61311   trt untrt SRR1039509       126  SRX384346
## SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512       126  SRX384349
## SRR1039513 GSM1275867 N052611   trt untrt SRR1039513        87  SRX384350
## SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516       120  SRX384353
## SRR1039517 GSM1275871 N080611   trt untrt SRR1039517       126  SRX384354
## SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520       101  SRX384357
## SRR1039521 GSM1275875 N061011   trt untrt SRR1039521        98  SRX384358
##               Sample    BioSample group       file         W_1
## SRR1039508 SRS508568 SAMN02422669 untrt SRR1039508 -0.10998592
## SRR1039509 SRS508567 SAMN02422675   trt SRR1039509 -0.09591722
## SRR1039512 SRS508571 SAMN02422678 untrt SRR1039512 -0.08654073
## SRR1039513 SRS508572 SAMN02422670   trt SRR1039513 -0.22748616
## SRR1039516 SRS508575 SAMN02422682 untrt SRR1039516  0.50010306
## SRR1039517 SRS508576 SAMN02422673   trt SRR1039517  0.67242765
## SRR1039520 SRS508579 SAMN02422683 untrt SRR1039520 -0.28015978
## SRR1039521 SRS508580 SAMN02422677   trt SRR1039521 -0.37244089
##                ID  AveExpr     LogFC  edger_adj_p  deseq_adj_p
## 1 ENSG00000120129 10.90371 -2.911622 4.273352e-50 2.002719e-68
## 2 ENSG00000103196 10.82936 -2.601881 6.670591e-31 7.335488e-40
## 3 ENSG00000077684 10.33310 -1.186168 1.163229e-19 7.883643e-39
## 4 ENSG00000116584 10.88937  1.026711 4.163900e-16 1.103078e-31
## 5 ENSG00000181061 10.95372 -1.190232 5.426888e-15 7.241683e-21
## 6 ENSG00000177666 10.46821 -1.092451 1.674465e-14 1.622165e-22
##     voom_adj_p edger_rank deseq_rank voom_rank rank_sum        p_max
## 1 5.712514e-05          1          1       1.0      3.0 5.712514e-05
## 2 1.211610e-04          2          2       3.5      7.5 1.211610e-04
## 3 1.085488e-04          3          3       2.0      8.0 1.085488e-04
## 4 1.211610e-04          5          4       3.5     12.5 1.211610e-04
## 5 3.139556e-04          6          7       6.0     19.0 3.139556e-04
## 6 3.139556e-04          8          6       6.0     20.0 3.139556e-04

6 DE analysis with paired samples

multi_de_pairs supports DE with paired samples. Paired samples may include, for example, the same patient observed before and after a treatment. For demonstration purposes, we assume that each untreated and treated sample is a pair.

NB. paired analysis with more than two groups is not currently supported. If there are more than two groups, consider testing each of the groups and their pairs seperately, or see the edgeR, limma/voom or DESeq2 vignettes for establishing a multi-variate model with blocking factors.

First we will update the summarized experiment object to include a “pairs” column and set paired = "paired" in multi_de_pairs.

##                ID   AveExpr     LogFC   edger_adj_p   deseq_adj_p
## 1 ENSG00000211445 12.332857 -3.711412 7.195857e-229 5.612481e-249
## 2 ENSG00000120129 10.903707 -2.934433 3.127803e-161 2.170183e-143
## 3 ENSG00000103196 10.829358 -2.609439 2.307845e-144 6.247140e-121
## 4 ENSG00000253368  9.516104 -1.976023  2.913325e-94  3.191399e-57
## 5 ENSG00000137672  8.861033 -1.943783  1.701067e-88  1.402861e-47
## 6 ENSG00000068383  9.899344 -1.369965  4.836484e-54  1.790181e-33
##     voom_adj_p edger_rank deseq_rank voom_rank rank_sum        p_max
## 1 4.707166e-12          1          1       1.0      3.0 4.707166e-12
## 2 6.556469e-11          2          2       2.0      6.0 6.556469e-11
## 3 8.761573e-11          3          3       3.0      9.0 8.761573e-11
## 4 1.398148e-09          4          4       4.0     12.0 1.398148e-09
## 5 1.441460e-09          5          5       5.0     15.0 1.441460e-09
## 6 2.522943e-08          7          8       6.5     21.5 2.522943e-08

The design matrix can be retrieved as follows (from e.g. the voom model fit)

##            Intercept        W_1 pair2 pair3 pair4 untrt
## SRR1039508         1 -0.2193518     0     0     0     1
## SRR1039509         1  0.2085335     0     0     0     0
## SRR1039512         1 -0.2642036     1     0     0     1
## SRR1039513         1  0.2728930     1     0     0     0
## SRR1039516         1  0.6118941     0     1     0     1
## SRR1039517         1 -0.5928095     0     1     0     0
## SRR1039520         1 -0.1465777     0     0     1     1
## SRR1039521         1  0.1296221     0     0     1     0
## attr(,"assign")
## [1] 0 1 2 2 2 3
## attr(,"contrasts")
## attr(,"contrasts")$pairs
## [1] "contr.treatment"
## 
## attr(,"contrasts")$group
## [1] "contr.treatment"

7 Plotting functions

When performing DE analysis, a series of plots (currently 10) can be generated and saved as .pdf files in a plot directory provided to multi_de_pairs() with the parameter: plot_dir = "/path/to/save/pdfs/. See ?multi_de_pairs for description.

In addition, each of the 10 plots can be plotted individually using the diag_plots function. See ?diag_plots for description, which provides wrappers for 10 different plots. Next we will plot each of these using the example data.

7.1 Mapped reads

Plot the number of reads that mapped to the transcriptome of each sample. The sample numbers on the x-axis correspond to the sample row number in the summarizedExperiment built, accessible using colData(airway). Samples are coloured by their “group”.

7.4 RUV residuals

Residuals for the RUV model can be plotted as follows: