scPipe: a flexible data preprocessing pipeline for 3’ end scRNA-seq data

Luyi Tian

2018-04-30

Introduction

scPipe is a package designed to process single-cell RNA-sequencing (scRNA-seq) data generated by different protocols. It is designed for protocols with UMI, but can also adapt to non-UMI protocols. scPipe consist of two major components. The first is data preprocessing with raw fastq as input and a gene count matrix as output. The second component starts from the gene count matrix, includes quality control and a shiny app for clustering and visualization.

Case Study: Preprocessing CEL-seq2 data

CEL-seq2 is a modified version of the CEL-seq method with higher sensitivity and lower cost. It uses a polyT primer with index sequences to capture a cell’s mRNA. The index sequence consists of the a cell specific index, which is designed, and a molecular index, also called unique molecular identifier (UMI), which is a random sequence that should be different for each primer (the number of distinct UMIs will depend on its length). The ouput fastq files from a CEL-seq2 experiment is paired-ended where the first read contains the transcript and second read contains the cellindex+UMI with some polyA sequence.

Getting started

We begin by creating a folder to store the results.

## Loading required package: ggplot2
## Loading required package: SingleCellExperiment
## Loading required package: SummarizedExperiment
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
##     as.data.frame, basename, cbind, colMeans, colSums, colnames,
##     dirname, do.call, duplicated, eval, evalq, get, grep, grepl,
##     intersect, is.unsorted, lapply, lengths, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, rank, rbind,
##     rowMeans, rowSums, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which, which.max, which.min
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: DelayedArray
## Loading required package: matrixStats
## 
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## Loading required package: BiocParallel
## 
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
## 
##     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following objects are masked from 'package:base':
## 
##     aperm, apply
## 
## 

Organising the files required

To process the data, we need the genome fasta file, gff3 exon annotation and a cell barcode annotation. The barcode annotation should be a .csv file with at least two columns, where the first column has the cell id and the second column contains the barcode sequence. We use ERCC spike-in genes for this demo. All files can be found in the extdata folder of the scPipe package:

The read structure for this example is paired-ended, with the first longer read mapping to transcripts and second shorter read consisting of 6bp UMIs followed by 8bp cell barcodes. NOTE: by convention, scPipe always assumes read1 refers to the read with the transcript sequence, which is usually the longer read. These data are also available in the in extdata folder:

Data Preprocessing

Fastq reformatting

The pipeline starts with fastq file reformatting. We move the barcode and UMI sequences to the read name and leave the transcript sequence as is. This outputs a read name that looks like @[barcode_sequence]*[UMI_sequence]#[readname] … The read structure of read 2 in our example dataset has the 8 bp long cell barcode starting at position 6 and the 6 bp long UMI sequence starting at the first position. So the read structure will be : list(bs1=-1, bl1=0, bs2=6, bl2=8, us=0, ul=6). bs1=-1, bl1=0 means we don’t have an index in read 1 so we set a negative value as its start position and give it zero length. bs2=6, bl2=8 means we have an index in read two which starts at position 6 in the read and is 8 bases in length. us=0, ul=6 means we have a UMI at the start of read 2 which is 6 bases long. NOTE: we use a zero based index system, so the indexing of the sequence starts at zero.

## trimming fastq file...
## pass QC: 46009
## removed_have_N: 0
## removed_low_qual: 0
## time elapsed: 73 milliseconds

Aligning reads to a reference genome

Next we align reads to the genome. This example uses Rsubread but any aligner that support RNA-seq alignment and gives standard BAM output can be used here. The Rsubread package is not available on Windows, so the following alignment step will only run on Mac OSX or Linux.

## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.30.0
## 
## //=========================== indexBuilder setting ===========================\\
## ||                                                                            ||
## ||                Index name : ERCC_index                                     ||
## ||               Index space : base-space                                     ||
## ||                    Memory : 8000 Mbytes                                    ||
## ||          Repeat threshold : 100 repeats                                    ||
## ||  Distance to next subread : 3                                              ||
## ||                                                                            ||
## ||               Input files : 1 file in total                                ||
## ||                             o ERCC92.fa                                    ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Check the integrity of provided reference sequences ...                    ||
## || There were 1 notes for reference sequences.                                ||
## || The notes can be found in the log file, '/tmp/RtmpIuT6oZ/ERCC_index.log'.  ||
## || Scan uninformative subreads in reference sequences ...                     ||
## ||    8%,   0 mins elapsed, rate=111.4k bps/s, total=0m                       ||
## ||   16%,   0 mins elapsed, rate=203.4k bps/s, total=0m                       ||
## ||   24%,   0 mins elapsed, rate=284.4k bps/s, total=0m                       ||
## ||   33%,   0 mins elapsed, rate=355.2k bps/s, total=0m                       ||
## ||   41%,   0 mins elapsed, rate=417.6k bps/s, total=0m                       ||
## ||   49%,   0 mins elapsed, rate=473.0k bps/s, total=0m                       ||
## ||   58%,   0 mins elapsed, rate=522.5k bps/s, total=0m                       ||
## ||   66%,   0 mins elapsed, rate=566.9k bps/s, total=0m                       ||
## ||   74%,   0 mins elapsed, rate=601.4k bps/s, total=0m                       ||
## ||   83%,   0 mins elapsed, rate=643.7k bps/s, total=0m                       ||
## ||   91%,   0 mins elapsed, rate=677.0k bps/s, total=0m                       ||
## || 1 uninformative subreads were found.                                       ||
## || These subreads were excluded from index building.                          ||
## || Build the index...                                                         ||
## ||   24%,   0 mins elapsed, rate=3508.0k bps/s, total=0m                      ||
## ||   49%,   0 mins elapsed, rate=3238.2k bps/s, total=0m                      ||
## ||  149%,   0 mins elapsed, rate=3413.2k bps/s, total=0m                      ||
## ||  183%,   0 mins elapsed, rate=3589.6k bps/s, total=0m                      ||
## ||  324%,   0 mins elapsed, rate=3853.9k bps/s, total=0m                      ||
## || Save current index block...                                                ||
## ||  [ 0.0% finished ]                                                         ||
## ||  [ 10.0% finished ]                                                        ||
## ||  [ 20.0% finished ]                                                        ||
## ||  [ 30.0% finished ]                                                        ||
## ||  [ 40.0% finished ]                                                        ||
## ||  [ 50.0% finished ]                                                        ||
## ||  [ 60.0% finished ]                                                        ||
## ||  [ 70.0% finished ]                                                        ||
## ||  [ 80.0% finished ]                                                        ||
## ||  [ 90.0% finished ]                                                        ||
## ||  [ 100.0% finished ]                                                       ||
## ||                                                                            ||
## ||                      Total running time: 0.0 minutes.                      ||
## ||          Index /tmp/RtmpIuT6oZ/ERCC_index was successfully built!          ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.30.0
## 
## //========================== subread-align setting ===========================\\
## ||                                                                            ||
## || Function      : Read alignment (RNA-Seq)                                   ||
## || Input file    : combined.fastq                                             ||
## || Output file   : out.aln.bam (BAM)                                          ||
## || Index name    : ERCC_index                                                 ||
## ||                                                                            ||
## ||                    ------------------------------------                    ||
## ||                                                                            ||
## ||                       Threads : 1                                          ||
## ||                  Phred offset : 64                                         ||
## ||                     Min votes : 3 / 10                                     ||
## ||    Maximum allowed mismatches : 3                                          ||
## ||   Maximum allowed indel bases : 5                                          ||
## || # of best alignments reported : 1                                          ||
## ||                Unique mapping : no                                         ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //================ Running (30-Apr-2018 21:38:37, pid=32180) =================\\
## ||                                                                            ||
## || The input file contains base space reads.                                  ||
## || The range of Phred scores observed in the data is [1,1]                    ||
## || Load the 1-th index block...                                               ||
## ||   14% completed, 0.0 mins elapsed, rate=73.3k reads per second             ||
## ||   21% completed, 0.0 mins elapsed, rate=71.6k reads per second             ||
## ||   27% completed, 0.0 mins elapsed, rate=73.2k reads per second             ||
## ||   34% completed, 0.0 mins elapsed, rate=73.1k reads per second             ||
## ||   41% completed, 0.0 mins elapsed, rate=74.6k reads per second             ||
## ||   47% completed, 0.0 mins elapsed, rate=73.8k reads per second             ||
## ||   54% completed, 0.0 mins elapsed, rate=74.6k reads per second             ||
## ||   61% completed, 0.0 mins elapsed, rate=72.9k reads per second             ||
## ||   70% completed, 0.0 mins elapsed, rate=19.6k reads per second             ||
## ||   73% completed, 0.0 mins elapsed, rate=20.2k reads per second             ||
## ||   77% completed, 0.0 mins elapsed, rate=20.6k reads per second             ||
## ||   81% completed, 0.0 mins elapsed, rate=20.9k reads per second             ||
## ||   84% completed, 0.0 mins elapsed, rate=21.2k reads per second             ||
## ||   88% completed, 0.0 mins elapsed, rate=21.6k reads per second             ||
## ||   92% completed, 0.0 mins elapsed, rate=21.9k reads per second             ||
## ||   95% completed, 0.0 mins elapsed, rate=22.2k reads per second             ||
## ||   99% completed, 0.0 mins elapsed, rate=22.5k reads per second             ||
## ||                                                                            ||
## ||                          Completed successfully.                           ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Summary ==================================\\
## ||                                                                            ||
## ||           Processed : 46,009 reads                                         ||
## ||              Mapped : 46,009 reads (100.0%), wherein                       ||
## ||     Uniquely mapped : 46,009                                               ||
## ||       Multi-mapping : 0                                                    ||
## ||                                                                            ||
## ||          Not mapped : 0                                                    ||
## ||                                                                            ||
## ||              Indels : 0                                                    ||
## ||                                                                            ||
## ||        Running time : 0.0 minutes                                          ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//

Assigning reads to annotated exons

After the read alignment, we assign reads to exons based on the annotation provided using the sc_exon_mapping function. Currently scPipe only supports annotation in gff3 or bed format.

## adding annotation files...
## adding gff3 annotation: /tmp/Rtmp8BfmcC/Rinst73994d2b641a/scPipe/extdata/ERCC92_anno.gff3
## Annotation source not recognised, defaulting to ENSEMBL. Current supported sources: ENSEMBL, GENCODE and RefSeq
## time elapsed: 1 milliseconds
## 
## annotating exon features...
## number of read processed: 46009
## unique map to exon: 46009 (100.00%)
## ambiguous map to multiple exon: 0 (0.00%)
## map to intron: 0 (0.00%)
## not mapped: 0 (0.00%)
## unaligned: 0 (0.00%)
## time elapsed: 154 milliseconds

De-multiplexing data and counting genes

Next we use the sc_demultiplex function to split the reads into separate .csv files for each cell in the /count subfolder. Each file contains three columns and each row corresponds to a distinct read. The first column contains the gene ID that the read maps to, the second column gives the UMI sequence for that read and the third column is the distance to the transcript end position (TES) in bp. This file will be used for UMI deduplication and to generate a gene count matrix by calling the sc_gene_counting function.

## demultiplexing reads by barcode...
## Warning: mitochondrial chromosome not found using chromosome name `MT`.
## time elapsed: 77 milliseconds
## 
## summarising gene counts...
## time elapsed: 469 milliseconds

We have now completed the preprocessing steps. The gene count matrix is available as a .csv file in data_dir/gene_count.csv and quality control statistics are saved in the data_dir/stat folder. These data are useful for later quality control (QC).

Preprocessing data generated by other protocols

The greatest difference between scRNA-seq protocols is in their read structure, which is specified by the read_structure argument of the sc_trim_barcode function in scPipe. The major difference between CEL-seq and Drop-seq/Chromium 10x is that the cell barcode is unknown in advance for the latter protocols, so an extra step sc_detect_bc is required before running sc_demultiplex, to identify and generate the cell barcode annotation. scPipe is not optimized for the full-lengh protocols like SMART-seq, but it can process the data generated by such protocols by setting the has_UMI to FALSE in the sc_demultiplex function.

Quality Control

The easiest way to create a SingleCellExperiment object from the output of scPipe preprocessing is using create_sce_by_dir function, that will read in the gene count matrix together with the QC information available in the stat folder.

## [1] 92 10

The dataset we analysed above used ERCC simulated reads from 10 cells of perfect quality. In order to demonstrate QC on a more reaslitic example, we have included an experimental dataset in the scPipe software generated by Dr Christine Biben. This experiment profiles gene expression in 383 blood cells for a subset of 1000 genes (to save space). We first create a SingleCellExperiment object directly without using the create_sce_by_dir function:

## [1] "organism/gene_id_type not provided. Make a guess: mmusculus_gene_ensembl / ensembl_gene_id"

There are several plots we can create to assess the overall quality of the experiment. The first is to look at the cell barcode demultiplexing statistics. To do this we generate a bar plot that shows the percentage of reads that uniquely match to the cell barcodes, as well as the unmatched proportion of reads and their alignment rates to introns and exons. If we observe a large proportion of unmatched reads that map to exons, this indicates a failure of the cell barcode demultiplexing.

A second plot shows the duplication rate which can be used to evaluate read deapth. UMIs are routinely used to mark individual molecules and after PCR amplification, the same molecule with have multiple copys, which can be identified and removed if they are observed to have the same UMI sequence. Therefore, the copy number of UMIs is an indication of the PCR amplification rate.

Next we calculate QC metrics and use the detect_outlier function to identify poor quality cells. The detect_outlier function has argument comp to define the maximum component of the gaussian mixture model. Using the default value of 1 is generally sufficient, but in cases where the data are heterogeneous in terms of quality control metrics, setting this value to 2 or 3 can give better results. This function will remove low quality cells if type="low", large cells if type="high" or both when type="both". The conf argument specifies the lower and upper confidence intervals for outliers and detect_outlier is insensistive to the interval values.

## [1] "no ERCC spike-in. Skip `non_ERCC_percent`"

We can plot the alignment statistics per sample as a barplot using

and generate pairwise plots of the QC metrics as follows:

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)

## Warning: The plyr::rename operation has created duplicates for the
## following name(s): (`hjust`)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The final step will be to remove the low quality cells by the remove_outliers function.

## [1] 1000  362

Downstream analysis

Since the scater and scran packages both use the SingleCellExperiment class, it will be easy to further process this data using these packages for normalization and visualization. Other packages such as SC3 may be useful for clustering and MAST and edgeR for differential expression analysis.

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=en_US.UTF-8    
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] scPipe_1.2.0                SingleCellExperiment_1.2.0 
##  [3] SummarizedExperiment_1.10.0 DelayedArray_0.6.0         
##  [5] BiocParallel_1.14.0         matrixStats_0.53.1         
##  [7] Biobase_2.40.0              GenomicRanges_1.32.0       
##  [9] GenomeInfoDb_1.16.0         IRanges_2.14.0             
## [11] S4Vectors_0.18.0            BiocGenerics_0.26.0        
## [13] ggplot2_2.2.1              
## 
## loaded via a namespace (and not attached):
##  [1] mclust_5.4             Rcpp_0.12.16           lattice_0.20-35       
##  [4] prettyunits_1.0.2      assertthat_0.2.0       rprojroot_1.3-2       
##  [7] digest_0.6.15          Rsubread_1.30.0        R6_2.2.2              
## [10] org.Mm.eg.db_3.6.0     plyr_1.8.4             backports_1.1.2       
## [13] RSQLite_2.1.0          evaluate_0.10.1        httr_1.3.1            
## [16] pillar_1.2.2           zlibbioc_1.26.0        rlang_0.2.0           
## [19] progress_1.1.2         lazyeval_0.2.1         curl_3.2              
## [22] blob_1.1.1             Matrix_1.2-14          rmarkdown_1.9         
## [25] labeling_0.3           Rhtslib_1.12.0         stringr_1.3.0         
## [28] RCurl_1.95-4.10        bit_1.1-12             biomaRt_2.36.0        
## [31] munsell_0.4.3          compiler_3.5.0         pkgconfig_2.0.1       
## [34] htmltools_0.3.6        tibble_1.4.2           GenomeInfoDbData_1.1.0
## [37] XML_3.98-1.11          reshape_0.8.7          MASS_7.3-50           
## [40] bitops_1.0-6           grid_3.5.0             GGally_1.3.2          
## [43] gtable_0.2.0           DBI_0.8                magrittr_1.5          
## [46] scales_0.5.0           stringi_1.1.7          reshape2_1.4.3        
## [49] XVector_0.20.0         robustbase_0.93-0      org.Hs.eg.db_3.6.0    
## [52] RColorBrewer_1.1-2     tools_3.5.0            bit64_0.9-7           
## [55] DEoptimR_1.0-8         yaml_2.1.18            AnnotationDbi_1.42.0  
## [58] colorspace_1.3-2       memoise_1.1.0          knitr_1.20