1 Introduction

This is the first part of the Bioc2016 workshop “Analysis of single-cell RNA-seq data with R and Bioconductor.”

In this part we will cover single-cell RNA-Seq quality control (QC) and normalization with the scone package. The package is available on Github and will be submitted to Bioconductor in the near future.

Single-cell RNA sequencing (scRNA-Seq) technologies are opening the way for transcriptome-wide profiling across diverse and complex mammalian tissues, facilitating unbiased identification of novel cell sub-populations and their functional roles. As in other high-throughput assays, a fraction of the heterogeneity observed in scRNA-Seq data results from batch effects and other technical artifacts. In particular, these protocols’ reliance on minuscule amounts of starting mRNA can lead to widespread “drop-out effects,” in which expressed transcripts are missed. Due to the biases inherent to these assays, data normalization is an essential step prior to any downstream analyses. Furthermore, due to wide-range of scRNA-Seq study designs used in the field, we cannot expect to find a one-size-fits-all solution to these problems.

scone supports a rational, data-driven framework for assessing the efficacy of various normalization workflows, encouraging users to explore trade-offs inherent to their data set prior to finalizing a data normalization strategy. We provide an interface for running multiple normalization workflows in parallel. We also offer tools for ranking workflows and visualizing trade-offs. We import some common normalization modules used in traditional bulk sequencing, and provide support for integrating user-specified normalization modules.

1.1 An Example Dataset

The first two parts of this workshop will analyze a small set of cells lying along a developmental trajectory. We will start from raw data objects obtained from a standard transcriptome alignment pipeline. Raw data and important reference data can be loaded directly from the workshop package.

library(bioc2016singlecell)

## Load Example Data
data(ws_input)
ls()

##  [1] "batch"          "bio"            "breaks"         "cl"            
##  [5] "cl3"            "cl7"            "cm"             "cont_levels"   
##  [9] "contrMat"       "counts"         "de"             "defaultMar"    
## [13] "fit"            "genesDendro"    "genesF"         "genesOneAll"   
## [17] "genesPairs"     "hk"             "manual"         "mast_res"      
## [21] "norm"           "norm_logcounts" "pca"            "plotCMar"      
## [25] "qc"             "res1"           "res2"           "res3"          
## [29] "rs"             "sca"            "se"             "wh_rm"

The counts data frame contains feature-level read counts from tophat alignments of 96 single-cell libraries to the mm10 reference genome (Trapnell, Pachter, and Salzberg 2009). qc contains library alignment metrics obtained from Picard.

colnames(qc)

##  [1] "NREADS"               "NALIGNED"             "RALIGN"              
##  [4] "TOTAL_DUP"            "PRIMER"               "PCT_RIBOSOMAL_BASES" 
##  [7] "PCT_CODING_BASES"     "PCT_UTR_BASES"        "PCT_INTRONIC_BASES"  
## [10] "PCT_INTERGENIC_BASES" "PCT_MRNA_BASES"       "MEDIAN_CV_COVERAGE"  
## [13] "MEDIAN_5PRIME_BIAS"   "MEDIAN_3PRIME_BIAS"

de and hk are positive and negative control gene sets derived from population studies. batch and bio are factors labeling batch and time point respectively. Consider the joint distribution of these factors:

## Joint distribution of batches and biological conditions (time after induction)
table(batch,bio)

##       bio
## batch  time0 time1 time2 time3 time4 time5
##   Y01     10     0     0     0     0     0
##   Y02      0     0     0     6     0     0
##   Y03      0     0     0     8     0     0
##   Y04      8     0     0     0     0     0
##   Y05A     0    10     0     0     0     0
##   Y05B     0     9     0     0     0     0
##   Y06      0     0     5     0     0     0
##   Y07A     0     0     9     0     0     0
##   Y07B     0     0    10     0     0     0
##   Y08      0     6     0     0     0     0
##   Y10      0     0     0     0     2     0
##   Y11      0     0     0     0     4     0
##   Y12A     0     0     0     0     0     2
##   Y12B     0     0     0     0     0     2
##   Y13      0     0     0     0     0     5

Notice that each time-point is composed of multiple batches. This nested design is common among scRNA-Seq studies due to limitations on the number of cells that can be harvested/sequenced concurrently.

1.2 Technical Variability and Batch Effects

We will first visualize cell level quality readouts, starting with the ratio of reads aligned to the genome:

# Color scheme
cc <- c(brewer.pal(9, "Set1"), brewer.pal(8, "Set2"),brewer.pal(9,"Set3"))

# Barplot of read proportion mapping to the genome
barplot(qc$RALIGN[order(batch)], col=cc[batch][order(batch)], border=cc[batch][order(batch)], main="Percentage of mapped reads, colored by batch")
legend("bottomleft", legend=levels(batch), fill=cc,cex=0.4)

We see that there aren’t important differences between batches, while there are a few cells of particularly low alignment efficiency. These cells can be removed via filtering procedures. We can alternatively consider the number of reads in each library:

# Barplot of total read number
barplot(qc$NREADS[order(batch)], col=cc[batch][order(batch)], border=cc[batch][order(batch)], main="Total number of reads, colored by batch")
legend("topright", legend=levels(batch), fill=cc, cex=0.4)

We see that coverage varies substantially between batches, and though some of this technical variability may be addressed with filtering, we will still need to carefully normalize the data to make batches comparable.

It can be very helpful to visualize distributions of single metrics, but it’s important to note that these metrics are often correlated. Sometimes it may be more useful to consider principal components of the quality matrix, identifying principal factors of library variation:

qpc = prcomp(qc,center = TRUE,scale. = TRUE)
barplot(cumsum(qpc$sdev^2)/sum(qpc$sdev^2), border="gray", xlab="PC", ylab="proportion of variance", main="Quality PCA")

Even though 14 quality metrics have been quantified, PCA shows us that only a small number of PCs are needed to described a majority of the variance (e.g. 3 to explain 74%). We will now visualize the distribution of the first PC in the context of batch:

barplot(qpc$x[,1][order(batch)], col=cc[batch][order(batch)], border=cc[batch][order(batch)], main="Quality PC1, colored by batch")
legend("bottomleft", legend=levels(batch), fill=cc, cex=0.8)

This PC appears to correlate with batch (particularly Y04), but it also highlights 5 outlier libraries that are significantly different from the rest. While we won’t discuss the application of these PCs to filtering in this workshop, we will show how they can be applied to normalization.

1.3 Drop-out Characteristics

Next we consider a uniquely single-cell problem: drop-outs. hk contains a list of genes that are believed to be ubiquitously and uniformly expressed across the target tissue. Because we assume these genes are expressed in all cells, we can label all zeroes as drop-out events. Below we model detection failures as a logistic function of mean expression, in line with the standard logistic model for drop-outs employed by the field.

# Mean log10(x+1) expression
mu_obs = rowMeans(log10(counts[hk,]+1))

# Drop-outs
drop_outs = counts[hk,] == 0

# Logistic Regression Model of Failure
ref.glms = list()
for (si in 1:dim(drop_outs)[2]){
  fit = glm(cbind(drop_outs[,si],1 - drop_outs[,si]) ~ mu_obs,family=binomial(logit))
  ref.glms[[si]] = fit$coefficients
}

The list ref.glm contains the intercept and slope of each fit. We can now visualize the fit curves and the corresponding Area Under the Curves (AUCs):

par(mfrow=c(1,2))

# Plot Failure Curves and Calculate AUC
plot(NULL, main = "False Negative Rate Curves",ylim = c(0,1),xlim = c(0,6), ylab = "Failure Probability", xlab = "Mean log10 Expression")
x = (0:60)/10
AUC = NULL
for(si in 1:ncol(counts)){
  y = 1/(exp(-ref.glms[[si]][1] - ref.glms[[si]][2] * x) + 1)
  AUC[si] = sum(y)/10
  lines(x, 1/(exp(-ref.glms[[si]][1] - ref.glms[[si]][2] * x) + 1), type = 'l', lwd = 2, col = cc[batch][si])
}

# Barplot of FNR AUC
barplot(AUC[order(batch)], col=cc[batch][order(batch)], border=cc[batch][order(batch)], main="FNR AUC, colored by batch")
legend("topleft", legend=levels(batch), fill=cc, cex=0.4)

You may have noticed that the 5 outlier cells seen here are the same as the 5 outliers highlighted in the PCA of quality metrics above. Drop-out summaries such as the AUC can very useful for assessing single-cell library quality.

1.4 The `scone` Workflow

So far we have only described potential problems with our data set. Now we will take steps to address them! The basic qc and normalization pipeline we will use today will allow us to:

Filter out poor libraries using the metric_sample_filter function.
Run and score many different normalization workflows (different combinations of normalization modules) using the main scone function.
Browse top-ranked methods and visualize trade-offs with the biplot_colored function.

The SCONE’s normalization workflow template is composed of 3 modules:

Data imputation: replacing zero-abundance values with expected values under a drop-out model. We will only consider identity imputation (no imputation) in this example.
Scaling or quantile normalization: either i) normalization that scales each sample’s transcriptome abundances by a single factor or ii) more complex offsets that match quantiles across samples. Examples: TMM or DESeq scaling factors, upper quartile normalization, or full-quantile normalization.
Regression-based approaches for removing unwanted correlated variation from the data. Examples: RUVg (Risso et al. 2014) or regression on Quality Principal Components described above.

2 Step 1: Sample Filtering with `metric_sample_filter`

The most basic sample filtering function in scone is the metric_sample_filter. It takes as input an expression matrix. The output depends on arguments provided, but generally consists of a list of 4 logicals designating each sample as having failed (TRUE) or passed (FALSE) a threshold-based filter on 4 metrics

Number of reads.
Ratio of reads aligned to the genome. Requires the ralign argument.
“Transcriptome breadth” - Defined here as the proportion of high-quality genes detected in the sample. Requires the gene_filter argument.
FNR AUC. Requires the pos_controls argument.

If required arguments are missing for any of the 4, the function will simply return NA instead of the corresponding logical.

mfilt_report <- metric_sample_filter(expr = counts,
                                     nreads = qc$NREADS,ralign = qc$RALIGN,
                                     suff_nreads = 10^5,
                                     suff_ralign = 90,
                                     pos_controls = hk,
                                     zcut = 3,mixture = FALSE, plot = TRUE)

In the call above, we have set the following parameters:

suff_nreads = 10^5. Sets max (or “sufficient”) threshold for nreads filter.
suff_ralign = 90. Sets max threshold for ralign filter.
zcut = 3. Filter leniency (see below).
mixture = FALSE. Mixture modeling will not be used (see below).
plot = TRUE. Plot distributions of metrics before and after filtering.

2.1 On Threshold Selection

Let’s take a closer look at the computation behind the ralign filter. In selecting the threshold value 90, metric_sample_filter is taking 4 values into account:

hard_ralign, the default “hard” threshold at 15 - rather forgiving…
3 (zcut) times the standard deviation below the mean ralign value.
3 (zcut) times the MAD below the median ralign value.
suff_ralign, the sufficient threshold set to 90.

hist(qc$RALIGN, breaks = 0:100)
# Hard threshold
abline(v = 15, col = "yellow", lwd = 2) 
# 3 (zcut) standard deviations below the mean ralign value
abline(v = mean(qc$RALIGN) - 3*sd(qc$RALIGN), col = "green", lwd = 2) 
# 3 (zcut) MADs below the median ralign value
abline(v = median(qc$RALIGN) - 3*mad(qc$RALIGN), col = "red", lwd = 2)
# Sufficient threshold
abline(v = 90, col = "grey", lwd = 2)
# Final threshold is the minimum of 1) the sufficient threshold and 2) the max of all others
thresh = min(90,max(c(15,mean(qc$RALIGN) - 3*sd(qc$RALIGN),median(qc$RALIGN) - 3*mad(qc$RALIGN))))
abline(v = thresh, col = "blue", lwd = 2, lty = 2)

legend("topleft",legend = c("Hard","SD","MAD","Sufficient","Final"),lwd = 2, col = c("yellow","green","red","grey","blue"),lty = c(1,1,1,1,2), cex = .5)

We see here that the 3rd threshold exceeds the sufficient threshold, and so metric_sample_filter settles for the sufficient threshold. Note that if mixture=TRUE an additional criterion is included: bi-modal metric distributions are fit with to a two component mixture model, and a threshold is defined with respect to the mean and standard deviation of the “better” component. As metric_sample_filter relies on a maximum of candidate thresholds, we recommend users treat this function as a liberal filter.

2.2 Applying the sample filter

With metric_sample_filter output in hand, filtering out the poor samples is fairly straightforward:

# Which thresholds are missing? (breadth)
is_na_filt = unlist(lapply(is.na(mfilt_report),any)) 

# Identify samples failing any threshold
m_sampfilter = !apply(simplify2array(mfilt_report[!is_na_filt]),1,any)

# Filter Samples
fcounts = counts[,m_sampfilter]
fqc = qc[m_sampfilter,]
fbatch = batch[m_sampfilter]
fbio = bio[m_sampfilter]

# Simple gene filter
filterCount <- function(counts, nRead=5, nCell=5){
  filter <- apply(counts, 1, function(x) length(x[x>=nRead])>=nCell)
  return(filter)
}
genefilter <- filterCount(fcounts)

# Filter genes
fcounts = fcounts[genefilter,]
fhk = hk[hk %in% rownames(fcounts)]
fde = de[de %in% rownames(fcounts)]

3 Step 2: Run and Score Normalization Workflows with `scone`

As described earlier, not only does scone normalize expression data, but it also provides a framework for evaluation the performance of various normalization workflows. In order to run the scone function, we will need to decide which workflows we will want to compare.

3.1 Selecting Workflow Parameters with `run=FALSE`

The arguments to scone allow for a lot of flexibility, but sometimes you will need to run very specific combinations of modules. For this purpose, scone can be run in run=FALSE mode, returning only a data frame of workflows to be performed.

params <- scone(expr = as.matrix(fcounts),scaling = c(none = identity,deseq = DESEQ_FN, tmm = TMM_FN, uqp = UQ_FN_POS, fq = FQT_FN),
                ruv_negcon = fhk, k_ruv = 3,
                qc = as.matrix(fqc), k_qc = 3,
                bio = fbio,adjust_bio = "yes",
                batch = fbatch,adjust_batch = "yes",
                run = FALSE)
head(params)

##                                   imputation_method scaling_method
## none,none,no_uv,no_bio,no_batch                none           none
## none,deseq,no_uv,no_bio,no_batch               none          deseq
## none,tmm,no_uv,no_bio,no_batch                 none            tmm
## none,uqp,no_uv,no_bio,no_batch                 none            uqp
## none,fq,no_uv,no_bio,no_batch                  none             fq
## none,none,ruv_k=1,no_bio,no_batch              none           none
##                                   uv_factors adjust_biology adjust_batch
## none,none,no_uv,no_bio,no_batch        no_uv         no_bio     no_batch
## none,deseq,no_uv,no_bio,no_batch       no_uv         no_bio     no_batch
## none,tmm,no_uv,no_bio,no_batch         no_uv         no_bio     no_batch
## none,uqp,no_uv,no_bio,no_batch         no_uv         no_bio     no_batch
## none,fq,no_uv,no_bio,no_batch          no_uv         no_bio     no_batch
## none,none,ruv_k=1,no_bio,no_batch    ruv_k=1         no_bio     no_batch

In the call above, we have set the following parameters:

scaling = list(…). This argument contains a list of scaling normalization functions that will be applied, including the identity (no-op), DESeq scaling, TMM normalization, scaling by the upper quartile of positive counts, and full-quantile normalization.
ruv_negcon = fhk. A list of genes to be used as negative controls for RUVg normalization.
k_ruv = 3. The maximum number of RUVg factors to consider.
k_qc = 3. The maximum number of quality PCs (QPCs) to be included in a linear model, analogous to RUVg normalization. The qc argument must be provided.
adjust_bio = “yes.” Time points will be included in RUVg or QPC normalization and restored to the residual of the regression. The bio argument must be provided.
adjust_batch = “yes.” Batch will be included in RUVg or QPC normalization. The batch argument must be provided.

These arguments translate to the following set of options:

apply(params,2,unique)

## $imputation_method
## [1] "none"
## 
## $scaling_method
## [1] "none"  "deseq" "tmm"   "uqp"   "fq"   
## 
## $uv_factors
## [1] "no_uv"   "ruv_k=1" "ruv_k=2" "ruv_k=3" "qc_k=1"  "qc_k=2"  "qc_k=3" 
## 
## $adjust_biology
## [1] "no_bio" "bio"   
## 
## $adjust_batch
## [1] "no_batch" "batch"

While adjusting for biology prior to downstream analysis is generally problematic, it can be particularly dangerous in the single-cell context: there is significantly more sample-intrinsic biological variability in scRNA-Seq data, variability that cannot be captured by exposure. In our nested design scenario we would only consider adjustments for biology when batch effects are included. We can produce an updated params data frame reflecting this choice:

is_screened = (params$adjust_biology == "bio") & (params$adjust_batch != "batch")

params = params[!is_screened,]

3.2 Calling `scone` with `run=TRUE`

Now that we have selected our workflows, we can run scone in run=TRUE mode. This mode offers a few additional arguments, including an optional params argument to pass any results from the run=FALSE mode. In order to understand these arguments, we must first understand the 8 metrics used to evaluate each normalization. The first 6 metrics rely on a reduction of the normalized data down to 3 dimensions via PCA (default). Each metric is taken to have a positive (higher is better) or negative (lower is better) signature.

BIO_SIL. The average silhouette width of clusters defined by bio, defined with respect to a Euclidean distance metric over the first 3 expression PCs. Positive signature.
BATCH_SIL. The average silhouette width of clusters defined by batch, defined with respect to a Euclidean distance metric over the first 3 expression PCs. Negative signature.
PAM_SIL. The maximum average silhouette width of clusters defined by PAM clustering, defined with respect to a Euclidean distance metric over the first 3 expression PCs. Positive signature.
EXP_QC_COR. Maximum squared Spearman correlation between first 3 expression PCs and first k_qc QPCs. Negative signature.
EXP_UV_COR. Maximum squared Spearman correlation between first 3 expression PCs and first 3 PCs of the negative control (specified by eval_negcon or ruv_negcon by default) sub-matrix of the original (raw) data. Negative signature.
EXP_WV_COR. Maximum squared Spearman correlation between first 3 expression PCs and first 3 PCs of the positive control (specified by eval_poscon) sub-matrix of the original (raw) data. Positive signature.
RLE_MED. The mean squared median Relative Log Expression (RLE). Negative signature.
RLE_IQR. The mean inter-quartile range (IQR) of the RLE. Negative signature.

res <- scone(expr = as.matrix(fcounts),scaling = c(none = identity, deseq = DESEQ_FN, tmm = TMM_FN, uqp = UQ_FN_POS, fq = FQT_FN),
             ruv_negcon = fhk, k_ruv = 3,
             qc = as.matrix(fqc), k_qc = 3,
             bio = fbio,adjust_bio = "yes",
             batch = fbatch,adjust_batch = "yes",
             run = TRUE,params = params,
             eval_poscon = fde, eval_kclust = 2:3, conditional_pam = TRUE)

In the call above, we have set the following parameters:

eval_poscon = fde. A list of genes to be used as positive controls for evaluation.
eval_kclust = 2:3. For PAM_SIL, range of k (# of clusters) to use when computing maximum average silhouette width of PAM clusterings.
conditional_pam = TRUE. For PAM_SIL, apply separate PAM clusterings to each time point rather than across all time points. Average is weighted by time point group size.

Running the command will take a couple minutes. The output will contain a list of four elements:

names(res)

## [1] "normalized_data" "metrics"         "scores"          "params"

normalized_data contains a list of normalized expression data (log-scale); each expression matrix is named according to the same convention as seen in the params argument.

metrics contains the 8 raw metrics for each normalization. scores contains metrics multiplied by their signature - or “scores” - as well as a 9th column that contains the mean score for that normalization. Normalization workflows in normalized_data,metrics, and scores are sorted in decreasing order by mean score.

head(res$scores)

##                                       BIO_SIL BATCH_SIL   PAM_SIL
## none,fq,ruv_k=1,no_bio,no_batch    0.08653280 0.1508225 0.3741884
## none,fq,ruv_k=2,no_bio,no_batch    0.08126123 0.1558709 0.3716837
## none,fq,ruv_k=3,no_bio,no_batch    0.09865276 0.1415136 0.3629964
## none,deseq,ruv_k=2,no_bio,no_batch 0.09721160 0.1722529 0.3983736
## none,deseq,ruv_k=1,no_bio,no_batch 0.08828919 0.1640219 0.4215769
## none,fq,qc_k=1,no_bio,no_batch     0.08318459 0.1581687 0.4326760
##                                    EXP_QC_COR  EXP_UV_COR EXP_WV_COR
## none,fq,ruv_k=1,no_bio,no_batch    -0.2132825 -0.09864541  0.5405376
## none,fq,ruv_k=2,no_bio,no_batch    -0.2012192 -0.08994268  0.5313734
## none,fq,ruv_k=3,no_bio,no_batch    -0.2300302 -0.06228916  0.5779535
## none,deseq,ruv_k=2,no_bio,no_batch -0.2984754 -0.16153256  0.5042606
## none,deseq,ruv_k=1,no_bio,no_batch -0.3437892 -0.16409015  0.4916737
## none,fq,qc_k=1,no_bio,no_batch     -0.2336496 -0.28368051  0.5126642
##                                         RLE_MED    RLE_IQR  mean_score
## none,fq,ruv_k=1,no_bio,no_batch    -0.002169450 -0.9977360 -0.01996901
## none,fq,ruv_k=2,no_bio,no_batch    -0.002856720 -1.0625194 -0.02704359
## none,fq,ruv_k=3,no_bio,no_batch    -0.002686065 -1.1112346 -0.02814046
## none,deseq,ruv_k=2,no_bio,no_batch -0.007600576 -0.9843497 -0.03498245
## none,deseq,ruv_k=1,no_bio,no_batch -0.007537978 -0.9375725 -0.03592851
## none,fq,qc_k=1,no_bio,no_batch     -0.003546027 -0.9857528 -0.03999193

4 Step 3: Selecting a normalization for downstream analysis

Based on our sorting criteria, it would appear that none,fq,ruv_k=1,no_bio,no_batch performs well compared to other normalization workflows. A useful way to visualize this method with respect to others is the biplot_colored function

pc_obj = prcomp(res$scores[,-ncol(res$scores)],center = TRUE,scale = FALSE)
bp_obj = biplot_colored(pc_obj,y = -res$scores[,ncol(res$scores)],expand = .6)

We have colored each point above according the corresponding method’s mean score. We can see that workflows cluster largely by three scores: BIO_SIL, EXP_WV_COR, and EXP_UV_COR. Since clustering by biology is playing a strong role, we can highlight the points corresponding the biology adjustment:

bp_obj = biplot_colored(pc_obj,y = -res$scores[,ncol(res$scores)],expand = .6)

points(bp_obj[grepl(",bio,batch",rownames(bp_obj)),], pch = 1, col = "red", cex = 1)
points(bp_obj[grepl(",bio,batch",rownames(bp_obj)),], pch = 1, col = "red", cex = 1.5)

We see that the intermediate-performance cluster to the bottom-left is made up only of these methods. This example highlights the danger of adjusting for biology, even in the nested design scenario. Although biology-adjusted data may reflect prior differences more strongly, other metrics would suggest that the data is poorly behaved. We can also consider cases where only batch is accounted for:

bp_obj = biplot_colored(pc_obj,y = -res$scores[,ncol(res$scores)],expand = .6)

points(bp_obj[grepl(",no_bio,batch",rownames(bp_obj)),], pch = 1, col = "red", cex = 1)
points(bp_obj[grepl(",no_bio,batch",rownames(bp_obj)),], pch = 1, col = "red", cex = 1.5)

Here we see that the poorly performing cluster on the right is made up solely of these methods. This follows from the simple fact that batch is nested within biology, and adjustments for the former can remove critical biological variation. Finally we will visualize the top-performing method and it’s relation to un-normalized data:

bp_obj = biplot_colored(pc_obj,y = -res$scores[,ncol(res$scores)],expand = .6)

points(t(bp_obj[1,]), pch = 1, col = "red", cex = 1)
points(t(bp_obj[1,]), pch = 1, col = "red", cex = 1.5)

points(t(bp_obj[rownames(bp_obj) == rownames(params)[1],]), pch = 1, col = "blue", cex = 1)
points(t(bp_obj[rownames(bp_obj) == rownames(params)[1],]), pch = 1, col = "blue", cex = 1.5)

arrows(bp_obj[rownames(bp_obj) == rownames(params)[1],][1],
       bp_obj[rownames(bp_obj) == rownames(params)[1],][2],
       bp_obj[1,][1],
       bp_obj[1,][2],
       lty = 2, lwd = 2)

The arrow traces a line from the “no-op” normalization to the top-ranked normalization in SCONE. We see that SCONE has selected a method in-between the two extremes, reducing the signal of unwanted variation (as defined by fhk) while preserving biological signal (as defined by fbio and fde).

5 Session Info

sessionInfo()

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2            scone_0.0.5                  
##  [3] MAST_0.934                    reshape_0.8.5                
##  [5] cluster_2.0.4                 clusterExperiment_0.99.1-9003
##  [7] SummarizedExperiment_1.3.5    Biobase_2.33.0               
##  [9] GenomicRanges_1.25.8          GenomeInfoDb_1.9.1           
## [11] IRanges_2.7.11                S4Vectors_0.11.7             
## [13] BiocGenerics_0.19.1           bioc2016singlecell_0.0.1     
## [15] devtools_1.11.1               BiocStyle_2.1.10             
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_1.2-6        hwriter_1.3.2          
##   [3] class_7.3-14            modeltools_0.2-21      
##   [5] mclust_5.2              XVector_0.13.2         
##   [7] flexmix_2.3-13          AnnotationDbi_1.35.3   
##   [9] mvtnorm_1.0-5           xml2_0.1.2             
##  [11] codetools_0.2-14        splines_3.3.0          
##  [13] R.methodsS3_1.7.1       doParallel_1.0.10      
##  [15] bold_0.3.5              DESeq_1.25.0           
##  [17] robustbase_0.92-6       geneplotter_1.51.0     
##  [19] knitr_1.13              ade4_1.7-4             
##  [21] jsonlite_0.9.22         locfdr_1.1-8           
##  [23] Rsamtools_1.25.0        phylobase_0.8.2        
##  [25] gridBase_0.4-7          annotate_1.51.0        
##  [27] kernlab_0.9-24          R.oo_1.20.0            
##  [29] rentrez_1.0.2           httr_1.2.0             
##  [31] assertthat_0.1          Matrix_1.2-6           
##  [33] lazyeval_0.2.0          rotl_3.0.0             
##  [35] limma_3.29.10           formatR_1.4            
##  [37] htmltools_0.3.5         tools_3.3.0            
##  [39] gtable_0.2.0            taxize_0.7.8           
##  [41] reshape2_1.4.1          dplyr_0.4.3            
##  [43] ShortRead_1.31.0        Rcpp_0.12.5            
##  [45] NMF_0.20.6              trimcluster_0.1-2      
##  [47] Biostrings_2.41.4       gdata_2.17.0           
##  [49] ape_3.5                 nlme_3.1-128           
##  [51] rtracklayer_1.33.7      iterators_1.0.8        
##  [53] fpc_2.1-10              stringr_1.0.0          
##  [55] rngtools_1.2.4          gtools_3.5.0           
##  [57] XML_3.98-1.4            dendextend_1.2.0       
##  [59] edgeR_3.15.0            DEoptimR_1.0-4         
##  [61] zlibbioc_1.19.0         MASS_7.3-45            
##  [63] scales_0.4.0            aroma.light_3.3.0      
##  [65] yaml_2.1.13             RUVSeq_1.7.2           
##  [67] memoise_1.0.0           ggplot2_2.1.0          
##  [69] pkgmaker_0.22           segmented_0.5-1.4      
##  [71] biomaRt_2.29.2          latticeExtra_0.6-28    
##  [73] stringi_1.1.1           RSQLite_1.0.0          
##  [75] genefilter_1.55.2       foreach_1.4.3          
##  [77] GenomicFeatures_1.25.14 caTools_1.17.1         
##  [79] boot_1.3-18             BiocParallel_1.7.4     
##  [81] chron_2.3-47            prabclus_2.2-6         
##  [83] matrixStats_0.50.2      bitops_1.0-6           
##  [85] rncl_0.6.0              evaluate_0.9           
##  [87] lattice_0.20-33         GenomicAlignments_1.9.4
##  [89] rredlist_0.1.0          plyr_1.8.4             
##  [91] magrittr_1.5            R6_2.1.2               
##  [93] gplots_3.0.1            DBI_0.4-1              
##  [95] whisker_0.3-2           withr_1.0.2            
##  [97] mixtools_1.0.4          RCurl_1.95-4.8         
##  [99] survival_2.39-4         abind_1.4-3            
## [101] nnet_7.3-12             tibble_1.0             
## [103] EDASeq_2.7.2            uuid_0.1-2             
## [105] KernSmooth_2.23-15      howmany_0.3-1          
## [107] rmarkdown_0.9.6         RNeXML_2.0.6           
## [109] grid_3.3.0              data.table_1.9.6       
## [111] digest_0.6.9            diptest_0.75-7         
## [113] xtable_1.8-2            tidyr_0.5.1            
## [115] R.utils_2.3.0           munsell_0.4.3          
## [117] registry_0.3

Risso, Davide, John Ngai, Terence P. Speed, and Sandrine Dudoit. 2014. “Normalization of Rna-Seq Data Using Factor Analysis of Control Genes or Samples.” Nature Biotechnology 32: 896–902.

Trapnell, Cole, Lior Pachter, and Steven L. Salzberg. 2009. “TopHat: Discovering Splice Junctions with Rna-Seq.” Bioinformatics 25 (9): 1105–11.

Quality Control (QC) and Normalization

Michael Cole

2016-06-22

Contents

1 Introduction

1.1 An Example Dataset

1.2 Technical Variability and Batch Effects

1.3 Drop-out Characteristics

1.4 The `scone` Workflow

2 Step 1: Sample Filtering with `metric_sample_filter`

2.1 On Threshold Selection

2.2 Applying the sample filter

3 Step 2: Run and Score Normalization Workflows with `scone`

3.1 Selecting Workflow Parameters with `run=FALSE`

3.2 Calling `scone` with `run=TRUE`

4 Step 3: Selecting a normalization for downstream analysis

5 Session Info

Quality Control (QC) and Normalization

Michael Cole

2016-06-22

Contents

1 Introduction

1.1 An Example Dataset

1.2 Technical Variability and Batch Effects

1.3 Drop-out Characteristics

1.4 The scone Workflow

2 Step 1: Sample Filtering with metric_sample_filter

2.1 On Threshold Selection

2.2 Applying the sample filter

3 Step 2: Run and Score Normalization Workflows with scone

3.1 Selecting Workflow Parameters with run=FALSE

3.2 Calling scone with run=TRUE

4 Step 3: Selecting a normalization for downstream analysis

5 Session Info

1.4 The `scone` Workflow

2 Step 1: Sample Filtering with `metric_sample_filter`

3 Step 2: Run and Score Normalization Workflows with `scone`

3.1 Selecting Workflow Parameters with `run=FALSE`

3.2 Calling `scone` with `run=TRUE`