scRecover
is an R package for
imputation of single-cell RNA-seq (scRNA-seq) data. It
will detect and impute dropout values in a scRNA-seq raw read counts
matrix while keeping the real zeros unchanged.
Since there are both dropout zeros and real zeros in scRNA-seq data,
imputation methods should not impute all the zeros to non-zero values.
To distinguish dropout and real zeros,
scRecover
employs the Zero-Inflated
Negative Binomial (ZINB) model for dropout probability estimation of
each gene and accumulation curves for prediction of dropout number in
each cell. By combination with scImpute, SAVER and MAGIC, it not only
detects dropout and real zeros at higher accuracy, but
also improve the downstream clustering and visualization
results.
If you use scRecover
in published
research, please cite:
To install scRecover
from Bioconductor:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("scRecover")
To install the developmental version from Bioconductor:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("scRecover", version = "devel")
Or install the developmental version from GitHub:
if(!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("miaozhun/scRecover")
To load scRecover
and other required
packages for the vignettes in R:
library(scRecover)
library(BiocParallel)
suppressMessages(library(SingleCellExperiment))
scRecover
takes two inputs:
counts
and one of Kcluster
or
labels
.
The input counts
is a scRNA-seq read counts
matrix or a SingleCellExperiment
object which contains the read counts matrix. The rows of the matrix are
genes and columns are cells.
Kcluster
is an integer specifying the number of cell
subpopulations. This parameter can be determined based on prior
knowledge or clustering of raw data. Kcluster
is used to
determine the candidate neighbors of each cell and need not to be very
accurate.
labels
is a character/integer vector specifying the cell
type of each column in the raw count matrix. Only needed when
Kcluster = NULL
. Each cell type should have at least two
cells for imputation.
Users can load the test data in
scRecover
by
data(scRecoverTest)
The test data counts
in scRecoverTest
is a
scRNA-seq read counts matrix which has 200 genes (rows) and 150 cells
(columns).
dim(counts)
#> [1] 200 150
counts[1:6, 1:6]
#> E3.46.3383 E3.51.3425 E3.46.3388 E3.51.3423 E3.46.3382 E3.49.3407
#> BTG4 22 0 12 26 0 0
#> GABRB1 0 0 0 0 0 0
#> IL9 0 0 0 0 0 0
#> TAPBPL 2 0 5 1 0 2
#> KANK4 0 0 0 0 0 0
#> CPSF2 12 0 95 0 5 115
The object labels
in scRecoverTest
is a
vector of integer specifying the cell types in the read counts matrix,
corresponding to the columns of counts
.
length(labels)
#> [1] 150
table(labels)
#> labels
#> 1 2
#> 50 100
The object oneCell
in scRecoverTest
is a
vector of a cell’s raw read counts for each gene.
head(oneCell)
#> Vsx2_1a_F2
#> Itm2a 0
#> Gm26258 0
#> Sergef 149
#> Gm24106 0
#> Fam109a 0
#> Ssu72 60
length(oneCell)
#> [1] 24538
Here is an example to run scRecover
with read counts matrix input:
# Load test data for scRecover
data(scRecoverTest)
# Run scRecover with Kcluster specified
scRecover(counts = counts, Kcluster = 2, outputDir = "./outDir_scRecover/", verbose = FALSE)
#> [1] "========================= Running scImpute ========================="
#> [1] "reading in raw count matrix ..."
#> [1] "number of genes in raw count matrix 200"
#> [1] "number of cells in raw count matrix 150"
#> [1] "reading finished!"
#> [1] "imputation starts ..."
#> [1] "searching candidate neighbors ... "
#> [1] "inferring cell similarities ..."
#> [1] "dimension reduction ..."
#> [1] "calculating cell distances ..."
#> [1] "cluster sizes:"
#> [1] 92 54
#> [1] "estimating dropout probability for type 1 ..."
#> [1] "searching for valid genes ..."
#> [1] "imputing dropout values for type 1 ..."
#> [1] "estimating dropout probability for type 2 ..."
#> [1] "searching for valid genes ..."
#> [1] "imputing dropout values for type 2 ..."
#> [1] "writing imputed count matrix ..."
#> [1] "========================= scImpute finished ========================"
#>
#> Processing 1 of 2 cell clusters
#>
#> Processing 2 of 2 cell clusters
#>
#> [1] "======================== scRecover finished! ========================"
#> [1] "The output files of scRecover are in directory:"
#> [1] "./outDir_scRecover/"
#> [1] "============================== Cheers! =============================="
# Or run scRecover with labels specified
# scRecover(counts = counts, labels = labels, outputDir = "./outDir_scRecover/")
The SingleCellExperiment
class is a widely used S4 class for storing single-cell genomics data.
scRecover
also could take the
SingleCellExperiment
data representation as input.
Here is an example to run scRecover
with SingleCellExperiment
input:
# Load test data for scRecover
data(scRecoverTest)
# Convert the test data in scRecover to SingleCellExperiment data representation
sce <- SingleCellExperiment(assays = list(counts = as.matrix(counts)))
# Run scRecover with SingleCellExperiment input sce (Kcluster specified)
scRecover(counts = sce, Kcluster = 2, outputDir = "./outDir_scRecover/", verbose = FALSE)
#> [1] "========================= Running scImpute ========================="
#> [1] "reading in raw count matrix ..."
#> [1] "number of genes in raw count matrix 200"
#> [1] "number of cells in raw count matrix 150"
#> [1] "reading finished!"
#> [1] "imputation starts ..."
#> [1] "searching candidate neighbors ... "
#> [1] "inferring cell similarities ..."
#> [1] "dimension reduction ..."
#> [1] "calculating cell distances ..."
#> [1] "cluster sizes:"
#> [1] 92 54
#> [1] "estimating dropout probability for type 1 ..."
#> [1] "searching for valid genes ..."
#> [1] "imputing dropout values for type 1 ..."
#> [1] "estimating dropout probability for type 2 ..."
#> [1] "searching for valid genes ..."
#> [1] "imputing dropout values for type 2 ..."
#> [1] "writing imputed count matrix ..."
#> [1] "========================= scImpute finished ========================"
#>
#> Processing 1 of 2 cell clusters
#>
#> Processing 2 of 2 cell clusters
#>
#> [1] "======================== scRecover finished! ========================"
#> [1] "The output files of scRecover are in directory:"
#> [1] "./outDir_scRecover/"
#> [1] "============================== Cheers! =============================="
# Or run scRecover with SingleCellExperiment input sce (labels specified)
# scRecover(counts = sce, labels = labels, outputDir = "./outDir_scRecover/")
Function estDropoutNum
in the package could estimate the
dropout gene number or all expressed gene number (namely observed gene
number plus dropout gene number) in a cell:
# Load test data
data(scRecoverTest)
# Downsample 10% read counts in oneCell
set.seed(999)
oneCell.down <- countsSampling(counts = oneCell, fraction = 0.1)
# Count the groundtruth dropout gene number in the downsampled cell
sum(oneCell.down == 0 & oneCell != 0)
#> [1] 2095
# Estimate the dropout gene number in the downsampled cell by estDropoutNum
estDropoutNum(sample = oneCell.down, depth = 10, return = "dropoutNum")
#> [1] 1994
Blow shows the expressed gene number predicted by
estDropoutNum
with 10% downsampled reads and the
groundtruth expressed gene number derived by downsampling when the reads
depth varying from 0% to 100% of the total reads.
Imputed expression matrices of
scRecover
will be saved in the output
directory specified by or a folder named with prefix ‘outDir_scRecover_’
under the current working directory when is unspecified.
scRecover
integrates parallel computing
function with BiocParallel
package. Users could just set parallel = TRUE
(default) in
function scRecover
to enable parallelization and leave the
BPPARAM
parameter alone.
# Run scRecover with Kcluster specified
scRecover(counts = counts, Kcluster = 2, parallel = TRUE)
# Run scRecover with labels specified
scRecover(counts = counts, labels = labels, parallel = TRUE)
Advanced users could use a BiocParallelParam
object from
package BiocParallel
to fill in the BPPARAM
parameter to specify the parallel back-end to be used and its
configuration parameters.
The best choice for Unix and Mac users is to use
MulticoreParam
to configure a multicore parallel
back-end:
# Set the parameters and register the back-end to be used
param <- MulticoreParam(workers = 18, progressbar = TRUE)
register(param)
# Run scRecover with 18 cores (Kcluster specified)
scRecover(counts = counts, Kcluster = 2, parallel = TRUE, BPPARAM = param)
# Run scRecover with 18 cores (labels specified)
scRecover(counts = counts, labels = labels, parallel = TRUE, BPPARAM = param)
For Windows users, use SnowParam
to configure a Snow
back-end is a good choice:
# Set the parameters and register the back-end to be used
param <- SnowParam(workers = 8, type = "SOCK", progressbar = TRUE)
register(param)
# Run scRecover with 8 cores (Kcluster specified)
scRecover(counts = counts, Kcluster = 2, parallel = TRUE, BPPARAM = param)
# Run scRecover with 8 cores (labels specified)
scRecover(counts = counts, labels = labels, parallel = TRUE, BPPARAM = param)
See the Reference
Manual of BiocParallel
package for more details of the BiocParallelParam
class.
We evaluated SAVER, scImpute, MAGIC and their combined with scRecover version SAVER+scRecover, scImpute+scRecover, MAGIC+scRecover on a downsampling scRNA-seq dataset generated by random sampling of reads from a SMART-seq2 scRNA-seq dataset (Petropoulos S, et al. Cell, 2016, https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3929/).
We found after combined with scRecover, scImpute+scRecover, SAVER+scRecover and MAGIC+scRecover will have higher accuracy than scImpute, SAVER and MAGIC respectively.