Contents

Overview of the workflow

AUCell allows to identify cells with active gene sets (e.g. signatures, gene modules) in single-cell RNA-seq data. In brief, the workflow to run AUCell is based on three steps:

  1. Build the rankings

  2. Calculate the Area Under the Curve (AUC)

  3. Set the assignment thresholds

library(AUCell)
cells_rankings <- AUCell_buildRankings(exprMatrix)

genes <- c("gene1", "gene2", "gene3")
geneSets <- list(geneSet1=genes)
# geneSets <- GSEABase::GeneSet(genes, setName="geneSet1") # alternative
cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings, aucMaxRank=nrow(cells_rankings)*0.05)

par(mfrow=c(3,3))
set.seed(123)
cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, nCores=1, assign=TRUE)

In the following sections we explain and explore the output of each of these steps. The details of the methods behind AUCell are described in the following article:t

## Aibar et al. (2017) SCENIC: single-cell regulatory network inference and clustering. Nature Methods. doi: 10.1038/nmeth.4463

Please, also cite this article if you use AUCell in your research.

Before starting

Setup

By default, AUCell is installed only with the minimum dependencies. To run AUCell in parallel or run the examples in this tutorial, we recommend to install these packages:

source("http://bioconductor.org/biocLite.R")
# To support paralell execution:
biocLite(c("doMC", "doRNG"))
# For the main example:
biocLite(c("mixtools", "GEOquery", "SummarizedExperiment"))
# For the examples in the follow-up section of the tutorial:
biocLite(c("DT", "plotly", "NMF", "d3heatmap", "shiny", "rbokeh"))

Some tips…

Help

At any time, remember you an access the help files for any function (i.e. ?AUCell_buildRankings). It is also possible to open this tutorial directly from R with the following commands:

# Explore tutorials in the web browser:
browseVignettes(package="AUCell") 

# Commnad line-based:
vignette(package="AUCell") # list
vignette("AUCell") # open

Report template

To generate an HTML report with your own data and comments, you can use the .Rmd file of this tutorial as template (i.e. copy the .Rmd file, and edit it as R notebook in RStudio).

vignetteFile <- paste(file.path(system.file('doc', package='AUCell')), "AUCell.Rmd", sep="/")
# Copy to edit as markdown
file.copy(vignetteFile, ".")
# Alternative: extract R code
Stangle(vignetteFile)

Running AUCell

0. Load scRNA-seq dataset and gene sets

The input data for AUCell are the expression matrix and the gene-sets.

Working directory

During this tutorial some plots and files are saved. To keep them tidy, we recommend to set the working directory to a new folder:

dir.create("AUCell_tutorial")
setwd("AUCell_tutorial") # or in the first code chunk (kntr options), if running as Notebook

Expression matrix

The expression matrix, from a single-cell RNA-seq dataset, should be formatted with the genes/features as rows and cells as columns.

Typically, this matrix will be loaded from a counts file, or from another R object. i.e.:

# i.e. Reading from a text file
exprMatrix <- read.table("myCountsMatrix.tsv")
exprMatrix <- as.matrix(exprMatrix)

# or using an expression set
exprMatrix <- exprs(myExpressionSet)

In this tutorial we use a dataset containing 3005 cells from mouse cortex and hippocampus. The dataset can be downloaded from GEO accession number GSE60361.

Zeisel, A., et al. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science 347, 1138–1142. doi: 10.1126/science.aaa1934

# (This may take a few minutes)
library(GEOquery)
attemptsLeft <- 20
while(attemptsLeft>0)
{
  geoFile <- tryCatch(getGEOSuppFiles("GSE60361", makeDirectory=FALSE), error=identity) 
  if("error" %in% class(geoFile)) 
  {
    attemptsLeft <- attemptsLeft-1
    Sys.sleep(5)
  }
  else
    attemptsLeft <- 0
}

gzFile <- grep(".txt.gz", basename(rownames(geoFile)), fixed=TRUE, value=TRUE)
message(gzFile)
txtFile <- gsub(".gz", "", gzFile, fixed=TRUE)
message(txtFile)
gunzip(filename=gzFile, destname=txtFile, remove=TRUE)

library(data.table)
geoData <- fread(txtFile, sep="\t")
geneNames <- unname(unlist(geoData[,1, with=FALSE]))
exprMatrix <- as.matrix(geoData[,-1, with=FALSE])
rm(geoData)
dim(exprMatrix)
rownames(exprMatrix) <- geneNames
exprMatrix[1:5,1:4]

# Remove file
file.remove(txtFile)

# Save for future use
mouseBrainExprMatrix <- exprMatrix
save(mouseBrainExprMatrix, file="exprMatrix_AUCellVignette_MouseBrain.RData")

To speed-up the execution of the tutorial, we will use only 5000 random genes from this dataset.

# load("exprMatrix_AUCellVignette_MouseBrain.RData")
set.seed(333)
exprMatrix <- mouseBrainExprMatrix[sample(rownames(mouseBrainExprMatrix), 5000),]

In this way, we will be using an expression matrix with (5000 gene and 3005 cells):

dim(exprMatrix)
## [1] 5000 3005

Gene sets

The other input AUCell needs is the ‘gene-sets’ or signatures to test on the cells. They can be provided in several formats (see ?AUCell_calcAUC for examples). e.g.:

library(GSEABase)
genes <- c("gene1", "gene2", "gene3")
geneSets <- GeneSet(genes, setName="geneSet1")
geneSets
## setName: geneSet1 
## geneIds: gene1, gene2, gene3 (total: 3)
## geneIdType: Null
## collectionType: Null 
## details: use 'details(object)'

In this example we will use gene-sets representing diferent cell types in the brain:

  1. Big gene signatures (> 1000 genes) for astrocytes, oligodendrocytes and neurons.

Cahoy, J.D., et al. (2008). A Transcriptome Database for Astrocytes, Neurons, and Oligodendrocytes: A New Resource for Understanding Brain Development and Function. J. Neurosci. 28, 264–278. doi: 10.1523/JNEUROSCI.4178-07.2008

  1. Big gene signature (> 500 genes) for microglia. Obtained by comparing bulk RNA-seq profiles of microglia (brain-resident macrophages) to macrophages from other tissues.

Lein, E.S., et al. (2007). Genome-wide atlas of gene expression in the adult mouse brain. Nature 445, 168–176. doi: 10.1038/nature05453

  1. Small gene signatures (<100 genes) for astrocytes and neurons.

Lavin, Y., et al. (2014) Tissue-Resident Macrophage Enhancer Landscapes Are Shaped by the Local Microenvironment. Cell 159, 1312–1326. doi: 10.1016/j.cell.2014.11.018

library(AUCell)
library(GSEABase)
gmtFile <- paste(file.path(system.file('examples', package='AUCell')), "geneSignatures.gmt", sep="/")
geneSets <- getGmt(gmtFile)

Let’s check how many of these genes are in the expression matrix:

geneSets <- subsetGeneSets(geneSets, rownames(exprMatrix)) 
cbind(nGenes(geneSets))
##                       [,1]
## Astrocyte_Cahoy        555
## Neuron_Cahoy           375
## Oligodendrocyte_Cahoy  417
## Astrocyte_Lein          12
## Neuron_Lein             14
## Microglia_lavin        149

To ease the interpretation of the tutorial, we will also add the gene-set size into its name:

geneSets <- setGeneSetNames(geneSets, newNames=paste(names(geneSets), " (", nGenes(geneSets) ,"g)", sep=""))

For the example, let’s also add a few sets of random genes and 100 genes expressed in many cells (i.e. housekeeping-like):

# Random
set.seed(321)
extraGeneSets <- c(
  GeneSet(sample(rownames(exprMatrix), 50), setName="Random (50g)"),
  GeneSet(sample(rownames(exprMatrix), 500), setName="Random (500g)"))

countsPerGene <- apply(exprMatrix, 1, function(x) sum(x>0))
# Housekeeping-like
extraGeneSets <- c(extraGeneSets,
                   GeneSet(sample(names(countsPerGene)[which(countsPerGene>quantile(countsPerGene, probs=.95))], 100), setName="HK-like (100g)"))

geneSets <- GeneSetCollection(c(geneSets,extraGeneSets))
names(geneSets)
## [1] "Astrocyte_Cahoy (555g)"       "Neuron_Cahoy (375g)"         
## [3] "Oligodendrocyte_Cahoy (417g)" "Astrocyte_Lein (12g)"        
## [5] "Neuron_Lein (14g)"            "Microglia_lavin (149g)"      
## [7] "Random (50g)"                 "Random (500g)"               
## [9] "HK-like (100g)"

Since we are using an expression matrix with only 5000 random genes, most of these genes are acutally not available in the dataset. However, AUCell is robust enough to use this ‘noisy’ data.

1. Build gene-expression rankings for each cell

The first step to calculate the enrichment of a signature is to create the “rankings”. These rankings are only an intermediate step to calculate the AUC, but they are kept as a separate step in the workflow in order to provide more flexibility (i.e. to save them for future analyses, to merge datasets, or process them by parts).

For each cell, the genes are ranked from highest to lowest value. The genes with same expression value are shuffled. Therefore, genes with expression ‘0’ are randomly sorted at the end of the ranking. It is important to check that most cells have at least the number of expressed/detected genes that are going to be used to calculate the AUC (aucMaxRank in calcAUC()). The histogram provided by AUCell_buildRankings() allows to quickly check this distribution. plotGeneCount(exprMatrix) allows to obtain only the plot before building the rankings.

Since the rankings are created individually for each cell, in principle, it is possible to merge cell-rankings from different datasets. However, the datasets should be similar in regards to their “sensitivity” (e.g. the number of genes detected in the cells of each datasets), and the genes they include (e.g. same gene IDs).

cells_rankings <- AUCell_buildRankings(exprMatrix, nCores=1, plotStats=TRUE)
## Quantiles for the number of genes detected by cell: 
## (Non-detected genes are shuffled at the end of the ranking.Keep in mind when choosing the threshold for calculating the AUC).

##     min      1%      5%     10%     50%    100% 
##  185.00  255.12  356.00  432.00  902.00 2038.00
cells_rankings
## Ranking for 5000 genes (rows) and 3005 cells (columns).
## 
## Top-left corner of the ranking:
##                     cells
## genes                1772071015_C02 1772071017_G12 1772071017_A05
##   0610010B08Rik_loc4           2499           2954           2799
##   0610010K14Rik                3293            445           3485
##   0610031J06Rik                 862           1045           1102
##   0610037L13Rik                1047           2367           4809
##   0610039K10Rik                4672           4228           4199
##   0610040J01Rik                4034           1584           3729
##                     cells
## genes                1772071014_B06 1772067065_H06
##   0610010B08Rik_loc4           4932           1351
##   0610010K14Rik                3947            239
##   0610031J06Rik                4906            898
##   0610037L13Rik                3080            391
##   0610039K10Rik                3707           4924
##   0610040J01Rik                4353           4128
## 
## Quantiles for the number of genes detected by cell:
##     min      1%      5%     10%     50%    100% 
##  185.00  255.12  356.00  432.00  902.00 2038.00

The “rankings” can be seen as a new representation of the original dataset. Once they are calculated, they can be saved for future analyses.

save(cells_rankings, file="cells_rankings.RData")

2. Calculate enrichment for the gene signatures (AUC)

To determine whether the gene set is enriched at the top of the gene-ranking for each cell, AUCell uses the “Area Under the Curve” (AUC) of the recovery curve.

The function AUCell_calcAUC calculates this score, and returns a matrix with an AUC score for each gene-set in each cell.

cells_AUC <- AUCell_calcAUC(geneSets, cells_rankings)
save(cells_AUC, file="cells_AUC.RData")

In order to calculate the AUC, by default only the top 5% of the genes in the ranking are used (i.e. checks whether the genes in the gene-set or signature are within the top 5%). This allows faster execution on bigger datasets, and reduce the effect of the noise at the bottom of the ranking (e.g. where many genes might be tied at 0 counts). The percentage to take into account can be modified with the argument aucMaxRank. For datasets where most cells express many genes (e.g. a filtered dataset), or these have high expression values, it might be good to increase this threshold. Check the histogram provided by AUCell_buildRankings to get an estimation on where this threshold lies within the dataset.

3. Determine the cells with the given gene signatures or active gene sets

In summary: The AUC represents the proportion of expressed genes in the signature, and their relative expression value compared to the other genes within the cell. We can use this propperty to explore the population of cells that are present in the dataset according to the expression of the gene-set.

The AUC estimates the proportion of genes in the gene-set that are highly expressed in each cell. Cells expressing many genes from the gene-set will have higher AUC values than cells expressing fewer (i.e. compensating for housekeeping genes, or genes that are highly expressed in all the cells in the dataset). Since the AUC represents the proportion of expressed genes in the signature, we can use the relative AUCs across the cells to explore the population of cells that are present in the dataset according to the expression of the gene-set.

However, determining whether the signature is active (or not) in a given cell is not always trivial. The AUC is not an absolute value, but it depends on the the cell type (i.e. sell size, amount of transcripts), the specific dataset (i.e. sensitivity of the measures) and the gene-set. It is often not straight forward to obtain a pruned signature of clear marker genes that are completely “on” in the cell type of interest and off" in every other cell. In addition, at single-cell level, most genes are not expressed or detected at a constant level.

The ideal situation will be a bi-modal distribution, in which most cells in the dataset have a low “AUC” compared to a population of cells with a clearly higher value (i.e. see “Oligodendrocites” in the next figure). This is normally the case on gene sets that are active mostly in a population of cells with a good representation in the dataset (e.g. ~ 5-30% of cells in the dataset). Similar cases of “marker” gene sets but with different proportions of cells in the datasets are the “neurons” and “microglia” (see figure). When there are very few cells within the dataset, the distribution might look normal-like, but with some outliers to the higher end (e.g. microglia). While if the gene set is marker of a high percentage of cells in the dataset (i.e. neurons), the distribution might start approaching the look of a gene-set of housekeeping genes. As example, the ‘housekeeping’ gene-set in the figure includes genes that are detected in most cells in the dataset.

Note that the size of the gene-set will also affect the results. With smaller gene-genes (fewer genes), it is more likely to get cells with AUC = 0. While this is the case of the “perfect markers” it is also easier to get it by chance with smal datasets. (i.e. Random gene set with 50 genes in the figure). Bigger gene-sets (100-2k) can be more stable and easier to evaluate, as big random gene sets will approach the normal distibution.

To ease the exploration of the distributions, the function AUCell_exploreThresholds() automatically plots all the histograms and calculates several thresholds that could be used to consider a gene-set ‘active’ (returned in $aucThr). The distributions are plotted as dotted lines over the histogram and the corresponding thresholds as vertical bars in the matching color. The thicker vertical line indicates the threshold selected by default ($aucThr$selected): the highest value to reduce the false positives.

Note: This function makes use of package “mixtools” to explore the distributions. It is not essential, but we recommend to install it: source("http://bioconductor.org/biocLite.R"); biocLite("mixtools")

set.seed(123)
par(mfrow=c(3,3)) 
cells_assignment <- AUCell_exploreThresholds(cells_AUC, plotHist=TRUE, assign=TRUE)