Spatial transcriptomics (ST), named Method of the Year 2020 by Nature Methods in 2020, is a powerful and widely-used experimental method for profiling genome-wide gene expression across a tissue. In a typical ST experiment, fresh-frozen (or FFPE) tissue is sectioned and placed onto a slide containing spots, with each spot containing millions of capture oligonucleotides with spatial barcodes unique to that spot. The tissue is imaged, typically via Hematoxylin and Eosin (H&E) staining. Following imaging, the tissue is permeabilized to release mRNA which then binds to the capture oligonucleotides, generating a cDNA library consisting of transcripts bound by barcodes that preserve spatial information. Data from an ST experiment consists of the tissue image coupled with RNA-sequencing data collected from each spot. A first step in processing ST data is tissue detection, where spots on the slide containing tissue are distinguished from background spots without tissue. Unique molecular identifier (UMI) counts at each spot containing tissue are then used in downstream analyses.
Ideally, a gene-specific UMI at a given spot would represent expression of that gene at that spot, and spots without tissue would show no (or few) UMIs. This is not the case in practice. Messenger RNA bleed from nearby spots causes substantial contamination of UMI counts, an artifact we refer to as spot swapping. On average, we observe that more than 30% of UMIs at a tissue spot did not originate from this spot, but from other spots contaminating it. Spot swapping confounds downstream inferences including normalization, marker gene-based annotation, differential expression and cell type decomposition.
We developed SpotClean to adjust for the effects of spot swapping in ST experiments. SpotClean is able to measure the per-spot contamination rates in observed data and decontaminate gene expression levels, thus increases the sensitivity and precision of downstream analyses. Our package
SpotClean is built based on 10x Visium spatial transcriptomics experiments, currently the most widely-used commercial protocol, providing functions to load raw spatial transcriptomics data from 10x Space Ranger outputs, decontaminate the spot swapping effect, estimate contamination levels, visualize expression profiles and spot labels on the slide, and connect with other widely-used packages for further analyses. SpotClean can be potentially extended to other spatial transcriptomics data as long as the gene expression data in both tissue and background regions are available.
SpotClean is compatible with the
SpatialExperiment class and supports downstream analyses via
As a computational tool for analyzing spatial transcriptomics data,
SpotClean has been submitted to Bioconductor, making the R package well-documented and well-maintained for improved reproducibility and user experience.
Install from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("SpotClean")
Load package after installation:
Here we quickly demonstrate the general SpotClean workflow on the built-in example data. A step-by-step illustration can be found in the next section.
Note: codes in this chunk are only for illustrative purpose and are not runnable. Runnable codes can be found in the next section.
# Not run # Load 10x Visium data mbrain_raw <- read10xRaw("/path/to/matrix/folder") mbrain_slide_info <- read10xSlide("/path/to/tissue/csv", "/path/to/tissue/image", "/path/to/scale/factor") # Visualize raw data mbrain_obj <- createSlide(count_mat = mbrain_raw, slide_info = mbrain_slide_info) visualizeSlide(slide_obj = mbrain_obj) visualizeHeatmap(mbrain_obj,rownames(mbrain_raw)) # Decontaminate raw data decont_obj <- spotclean(mbrain_obj) # Visualize decontaminated gene visualizeHeatmap(decont_obj,rownames(mbrain_raw)) # Visualize the estimated per-spot contamination rate visualizeHeatmap(decont_obj,metadata(decont_obj)$contamination_rate, logged = FALSE, legend_title = "contamination rate", legend_range = c(0,1)) # (Optionally) Transform to Seurat object for downstream analyses seurat_obj <- convertToSeurat(decont_obj,image_dir = "/path/to/spatial/folder")
SpatialExperiment is another widely used S4 class for storing ST data, including the count matrix, cell and gene-level metadata, and tissue image.
spotclean() can be directly applied to
SpatialExperiment objects constructed from
SpatialExperiment::read10xVisium(), which reads in the raw data from 10x Visium Space Ranger output. The visualization functions in our package currently do not support
SpatialExperiment class, but there are alternative options such as
# Not run library(SpatialExperiment) slide_obj <- read10xVisium(samples = "/path/to/spaceranger/output/", data = "raw") # must specify data = "raw" decont_obj <- spotclean(slide_obj) str(assays(decont_obj)$decont)
The computational speed is related to the size and structure of input datasets, mainly driven by the number of tissue spots. SpotClean does not require parallel computation, and thus does not use up too many CPU or memory resources. As a reference, SpotClean running on a medium-size dataset (around 30,000 genes and 2,000 tissue spots) under default gene filtering takes less than 15 minutes.
Too many tissue spots (not enough background spots) While the observed data is a single matrix with a fixed number of columns (spots), the number of unknown parameters is proportional to the number of tissue spots. In the extreme case where all spots are covered by tissue, we have more unknown parameters than observed data values. In this case the contaminated expressions are confounded with true expressions, and SpotClean estimation becomes unreliable. We recommend that the input data have at least 25% spots not occupied by tissue, so that SpotClean has enough information from background spots to estimate contamination.
Lowly-expressed genes Lowly-expressed genes typically contain relatively less information and relatively more noise than highly-expressed genes. SpotClean by default only keeps highly-expressed or highly-variable genes for decontamination (or both). It can be forced to run on manually-specified lowly-expressed genes. However, even in this case, expression for the lowly-expressed genes is typically not changed very much. Given the high sparsity in most lowly expressed genes, there is not enough information available to confidently reassign UMIs in most cases. However, we do not filter genes by sparsity because there can be interesting genes highly concentrated in a small tissue region. In cases like this, SpotClean is effective at adjusting for spot swapping in these regions. If the defaults are not appropriate, users can either adjust the expression cutoffs or manually specify genes to decontaminate.
Inference based on sequencing depth SpotClean reassigns bled-out UMIs to their tissue spots of origin which changes the estimated sequencing depth of tissue spots after decontamination, since most estimations of sequencing depth rely on total expressions at every spot. As a result, decontamination can be considered as another type of normalization and might conflict with existing sequencing depth normalization methods.
SpotClean leads to improved estimates of expression by correcting for spot swapping. In other words, SpotClean reduces noise by enhancing signal in highly expressed regions and reducing measured signal in unexpressed regions. Consequently, SpotClean will improve the accuracy of marker gene-based inferences including tissue type annotation, cell type decomposition, integration with single-cell RNA-seq data, and associated downstream analyses. SpotClean also improves the identification of spatially variable and differentially expressed (DE) genes. We note that in some cases, the p-values associated with known DE genes may increase slightly following SpotClean due to increased variability within the spot groups (i.e. truly expressed regions become more highly expressed; and signal is removed from unexpressed regions).
SpotClean will not alter clusters substantially in most datasets given that clusters are largely determined by relatively few highly expressed genes. While clusters may become slightly better defined, in most cases we do not see big differences in the number of clusters and/or relationships among clusters after applying SpotClean.
Currently, the most widely-used spatial transcriptomics protocol is 10x Visium. Public 10x datasets can be found here. In this vignette, we illustrate the usage of
SpotClean using the V1_Adult_Mouse_Brain Visium data.
Two parts of 10x Space Ranger output files are required as input to
SpotClean: the raw gene-by-spot count matrix, and the slide metadata. In this example, you can download and unzip the Feature / cell matrix (raw) and Spatial imaging data from V1_Adult_Mouse_Brain. You will get a folder
matrix.mtx.gz, and a folder
scalefactors_json.json. The former folder contains the raw gene-by-spot count matrix, and the latter contains slide metadata like slide images and spot IDs and positions.
SpotClean provides functions
read10xSlide() to directly read these 10x Space Ranger output files and generate a gene-by-spot count matrix and slide metadata in R. Here for the purpose of demonstration, we have provided the count matrix and slide metadata with our package. The count matrix is a built-in example object
mbrain_raw, which is a subset of the gene-by-spot count matrix from
read10xRaw(), containing the top 100 highest expressed genes. The slide metadata, located at
extdata/V1_Adult_Mouse_Brain_spatial, contains the three basic files (
scalefactors_json.json) to run SpotClean and other associated packages in this vignette.
# load count matrix data(mbrain_raw) str(mbrain_raw)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots ## ..@ i : int [1:457217] 0 1 2 3 4 5 6 7 8 9 ... ## ..@ p : int [1:4993] 0 99 199 299 399 499 592 692 788 857 ... ## ..@ Dim : int [1:2] 100 4992 ## ..@ Dimnames:List of 2 ## .. ..$ : chr [1:100] "Bc1" "Gm42418" "mt-Co3" "mt-Atp6" ... ## .. ..$ : chr [1:4992] "AAACAACGAATAGTTC-1" "AAACAAGTATCTCCCA-1" "AAACAATCTACTAGCA-1" "AAACACCAATAACTGC-1" ... ## ..@ x : num [1:457217] 141 266 217 202 106 102 84 50 56 44 ... ## ..@ factors : list()
# read spatial metadata spatial_dir <- system.file(file.path("extdata", "V1_Adult_Mouse_Brain_spatial"), package = "SpotClean") list.files(spatial_dir)
##  "scalefactors_json.json" "tissue_lowres_image.png" ##  "tissue_positions_list.csv"
mbrain_slide_info <- read10xSlide(tissue_csv_file=file.path(spatial_dir, "tissue_positions_list.csv"), tissue_img_file = file.path(spatial_dir, "tissue_lowres_image.png"), scale_factor_file = file.path(spatial_dir, "scalefactors_json.json")) str(mbrain_slide_info)
## List of 2 ## $ slide:'data.frame': 4992 obs. of 8 variables: ## ..$ barcode : chr [1:4992] "ACGCCTGACACGCGCT-1" "TACCGATCCAACACTT-1" "ATTAAAGCGGACGAGC-1" "GATAAGGGACGATTAG-1" ... ## ..$ tissue : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... ## ..$ row : int [1:4992] 0 1 0 1 0 1 0 1 0 1 ... ## ..$ col : int [1:4992] 0 1 2 3 4 5 6 7 8 9 ... ## ..$ imagerow: num [1:4992] 63.8 70 63.8 70 63.8 ... ## ..$ imagecol: num [1:4992] 61.8 65.3 68.8 72.3 75.8 ... ## ..$ height : int [1:4992] 600 600 600 600 600 600 600 600 600 600 ... ## ..$ width : int [1:4992] 576 576 576 576 576 576 576 576 576 576 ... ## $ grob :List of 12 ## ..$ raster : 'raster' chr [1:600, 1:576] "#81837E" "#80837E" "#81837E" "#81837E" ... ## ..$ x : 'simpleUnit' num 0.5npc ## .. ..- attr(*, "unit")= int 0 ## ..$ y : 'simpleUnit' num 0.5npc ## .. ..- attr(*, "unit")= int 0 ## ..$ width : 'simpleUnit' num 1npc ## .. ..- attr(*, "unit")= int 0 ## ..$ height : 'simpleUnit' num 1npc ## .. ..- attr(*, "unit")= int 0 ## ..$ just : chr "centre" ## ..$ hjust : NULL ## ..$ vjust : NULL ## ..$ interpolate: logi TRUE ## ..$ name : chr "GRID.rastergrob.11" ## ..$ gp : list() ## .. ..- attr(*, "class")= chr "gpar" ## ..$ vp : NULL ## ..- attr(*, "class")= chr [1:3] "rastergrob" "grob" "gDesc"
In the following, we will show the
SpotClean workflow using the built-in example data.
We combine count matrix and slide metadata together as one single slide object with class
slide_obj <- createSlide(mbrain_raw, mbrain_slide_info) slide_obj
## class: SummarizedExperiment ## dim: 100 4992 ## metadata(2): slide grob ## assays(1): raw ## rownames(100): Bc1 Gm42418 ... Rps25 Rps20 ## rowData names(0): ## colnames(4992): AAACAACGAATAGTTC-1 AAACAAGTATCTCCCA-1 ... ## TTGTTTGTATTACACG-1 TTGTTTGTGTAAATTC-1 ## colData names(0):
As shown above, the raw count matrix is stored in the
raw assay slot, and slide and image information are stored in
grob metadata slots.
Our package provides multiple visualization functions in the 2-D slide space. Function
visualizeSlide() shows the input slide imaging file (if given) in R: