Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of gene expression heterogeneity within complex biological systems. As scRNA-seq technology becomes increasingly accessible and cost-effective, experiments are generating data from larger and larger numbers of cells. However, the analysis of large scRNA-seq data remains a challenge, particularly in terms of scalability. While numerous analysis tools have been developed to tackle the complexities of scRNA-seq data, their scalability is often limited, posing a major bottleneck in the analysis of large-scale experiments. In particular, the R package Seurat is one of the most widely used tools for exploring and analyzing scRNA-seq data, but its scalability is often limited by available memory.
To address this issue, we introduce a new R package called “SCArray.sat” that extends the Seurat classes and functions to support Genomic Data Structure (GDS) files as a DelayedArray backend for data representation. GDS files store multiple dense and sparse array-based data sets in a hierarchical structure. This package defines a new class, called “SCArrayAssay” (derived from the Seurat class “Assay”), which wraps raw counts, normalized expressions, and scaled data matrices based on GDS-specific DelayedMatrix. It is designed to integrate seamlessly with the Seurat package to provide common data analysis in a workflow, with optimized algorithms for GDS data files.
~
~
The SeuratObject package defines the Assay
class with three members/slots counts
, data
and scale.data
storing raw counts, normalized expressions and scaled data matrix respectively. However, counts
and data
should be either a dense matrix or a sparse matrix defined in the Matrix package. The scalability of the sparse matrix is limited by the number of non-zero values (should be < 2^31), since the Matrix package uses 32-bit indices internally. scale.data
in the Assay
class is defined as a dense matrix, so it is also limited by the available memory. The new class SCArrayAssay
is derived from Assay
, with three additional slots counts2
, data2
and scale.data2
replacing the old ones. These new slots can be DelayedMatrix wrapping an on-disk data matrix, without loading the data in memory.
The SCArray.sat package takes advantage of the S3 object-oriented methods defined in the SeuratObject and Seurat packages to reduce code redundancy, by implementing the functions specific to the classes SCArrayAssay
and SC_GDSMatrix
(GDS-specific DelayedMatrix). Table 1 shows a list of key S3 methods for data analysis. For example, the function NormalizeData.SC_GDSMatrix()
will be called when a SC_GDSMatrix
object is passed to the S3 generic NormalizeData()
, while NormalizeData.Assay()
and NormalizeData.Seurat()
are unchanged. In addition, the SCArray and SCArray.sat packages implement the optimized algorithms for the calculations, by reducing the on-disk data access and taking the GDS data structure into account.
~
Table 1: Key S3 methods implemented in the SCArray.sat package.
Methods | Description | Note |
---|---|---|
GetAssayData.SCArrayAssay() | Accessor function for ‘SCArrayAssay’ objects | |
SetAssayData.SCArrayAssay | Setter functions for ‘Assay’ objects | |
NormalizeData.SC_GDSMatrix() | Normalize raw count data | Store a DelayedMatrix |
ScaleData.SC_GDSMatrix() | Scale and center the normalized data | |
FindVariableFeatures.SC_GDSMatrix() | Identifies features | |
RunPCA.SC_GDSMatrix() | Run a PCA dimensionality reduction |
SC_GDSMatrix: GDS- and single-cell- specific DelayedMatrix.
~
~
Requires SCArray (≥ v1.7.13), SeuratObject (≥ v4.0), Seurat (≥ v4.0)
Bioconductor repository
To install this package, start R and enter:
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("SCArray.sat")
# Load the packages
suppressPackageStartupMessages({
library(Seurat)
library(SCArray)
library(SCArray.sat)
})
# Input GDS file with raw counts
fn <- system.file("extdata", "example.gds", package="SCArray")
# show the file contents
(f <- scOpen(fn))
## Object of class "SCArrayFileClass"
## File: /home/biocbuild/bbs-3.19-bioc/R/site-library/SCArray/extdata/example.gds (504.7K)
## + [ ] *
## |--+ feature.id { Str8 1000 LZMA_ra(59.0%), 3.7K }
## |--+ sample.id { Str8 850 LZMA_ra(13.2%), 1.8K }
## |--+ counts { SparseReal32 1000x850 LZMA_ra(12.4%), 495.3K }
## |--+ feature.data [ ]
## |--+ sample.data [ ]
## | |--+ Cell_ID { Str8 850 LZMA_ra(13.2%), 1.8K }
## | |--+ Cell_type { Str8 850 LZMA_ra(2.98%), 165B }
## | \--+ Timepoint { Str8 850 LZMA_ra(3.73%), 229B }
## \--+ meta.data [ ]
scClose(f) # close the file
# Create a Seurat object from the GDS file
d <- scNewSeuratGDS(fn)
## Input: /home/biocbuild/bbs-3.19-bioc/R/site-library/SCArray/extdata/example.gds
## counts: 1000 x 850
class(GetAssay(d)) # SCArrayAssay, derived from Assay
## [1] "SCArrayAssay"
## attr(,"package")
## [1] "SCArray.sat"
d <- NormalizeData(d)
## Performing log-normalization
d <- FindVariableFeatures(d, nfeatures=500)
## Calculating gene variances
## Calculating feature variances of standardized and clipped values
d <- ScaleData(d)
## Centering and scaling data matrix (SC_GDSMatrix [500x850])
Let’s check the internal data matrices,
# get the file name for the on-disk object
scGetFiles(d)
## [1] "/home/biocbuild/bbs-3.19-bioc/R/site-library/SCArray/extdata/example.gds"
# raw counts
m <- GetAssayData(d, slot="counts")
scGetFiles(m) # the file name storing raw count data
## [1] "/home/biocbuild/bbs-3.19-bioc/R/site-library/SCArray/extdata/example.gds"
m
## <1000 x 850> sparse SC_GDSMatrix object of type "double":
## 1772122_301_C02 1772122_180_E05 ... 1772122_180_B06
## MRPL20 2.133695 1.423133 . 0.000000
## GNB1 3.342935 2.346631 . 0.000000
## RPL22 2.133695 2.183267 . 2.982560
## PARK7 1.247608 2.487018 . 1.980463
## ENO1 3.037644 3.431596 . 3.129447
## ... . . . .
## SSR4 0.000000 2.346631 . 2.810320
## RPL10 3.342935 1.987902 . 1.416593
## SLC25A6-loc1 2.391330 2.183267 . 2.338835
## RPS4Y1 0.000000 2.183267 . 1.980463
## CD24 3.821575 1.744869 . 0.000000
## 1772122_180_D09
## MRPL20 1.757196
## GNB1 0.000000
## RPL22 2.733620
## PARK7 1.757196
## ENO1 2.360130
## ... .
## SSR4 1.223211
## RPL10 2.103432
## SLC25A6-loc1 1.223211
## RPS4Y1 2.360130
## CD24 1.757196
# normalized expression
# the normalized data does not save in neither the file nor the memory
GetAssayData(d, slot="data")
## <1000 x 850> sparse SC_GDSMatrix object of type "double":
## 1772122_301_C02 1772122_180_E05 ... 1772122_180_B06
## MRPL20 2.133695 1.423133 . 0.000000
## GNB1 3.342935 2.346631 . 0.000000
## RPL22 2.133695 2.183267 . 2.982560
## PARK7 1.247608 2.487018 . 1.980463
## ENO1 3.037644 3.431596 . 3.129447
## ... . . . .
## SSR4 0.000000 2.346631 . 2.810320
## RPL10 3.342935 1.987902 . 1.416593
## SLC25A6-loc1 2.391330 2.183267 . 2.338835
## RPS4Y1 0.000000 2.183267 . 1.980463
## CD24 3.821575 1.744869 . 0.000000
## 1772122_180_D09
## MRPL20 1.757196
## GNB1 0.000000
## RPL22 2.733620
## PARK7 1.757196
## ENO1 2.360130
## ... .
## SSR4 1.223211
## RPL10 2.103432
## SLC25A6-loc1 1.223211
## RPS4Y1 2.360130
## CD24 1.757196
# scaled and centered data matrix
# in this example, the scaled data does not save in neither the file nor the memory
GetAssayData(d, slot="scale.data")
## <1000 x 850> sparse SC_GDSMatrix object of type "double":
## 1772122_301_C02 1772122_180_E05 ... 1772122_180_B06
## MRPL20 2.133695 1.423133 . 0.000000
## GNB1 3.342935 2.346631 . 0.000000
## RPL22 2.133695 2.183267 . 2.982560
## PARK7 1.247608 2.487018 . 1.980463
## ENO1 3.037644 3.431596 . 3.129447
## ... . . . .
## SSR4 0.000000 2.346631 . 2.810320
## RPL10 3.342935 1.987902 . 1.416593
## SLC25A6-loc1 2.391330 2.183267 . 2.338835
## RPS4Y1 0.000000 2.183267 . 1.980463
## CD24 3.821575 1.744869 . 0.000000
## 1772122_180_D09
## MRPL20 1.757196
## GNB1 0.000000
## RPL22 2.733620
## PARK7 1.757196
## ENO1 2.360130
## ... .
## SSR4 1.223211
## RPL10 2.103432
## SLC25A6-loc1 1.223211
## RPS4Y1 2.360130
## CD24 1.757196
Perform PCA and UMAP:
d <- RunPCA(d, ndims.print=1:2)
## Run PCA on the scaled data matrix ...
## Calculating the covariance matrix [500x500] ...
## PC_ 1
## Positive: NPM1, RPLP1, RPL35, HNRNPA1P10, RPS20, RPS6, RPS19, RPS3, HMGB2, RPL32
## RPL31P11-p1, HSPE1-MOB4, RPS23, HMGN2, RPS10-NUDT3, SNRPE, RPS24, CKS1B, H2AFZ, RPS14
## RPA3, RPL18A, RPS18-loc6, EEF1B2, SHFM1, TMA7, KIAA0101, RPS3A, RPL37A, SNRPG
## Negative: DCX, STMN2, MAP1B, NCAM1, GAP43, RTN1, BASP1, KIF5C, DPYSL3, DCC
## MIAT, TTC3, MALAT1, CRMP1, SOX11, TUBB3, GPM6A, TUBA1A, WSB1, TUBB2B
## RTN4, NNAT, SCG2, TUBB2A, MAP2, SEZ6L2, ONECUT2, MAP6, ENO2, CNTN2
## PC_ 2
## Positive: TUBA1B, NUCKS1, HNRNPA2B1, MARCKSL1, MARCKS, HNRNPD, NES, HNRNPA1, KHDRBS1, LOC644936-p1
## CKB, SET, MIR1244-3-loc4, TUBA1C, SNORD38A, DEK, SOX11, SFPQ, HNRNPU, IGF2BP1
## CBX5, NASP, RPS17-loc1, SMC4, RPS17-loc2, RPL41, CENPF, HMGB1, HDAC2, RRM1
## Negative: RPL13AP5, RPL31P11-p1, RPS14, RPS3A, TTR, RPL37A, RPL18A, PMCH, RPLP1, RPS19
## RPL23A, RPS3, OLFM3, ANXA2, RPL32, RPS13, SULF1, CDO1, TRPM3, COL1A1
## RPL18, MTRNR2L8, RNA5-8S5-loc2, MIR611, MALAT1, HTR2C, RNA5-8S5-loc1, RPS25, HES1, LDHA
DimPlot(d, reduction="pca")