NeuCA Package User’s Guide

Ziyi Li1 and Hao Feng2*

1Department of Biostatistics, The University of Texas MD Anderson Cancer Center
2Department of Population and Quantitative Health Sciences, Case Western Reserve University

*hxf155@case.edu

19 April 2024

Abstract

NEUral-network based Cell Annotation, NeuCA, is a tool for cell type annotation using single-cell RNA-seq data. It is a supervised cell label assignment method that uses existing scRNA-seq data with known labels to train a neural network-based classifier, and then predicts cell labels in the single-cell RNA-seq data of interest.

Package

NeuCA 1.9.2

1 Introduction
2 Preparing NeuCA input files: SingleCellExperiment class
3 NeuCA training and prediction
4 Predicted cell types
Session info

1 Introduction

Fast-advancing single-cell RNA sequencing (scRNA-seq) technology enables transcriptome studies of heterogeneous tissues at single-cell resolution. An important first step in analyzing scRNA-seq data is to annotate cell labels accurately. We present NeuCA, a neural-network based cell annotation method. When closely correlated cell types exist, NeuCA uses the cell type tree information through a hierarchical structure of neural networks to improve annotation accuracy. Feature selection is performed within the hierarchical structure to further improve classification accuracy. When cell type correlations are not high, a feed-forward neural network is adopted.

NeuCA depends on the following packages:

keras, for neural-network interface in R,
limma, for linear model framework and testing markers,
SingleCellExperiment, for data organization and formatting,
e1071, for probability and predictive functions.
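
Before running the examples, it can help to confirm that the packages listed above are installed. The following is a minimal sketch in base R; the variable names are illustrative and the check only looks for the four packages named in this section.

```r
# Packages named in the list above
deps <- c("keras", "limma", "SingleCellExperiment", "e1071")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- deps[!installed]

if (length(missing_deps) > 0) {
    message("Not installed: ", paste(missing_deps, collapse = ", "))
} else {
    message("All listed dependencies are installed.")
}
```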

2 Preparing NeuCA input files: `SingleCellExperiment` class

The scRNA-seq data input for NeuCA must be objects of the Bioconductor SingleCellExperiment class. You may need to read the corresponding vignettes on how to create a SingleCellExperiment object from your own data. An example is provided here to show how to do that, but please note that it is not a comprehensive guide to SingleCellExperiment.

Step 1: Load in example scRNA-seq data.

We use two example datasets here: Baron_scRNA and Seg_scRNA. Baron_scRNA is a droplet-based (inDrop) single-cell RNA-seq dataset generated from pancreas (Baron et al.). Around 10,000 human and 2,000 mouse pancreatic cells from four cadaveric donors and two strains of mice were sequenced. Seg_scRNA is a Smart-Seq2 based single-cell RNA-seq dataset (Segerstolpe et al.). It contains human islet cells from healthy and type-2 diabetic donors. A total of 3,386 cells were collected, with around 350 cells from each donor. Subsets of these two datasets (with a cell type label for each cell) are included here as examples.

library(NeuCA)
data("Baron_scRNA")
data("Seg_scRNA")
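
Once the data are loaded, a quick consistency check between a count matrix and its label vector can catch problems early. The sketch below uses a hypothetical helper, `check_inputs()`, and toy stand-ins for `Baron_counts` and `Baron_true_cell_label`; it assumes the labels are a plain vector or factor with one entry per cell.

```r
# Hypothetical helper: verify that a count matrix and a label vector line up
check_inputs <- function(counts, labels) {
    stopifnot(
        !is.null(rownames(counts)),        # genes need names
        !is.null(colnames(counts)),        # cells need names
        ncol(counts) == length(labels)     # one label per cell
    )
    invisible(TRUE)
}

# Toy stand-ins for Baron_counts and Baron_true_cell_label
toy_counts <- matrix(1:6, nrow = 2,
                     dimnames = list(c("geneA", "geneB"),
                                     c("cell1", "cell2", "cell3")))
toy_labels <- c("alpha", "beta", "alpha")

check_inputs(toy_counts, toy_labels)
table(toy_labels)   # number of cells per type
```

With the example data, the same helper could be called as `check_inputs(Baron_counts, Baron_true_cell_label)`.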

Step 2a: Prepare training data as a SingleCellExperiment object.

Baron_anno = data.frame(Baron_true_cell_label, row.names = colnames(Baron_counts))
Baron_sce = SingleCellExperiment(
    assays = list(normcounts = as.matrix(Baron_counts)),
    colData = Baron_anno
    )
# use gene names as feature symbols
rowData(Baron_sce)$feature_symbol <- rownames(Baron_sce)
# remove features with duplicated names
Baron_sce <- Baron_sce[!duplicated(rownames(Baron_sce)), ]
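
The last line above drops features with repeated names. As a small illustration of what `!duplicated(rownames(x))` does, the sketch below applies it to a toy base R matrix (not a SingleCellExperiment object); the gene and cell names are made up.

```r
# Toy matrix in which "geneA" appears twice
toy <- matrix(1:9, nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneA"),
                              c("cell1", "cell2", "cell3")))

# duplicated() flags the second and later occurrences, so the first row
# carrying each name is the one that is kept
keep <- !duplicated(rownames(toy))
toy_unique <- toy[keep, , drop = FALSE]
rownames(toy_unique)
```

Note that the later rows are discarded rather than merged, so their counts do not contribute to the retained feature.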

Step 2b: Similarly, prepare testing data as a SingleCellExperiment object. Note that true cell type labels are not required for the testing data (and are often unavailable); they are included here only so that the predictions can be evaluated later.

Seg_anno = data.frame(Seg_true_cell_label, row.names = colnames(Seg_counts))
Seg_sce <- SingleCellExperiment(
    assays = list(normcounts = as.matrix(Seg_counts)),
    colData = Seg_anno
)
# use gene names as feature symbols
rowData(Seg_sce)$feature_symbol <- rownames(Seg_sce)
# remove features with duplicated names
Seg_sce <- Seg_sce[!duplicated(rownames(Seg_sce)), ]
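
Because the training and testing datasets come from different platforms, it may be worth checking how many gene names they share. This guide does not state how NeuCA handles non-overlapping genes, so the sketch below is only a precautionary check; the gene vectors are toy stand-ins for `rownames(Baron_sce)` and `rownames(Seg_sce)`.

```r
# Toy gene names standing in for rownames(Baron_sce) and rownames(Seg_sce)
train_genes <- c("INS", "GCG", "SST", "PPY", "KRT19")
test_genes  <- c("GCG", "INS", "PPY", "PECAM1")

shared <- intersect(train_genes, test_genes)
length(shared)                      # number of genes present in both
setdiff(train_genes, test_genes)    # genes only in the training data
```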

3 NeuCA training and prediction

Step 3: With both the training and testing data stored as SingleCellExperiment objects, we can now train the classifier in NeuCA and predict the cell types of the testing dataset. This can be done with one line of code:

predicted.label = NeuCA(train = Baron_sce, test = Seg_sce, 
                        model.size = "big", verbose = FALSE)
#Baron_scRNA is used as the training scRNA-seq dataset
#Seg_scRNA is used as the testing scRNA-seq dataset

NeuCA detects whether highly correlated cell types exist in the training dataset and automatically determines whether to use a general neural-network model or a marker-guided hierarchical neural network for classification.
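
To give some intuition for what "highly correlated cell types" can mean, the sketch below computes pairwise correlations between per-type mean expression profiles on simulated data. This is an illustration only and is not NeuCA's internal procedure; the simulated data, the helper `simulate_cells()`, and the cutoff of 0.95 are all assumptions made for the example.

```r
set.seed(1)
n_genes <- 50
labels <- rep(c("A", "B", "C"), each = 10)   # hypothetical cell types

# Types A and B share almost the same mean profile; C is unrelated
profile_A <- rnorm(n_genes, mean = 5)
profile_B <- profile_A + rnorm(n_genes, sd = 0.1)
profile_C <- rnorm(n_genes, mean = 5)

# Simulate 10 cells per type as noisy copies of the type profile
simulate_cells <- function(profile, n_cells = 10) {
    matrix(rnorm(length(profile) * n_cells, mean = profile, sd = 0.2),
           nrow = length(profile))
}
expr <- cbind(simulate_cells(profile_A),
              simulate_cells(profile_B),
              simulate_cells(profile_C))

# Mean expression profile of each type, then pairwise Pearson correlation
type_means <- sapply(split(seq_along(labels), labels),
                     function(idx) rowMeans(expr[, idx, drop = FALSE]))
type_cor <- cor(type_means)

cutoff <- 0.95   # hypothetical threshold, not taken from NeuCA
round(type_cor, 2)
any(type_cor[upper.tri(type_cor)] > cutoff)
```

In this toy setting, types A and B would be candidates for a hierarchical treatment, while type C is easily separable from both.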

[Tuning parameter] In a neural network, the numbers of layers and nodes are tunable parameters. Users can choose the complexity of the neural network used in NeuCA through the model.size argument. The three possible values, “big”, “medium” and “small”, correspond to a large, medium and small number of nodes and layers, respectively. The model size details are shown in Table 1. In our experience, “big” or “medium” often produces highly accurate predictions.

Table 1: Tuning model sizes in the neural-network classifier training.
Model size	Number of layers	Number of nodes in hidden layers
Small	3	64
Medium	4	64,128
Big	5	64,128,256
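
The model sizes can also be written down as a small lookup, which makes the relationship between the two columns explicit. The sketch below transcribes the values from Table 1; the interpretation that the layer count includes one input and one output layer is an assumption that happens to be consistent with every row, not something stated in this guide.

```r
# Hidden-layer node counts transcribed from Table 1
hidden_nodes <- list(
    small  = c(64),
    medium = c(64, 128),
    big    = c(64, 128, 256)
)
# Total layer counts from the same table
n_layers <- c(small = 3, medium = 4, big = 5)

# Assumption: the layer count includes one input and one output layer
lengths(hidden_nodes) + 2 == n_layers
```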

4 Predicted cell types

predicted.label is a vector whose length equals the number of cells in the testing dataset; it contains the predicted cell type of each cell. It can be viewed directly:

head(predicted.label)

## [1] "alpha" "gamma" "gamma" "gamma" "gamma" "alpha"

table(predicted.label)

## predicted.label
##       alpha        beta       delta      ductal endothelial       gamma 
##         328         109          56          65           9         135
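
The counts above can be turned into proportions to see the predicted composition at a glance. The sketch below re-enters the printed counts by hand so that it runs on its own; with the real object, `prop.table(table(predicted.label))` gives the same proportions.

```r
# Counts transcribed from the table(predicted.label) output above
counts <- c(alpha = 328, beta = 109, delta = 56,
            ductal = 65, endothelial = 9, gamma = 135)

sum(counts)                              # total number of predicted cells
round(100 * counts / sum(counts), 1)     # percentage per predicted type
```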

[Optional] If you have the true cell type labels for the testing dataset, you may evaluate the predictive performance by a confusion matrix:

table(predicted.label, Seg_true_cell_label)

##                Seg_true_cell_label
## predicted.label alpha beta delta ductal endothelial gamma
##     alpha         328    0     0      0           0     0
##     beta            0  109     0      0           0     0
##     delta           0    0    56      0           0     0
##     ductal          1    0     0     64           0     0
##     endothelial     0    0     0      0           9     0
##     gamma           0    0     0      0           0   135
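
Summary metrics can be derived from the confusion matrix. The sketch below rebuilds the matrix from the printed values so that it is self-contained, then computes overall accuracy and per-type recall; the object names are illustrative. With the real objects, overall accuracy could equivalently be computed as `mean(predicted.label == Seg_true_cell_label)`.

```r
types <- c("alpha", "beta", "delta", "ductal", "endothelial", "gamma")
# Rows: predicted label; columns: true label (values copied from above)
conf_mat <- matrix(c(328,   0,  0,  0, 0,   0,
                       0, 109,  0,  0, 0,   0,
                       0,   0, 56,  0, 0,   0,
                       1,   0,  0, 64, 0,   0,
                       0,   0,  0,  0, 9,   0,
                       0,   0,  0,  0, 0, 135),
                   nrow = 6, byrow = TRUE,
                   dimnames = list(predicted = types, true = types))

# Overall accuracy: correctly labelled cells over all cells
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Per-type recall: correct predictions over the number of true cells
recall <- diag(conf_mat) / colSums(conf_mat)
```

Here 701 of the 702 cells are labelled correctly; the single error is a true alpha cell predicted as ductal.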

You may also draw a Sankey diagram to visualize the prediction accuracy.
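
One possible way to build such a diagram is with the networkD3 package, which appears in the session info below. This is a sketch under assumptions rather than the plotting code used for this guide: the label vectors are toy stand-ins for `Seg_true_cell_label` and `predicted.label`, and the "true: " and "pred: " node prefixes are an illustrative choice to keep the two sides of the diagram distinct.

```r
# Toy labels standing in for Seg_true_cell_label and predicted.label
true_label <- c("alpha", "alpha", "beta", "ductal", "gamma")
pred_label <- c("alpha", "ductal", "beta", "ductal", "gamma")

# Count each (true, predicted) pair and drop empty pairs
flows <- as.data.frame(table(true = true_label, pred = pred_label),
                       stringsAsFactors = FALSE)
flows <- flows[flows$Freq > 0, ]

# Nodes: true labels on the left, predicted labels on the right
nodes <- data.frame(name = c(paste0("true: ", unique(flows$true)),
                             paste0("pred: ", unique(flows$pred))),
                    stringsAsFactors = FALSE)

# sankeyNetwork() expects zero-based node indices
links <- data.frame(
    source = match(paste0("true: ", flows$true), nodes$name) - 1,
    target = match(paste0("pred: ", flows$pred), nodes$name) - 1,
    value  = flows$Freq
)

# Build the widget only if networkD3 is installed; print `sankey` in an
# interactive session to display it
if (requireNamespace("networkD3", quietly = TRUE)) {
    sankey <- networkD3::sankeyNetwork(Links = links, Nodes = nodes,
                                       Source = "source", Target = "target",
                                       Value = "value", NodeID = "name")
}
```

Flows that connect a true label to a different predicted label correspond to misclassified cells.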

Session info

## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] networkD3_0.4               knitr_1.46                 
##  [3] NeuCA_1.9.2                 kableExtra_1.4.0           
##  [5] SingleCellExperiment_1.25.1 SummarizedExperiment_1.33.3
##  [7] Biobase_2.63.1              GenomicRanges_1.55.4       
##  [9] GenomeInfoDb_1.39.14        IRanges_2.37.1             
## [11] S4Vectors_0.41.6            BiocGenerics_0.49.1        
## [13] MatrixGenerics_1.15.1       matrixStats_1.3.0          
## [15] e1071_1.7-14                limma_3.59.8               
## [17] keras_2.13.0                BiocStyle_2.31.0           
## 
## loaded via a namespace (and not attached):
##  [1] xfun_0.43               bslib_0.7.0             htmlwidgets_1.6.4      
##  [4] lattice_0.22-6          vctrs_0.6.5             tools_4.4.0            
##  [7] tfruns_1.5.3            generics_0.1.3          proxy_0.4-27           
## [10] highr_0.10              pkgconfig_2.0.3         Matrix_1.7-0           
## [13] lifecycle_1.0.4         GenomeInfoDbData_1.2.12 compiler_4.4.0         
## [16] stringr_1.5.1           statmod_1.5.0           munsell_0.5.1          
## [19] htmltools_0.5.8.1       class_7.3-22            sass_0.4.9             
## [22] yaml_2.3.8              crayon_1.5.2            jquerylib_0.1.4        
## [25] whisker_0.4.1           cachem_1.0.8            DelayedArray_0.29.9    
## [28] abind_1.4-5             digest_0.6.35           stringi_1.8.3          
## [31] bookdown_0.39           fastmap_1.1.1           grid_4.4.0             
## [34] colorspace_2.1-0        cli_3.6.2               SparseArray_1.3.5      
## [37] magrittr_2.0.3          S4Arrays_1.3.7          base64enc_0.1-3        
## [40] UCSC.utils_0.99.7       scales_1.3.0            rmarkdown_2.26         
## [43] XVector_0.43.1          httr_1.4.7              igraph_2.0.3           
## [46] reticulate_1.36.0       png_0.1-8               evaluate_0.23          
## [49] viridisLite_0.4.2       rlang_1.1.3             Rcpp_1.0.12            
## [52] zeallot_0.1.0           glue_1.7.0              xml2_1.3.6             
## [55] BiocManager_1.30.22     svglite_2.1.3           rstudioapi_0.16.0      
## [58] jsonlite_1.8.8          R6_2.5.1                systemfonts_1.0.6      
## [61] zlibbioc_1.49.3         tensorflow_2.16.0