1 Setup

library(EWCE)
library(ggplot2)
set.seed(1234)

NOTE: This documentation is for the development version of EWCE. See Bioconductor for documentation on the current release version.

2 Run cell-type enrichment tests

2.1 Introduction

In the following vignette, we provide a more in-depth version of the examples provided in the Getting started vignette.

2.2 Prepare input data

2.2.1 CellTypeDataset

For this example we use a subset of the genes from the merged dataset generated in the Create a CellTypeDataset section below, which is accessed using ewceData::ctd().

2.2.1.1 CTD levels

Successive levels of a CTD correspond to increasingly refined cell-type/-subtype annotations. For example, in the CTD ewceData::ctd() level 1 includes the cell-type “interneurons”, while level 2 breaks this group into 16 different interneuron subtypes (“Int…”).

## Load merged cortex and hypothalamus dataset generated by Karolinska institute
ctd <- ewceData::ctd() # i.e. ctd_MergedKI
## see ?ewceData and browseVignettes('ewceData') for documentation
## loading from cache
try({
  plt <- EWCE::plot_ctd(ctd = ctd,
                        level = 1,
                        genes = c("Apoe","Gfap","Gapdh"),
                        metric = "mean_exp")
})
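
To make the level structure concrete, the sketch below builds a toy stand-in for a CTD and counts the cell types at each level. The object toy_ctd and its values are hypothetical and are not part of ewceData. The sketch assumes only what is used elsewhere in this vignette: a CTD is a list with one element per level, each holding a genes-by-cell-types matrix named mean_exp.

```r
# Toy stand-in for a CellTypeDataset (hypothetical values):
# one list element per annotation level, each with a genes x cell-types matrix.
toy_genes_ids <- c("Apoe", "Gfap", "Gapdh")
toy_ctd <- list(
  list(mean_exp = matrix(1:6, nrow = 3,
                         dimnames = list(toy_genes_ids,
                                         c("interneurons", "microglia")))),
  list(mean_exp = matrix(1:9, nrow = 3,
                         dimnames = list(toy_genes_ids,
                                         c("Int1", "Int2", "Mgl1"))))
)

# Number of cell types annotated at each level
n_celltypes <- vapply(toy_ctd, function(lvl) ncol(lvl$mean_exp), integer(1))
print(n_celltypes)

# Cell-type names per level; the same call could be tried on the real ctd object
celltype_names <- lapply(toy_ctd, function(lvl) colnames(lvl$mean_exp))
```

Applied to the real ctd object, the same vapply() call should report the number of cell types tested at each level, which the enrichment logs in this vignette give as 7 for level 1 and 48 for level 2.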

Note - You can load ewceData::ctd() offline by passing localhub = TRUE. This will work off a previously cached version of the reference dataset from ExperimentHub.

2.2.2 Gene lists

For the first demonstration of EWCE, we will test whether genes that are genetically associated with Alzheimer’s disease are enriched in any particular cell type.

This example gene list is stored within the ewceData package:

hits <- ewceData::example_genelist()
## see ?ewceData and browseVignettes('ewceData') for documentation
## loading from cache
print(hits)
##  [1] "APOE"     "BIN1"     "CLU"      "ABCA7"    "CR1"      "PICALM"  
##  [7] "MS4A6A"   "CD33"     "MS4A4E"   "CD2AP"    "EOGA1"    "INPP5D"  
## [13] "MEF2C"    "HLA-DRB5" "ZCWPW1"   "NME8"     "PTK2B"    "CELF1"   
## [19] "SORL1"    "FERMT2"   "SLC24A4"  "CASS4"

Note - You can load ewceData::example_genelist() offline by passing localhub = TRUE. This will work off a previously cached version of the reference dataset from ExperimentHub.

2.2.2.1 Gene formats and species

All gene IDs are assumed by the package to be provided in gene symbol format (rather than Ensembl/Entrez). Symbols can be provided as any species-specific gene symbols supported by the package orthogene, though the genelistSpecies argument will need to be set appropriately.

Likewise, the single-cell dataset can be from any species, but the sctSpecies argument must be set accordingly.

The example gene list here contains human genes associated with a human disease, so the genes are given as HGNC symbols.

The next step is to determine the most suitable background set. This can be user-supplied, but by default the background is all 1:1 ortholog genes shared by genelistSpecies and sctSpecies that are also present in sct_data.
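
The default background described above can be illustrated with a small base R sketch. All objects here (toy_orthologs, toy_sct_genes, toy_hits) are hypothetical, and the code only mimics the idea of intersecting 1:1 orthologs with the genes present in the single-cell data. It is not how EWCE builds the background internally.

```r
# Hypothetical 1:1 ortholog map between mouse and human symbols
toy_orthologs <- data.frame(
  mouse = c("Apoe", "Bin1", "Clu", "Gfap"),
  human = c("APOE", "BIN1", "CLU", "GFAP"),
  stringsAsFactors = FALSE
)

# Hypothetical genes present in the (mouse) single-cell dataset
toy_sct_genes <- c("Apoe", "Clu", "Gfap", "Xist")

# Background: human symbols whose mouse ortholog is present in the dataset
toy_background <- toy_orthologs$human[toy_orthologs$mouse %in% toy_sct_genes]

# Target genes outside the background cannot be tested and drop out
toy_hits <- c("APOE", "CLU", "CR1")
toy_hits_used <- intersect(toy_hits, toy_background)
print(toy_hits_used)
```

A filter of this kind is consistent with the log output of the enrichment test in this vignette, where 17 of the 22 example genes remain after filtering.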

2.2.2.2 Notes on orthogene

orthogene substantially improves upon the previous ortholog translation, which used the static ewceData::mouse_to_human_homologs() file, because orthogene is updated periodically using the Homologene database.

exp <- ctd[[1]]$mean_exp

#### Old conversion method ####
#if running offline pass localhub = TRUE
m2h <- ewceData::mouse_to_human_homologs()
## see ?ewceData and browseVignettes('ewceData') for documentation
## loading from cache
exp_old <- exp[rownames(exp) %in% m2h$MGI.symbol,]

#### New conversion method (used by EWCE internally) ####
exp_new <- orthogene::convert_orthologs(gene_df = exp,
                                        input_species = "mouse", 
                                        output_species = "human", 
                                        method = "homologene")
## Preparing gene_df.
## Dense matrix format detected.
## Extracting genes from rownames.
## 15,259 genes extracted.
## Converting mouse ==> human orthologs using: homologene
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Checking for genes without orthologs in human.
## Extracting genes from input_gene.
## 13,416 genes extracted.
## Extracting genes from ortholog_gene.
## 13,416 genes extracted.
## Checking for genes without 1:1 orthologs.
## Dropping 46 genes that have multiple input_gene per ortholog_gene (many:1).
## Dropping 56 genes that have multiple ortholog_gene per input_gene (1:many).
## Filtering gene_df with gene_map
## Setting ortholog_gene to rownames.
## 
## =========== REPORT SUMMARY ===========
## Total genes dropped after convert_orthologs :
##    2,016 / 15,259 (13%)
## Total genes remaining after convert_orthologs :
##    13,243 / 15,259 (87%)
#### Report ####
message("The new method retains ",
        formatC(nrow(exp_new) - nrow(exp_old), big.mark = ","),
        " more genes than the old method.")
## The new method retains 918 more genes than the old method.

orthogene is also used internally to standardise gene lists supplied to EWCE functions, such as EWCE::bootstrap_enrichment_test(hits = <gene_list>).

Not only can it map these gene lists across species, but it can also map them within species. For example, if you provide a list of Ensembl IDs, it will automatically convert them to standardised HGNC gene symbols so they’re compatible with the similarly standardised CellTypeDataset.

2.2.3 Setting analysis parameters

We now need to set the parameters for the analysis. For a publishable analysis we would want to generate at least 10,000 random lists and determine their expression levels, but for computational speed we use only reps = 100 here. We want to analyse level 1 annotations, so we set annotLevel to 1.

# Use 100 bootstrap lists for speed, for publishable analysis use >=10000
reps <- 100 
# Use level 1 annotations (e.g. Interneurons)
annotLevel <- 1 
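
The choice of reps limits how small a p-value can be reported. As a rough sketch, assume the empirical p-value is the fraction of bootstrap lists that score at least as high as the target list. This is an assumption about the general approach, not a statement of EWCE's exact code. Under that assumption the smallest non-zero p-value is 1/reps:

```r
# Candidate numbers of bootstrap lists
reps_options <- c(100, 1000, 10000)

# Smallest non-zero empirical p-value attainable for each choice
smallest_nonzero_p <- 1 / reps_options

data.frame(reps = reps_options, smallest_nonzero_p = smallest_nonzero_p)
```

With reps = 100 the p-values can only move in steps of 0.01, which is why a larger number of lists is recommended for a publishable analysis.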

2.3 Enrichment tests

2.3.1 Default tests

We have now loaded the SCT data, prepared the gene lists and set the parameters. We run the model as follows.

Note: We set the seed at the top of this vignette to ensure reproducibility in the bootstrap sampling function.

2.3.1.1 Parallelisation

You can now speed up the bootstrapping process by parallelising across multiple cores with the parameter no_cores (=1 by default).
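
One way to choose a value for no_cores is sketched below with the base parallel package. The variable n_workers is a hypothetical name, and leaving one core free is a convention of this sketch rather than an EWCE recommendation.

```r
# parallel ships with base R; detectCores() may return NA on some platforms
n_available <- parallel::detectCores()

# Leave one core free for the rest of the system, but never go below 1
n_workers <- if (is.na(n_available)) 1L else max(1L, n_available - 1L)
print(n_workers)

# n_workers could then be supplied to the enrichment test as no_cores = n_workers
```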

# Bootstrap significance test, no control for transcript length and GC content 
full_results <- EWCE::bootstrap_enrichment_test(sct_data = ctd,
                                                sctSpecies = "mouse",
                                                genelistSpecies = "human",
                                                hits = hits, 
                                                reps = reps,
                                                annotLevel = annotLevel)
## 1 core(s) assigned as workers (71 reserved).
## Generating gene background for mouse x human ==> human
## Gathering ortholog reports.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Gene table with 21,207 rows retrieved.
## Returning all 21,207 genes from mouse.
## --
## --
## Preparing gene_df.
## data.frame format detected.
## Extracting genes from Gene.Symbol.
## 21,207 genes extracted.
## Converting mouse ==> human orthologs using: homologene
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Checking for genes without orthologs in human.
## Extracting genes from input_gene.
## 17,355 genes extracted.
## Extracting genes from ortholog_gene.
## 17,355 genes extracted.
## Checking for genes without 1:1 orthologs.
## Dropping 131 genes that have multiple input_gene per ortholog_gene (many:1).
## Dropping 498 genes that have multiple ortholog_gene per input_gene (1:many).
## Filtering gene_df with gene_map
## Adding input_gene col to gene_df.
## Adding ortholog_gene col to gene_df.
## 
## =========== REPORT SUMMARY ===========
## Total genes dropped after convert_orthologs :
##    4,725 / 21,207 (22%)
## Total genes remaining after convert_orthologs :
##    16,482 / 21,207 (78%)
## --
## 
## =========== REPORT SUMMARY ===========
## 16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
## 16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
## Gathering ortholog reports.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## --
## 
## =========== REPORT SUMMARY ===========
## 19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
## 19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
## 16,482 intersect background genes used.
## Standardising CellTypeDataset
## Checking gene list inputs.
## Running without gene size control.
## 17 hit gene(s) remain after filtering.
## Computing gene scores.
## Using previously sampled genes.
## Computing gene counts.
## Testing for enrichment in 7 cell types...
## Sorting results by p-value.
## Computing BH-corrected q-values.
## 1 significant cell type enrichment results @ q<0.05 :
##    CellType annotLevel p fold_change sd_from_mean q
## 1 microglia          1 0    1.965915     3.938119 0

Note - You can run bootstrap_enrichment_test() offline by passing localhub = TRUE. This will work off a previously cached version of the reference dataset from ExperimentHub.

A note on both the background and target gene lists: other common gene list objects, such as BiocSet::BiocSet and GSEABase::GeneSet, can also be used as inputs. Below is an example of how to format each for the target gene list (hits):

if(!"BiocSet" %in% rownames(installed.packages())) {
  BiocManager::install("BiocSet")
}
if(!"GSEABase" %in% rownames(installed.packages())) {
  BiocManager::install("GSEABase")
}

library(BiocSet)
library(GSEABase)

# Save both approaches as hits which will be passed to bootstrap_enrichment_test
genes <- c("Apoe","Inpp5d","Cd2ap","Nme8",
          "Cass4","Mef2c","Zcwpw1","Bin1",
          "Clu","Celf1","Abca7","Slc24a4",
          "Ptk2b","Picalm","Fermt2","Sorl1")

#BiocSet::BiocSet, BiocSet_target contains the gene list target
BiocSet_target <- BiocSet::BiocSet(set1 = genes) 
hits <- unlist(BiocSet::es_element(BiocSet_target))

#GSEABase::GeneSet, GeneSet_target contains the gene list target
GeneSet_target <- GSEABase::GeneSet(genes)
hits <- GSEABase::geneIds(GeneSet_target) 

The main table of results is stored in full_results$results. We can see the most significant results using:

knitr::kable(full_results$results)
|                    |CellType            | annotLevel|    p| fold_change| sd_from_mean|     q|
|:-------------------|:-------------------|----------:|----:|-----------:|------------:|-----:|
|microglia           |microglia           |          1| 0.00|   1.9659148|    3.9381188| 0.000|
|astrocytes_ependymal|astrocytes_ependymal|          1| 0.13|   1.2624889|    1.1553910| 0.455|
|pyramidal_SS        |pyramidal_SS        |          1| 0.80|   0.8699242|   -0.8226268| 1.000|
|oligodendrocytes    |oligodendrocytes    |          1| 0.87|   0.7631149|   -1.0861761| 1.000|
|pyramidal_CA1       |pyramidal_CA1       |          1| 0.89|   0.8202496|   -1.1738063| 1.000|
|endothelial_mural   |endothelial_mural   |          1| 0.90|   0.7674534|   -1.1811797| 1.000|
|interneurons        |interneurons        |          1| 1.00|   0.4012954|   -3.4703413| 1.000|
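
The q column can be checked by hand with base R. The sketch below copies the p-values from the results table and applies the Benjamini-Hochberg correction with stats::p.adjust(). The vector name p_level1 is introduced here for illustration only.

```r
# p-values copied from the results table, in the same row order
p_level1 <- c(microglia = 0.00,
              astrocytes_ependymal = 0.13,
              pyramidal_SS = 0.80,
              oligodendrocytes = 0.87,
              pyramidal_CA1 = 0.89,
              endothelial_mural = 0.90,
              interneurons = 1.00)

# Benjamini-Hochberg adjustment across the 7 cell types
q_level1 <- p.adjust(p_level1, method = "BH")
print(round(q_level1, 3))
```

The adjusted values are 0 for microglia, 0.455 for astrocytes_ependymal (0.13 × 7 / 2) and 1 for the remaining cell types, in agreement with the q column.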

2.3.1.2 Plot results

The results can be visualised using ewce_plot, which shows, for each cell type, how many standard deviations the expression of the target gene list lies from the bootstrapped mean:

try({
  plot_list <- EWCE::ewce_plot(total_res = full_results$results,
                               mtc_method = "BH",
                               ctd = ctd)
  # print(plot_list$plain)
})
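
The quantities reported for each cell type (p, fold_change and sd_from_mean) can be illustrated with a simplified base R sketch for a single cell type. The scores below are random toy values and every object name is hypothetical. The code shows the general bootstrap idea under stated assumptions and is not EWCE's implementation.

```r
# Toy per-gene scores for one cell type across 200 background genes
toy_scores <- runif(200)
names(toy_scores) <- paste0("gene", seq_along(toy_scores))

# Artificial target list: the 10 highest-scoring genes, so enrichment is guaranteed
toy_target <- names(sort(toy_scores, decreasing = TRUE))[1:10]
toy_target_score <- sum(toy_scores[toy_target])

# Summed score of random gene lists of the same size drawn from the background
toy_reps <- 100
toy_boot <- replicate(toy_reps, sum(sample(toy_scores, length(toy_target))))

# Empirical p-value, fold change and standardised distance from the bootstrap mean
toy_p <- sum(toy_boot >= toy_target_score) / toy_reps
toy_fold_change <- toy_target_score / mean(toy_boot)
toy_sd_from_mean <- (toy_target_score - mean(toy_boot)) / sd(toy_boot)

print(c(p = toy_p, fold_change = toy_fold_change, sd_from_mean = toy_sd_from_mean))
```

Because the toy target list is built from the top-scoring genes, no random list can realistically match it, so the p-value is 0 and both the fold change and the standardised distance are positive.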

For publications it can be useful to plot a dendrogram alongside the plot. This can be done by including the cell type data as an additional argument. The dendrogram should automatically align with the graph ticks (thanks to Robert Gordon-Smith for this solution):

print(plot_list$withDendro)  

To view the characteristics of enrichment for each gene within the list, use the generate_bootstrap_plots function. It takes the results of a bootstrapping analysis so that plots are generated only for significant enrichments, and saves the plots into the BootstrapPlots folder. The listFileName argument gives the generated graphs a particular file name. The savePath argument sets where the files are saved, for example a temporary directory, and can be changed to your preferred location. The function returns the file path where the plots were saved, so the directory can be located afterwards.

bt_plot_location <- EWCE::generate_bootstrap_plots(
  sct_data = ctd,
  hits = hits,
  sctSpecies = "mouse",
  genelistSpecies = "human",
  reps = reps,
  annotLevel = annotLevel,
  full_results = full_results)

2.3.2 Control for transcript length and GC-content

When analysing genes found through genetic association studies, it is important to consider biases that might be introduced by transcript length and GC-content. The package can control for these by selecting the bootstrap lists such that the ith gene in the random list has properties similar to the ith gene in the target list. To enable this, the algorithm needs to be passed the gene lists as HGNC symbols rather than MGI symbols.
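
The idea of matching the ith random gene to the ith target gene can be sketched in base R. The example is an illustration only: the gene table, the binning into length terciles and GC halves, and the helper sample_matched() are all hypothetical and do not reproduce EWCE's internal procedure.

```r
# Hypothetical gene properties; values are made up for illustration
toy_genes <- data.frame(
  symbol = paste0("G", 1:12),
  length = (1:12) * 1000,
  gc = c(0.35, 0.40, 0.38, 0.42, 0.50, 0.52, 0.55, 0.51, 0.60, 0.62, 0.65, 0.61),
  stringsAsFactors = FALSE
)

# Bin genes by length tercile and by GC half, then combine into one label
len_bin <- cut(toy_genes$length,
               breaks = quantile(toy_genes$length, probs = 0:3 / 3),
               include.lowest = TRUE, labels = FALSE)
gc_bin <- cut(toy_genes$gc,
              breaks = quantile(toy_genes$gc, probs = 0:2 / 2),
              include.lowest = TRUE, labels = FALSE)
toy_genes$bin <- paste(len_bin, gc_bin, sep = "_")

# For the ith target gene, draw one random gene from the same bin
sample_matched <- function(target, gene_table) {
  vapply(target, function(g) {
    bin <- gene_table$bin[gene_table$symbol == g]
    pool <- gene_table$symbol[gene_table$bin == bin]
    pool[sample.int(length(pool), 1)]
  }, character(1), USE.NAMES = FALSE)
}

toy_target <- c("G1", "G6", "G11")
toy_random <- sample_matched(toy_target, toy_genes)
print(toy_random)
```

Each gene in toy_random shares its length and GC bin with the target gene at the same position, so the random list has a similar property profile to the target list.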

The bootstrapping function then takes different arguments:

# Bootstrap significance test controlling for transcript length and GC content
#if running offline pass localhub = TRUE
cont_results <- EWCE::bootstrap_enrichment_test(
  sct_data = ctd,
  hits = hits,
  sctSpecies = "mouse",
  genelistSpecies = "human", 
  reps = reps,
  annotLevel = annotLevel,
  geneSizeControl = TRUE)
## 1 core(s) assigned as workers (71 reserved).
## Generating gene background for mouse x human ==> human
## Gathering ortholog reports.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Gene table with 21,207 rows retrieved.
## Returning all 21,207 genes from mouse.
## --
## --
## Preparing gene_df.
## data.frame format detected.
## Extracting genes from Gene.Symbol.
## 21,207 genes extracted.
## Converting mouse ==> human orthologs using: homologene
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Checking for genes without orthologs in human.
## Extracting genes from input_gene.
## 17,355 genes extracted.
## Extracting genes from ortholog_gene.
## 17,355 genes extracted.
## Checking for genes without 1:1 orthologs.
## Dropping 131 genes that have multiple input_gene per ortholog_gene (many:1).
## Dropping 498 genes that have multiple ortholog_gene per input_gene (1:many).
## Filtering gene_df with gene_map
## Adding input_gene col to gene_df.
## Adding ortholog_gene col to gene_df.
## 
## =========== REPORT SUMMARY ===========
## Total genes dropped after convert_orthologs :
##    4,725 / 21,207 (22%)
## Total genes remaining after convert_orthologs :
##    16,482 / 21,207 (78%)
## --
## 
## =========== REPORT SUMMARY ===========
## 16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
## 16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
## Gathering ortholog reports.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## --
## 
## =========== REPORT SUMMARY ===========
## 19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
## 19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
## 16,482 intersect background genes used.
## Standardising CellTypeDataset
## Checking gene list inputs.
## Running with gene size control.
## Warning: sctSpecies_origin not provided. Setting to 'mouse' by default.
## Retrieving all genes using: gprofiler
## Retrieving all organisms available in gprofiler.
## Using stored `gprofiler_orgs`.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: hsapiens
## Gene table with 63,150 rows retrieved.
## Returning all 63,150 genes from human.
## 63,150 human Ensembl IDs and 40,977 human Gene Symbols imported.
## see ?ewceData and browseVignettes('ewceData') for documentation
## loading from cache
## Controlled bootstrapping network generated.
## 17 hit gene(s) remain after filtering.
## Computing gene scores.
## Using previously sampled genes.
## Computing gene counts.
## Testing for enrichment in 7 cell types...
## Sorting results by p-value.
## Computing BH-corrected q-values.
## 0 significant cell type enrichment results @ q<0.05 :

2.3.2.1 Plot results

We plot these results using ewce_plot:

try({
  plot_list <- EWCE::ewce_plot(total_res = cont_results$results,
                               mtc_method = "BH")
  print(plot_list$plain)
})

This shows that the controlled method generates enrichments that are generally comparable to those of the standard method, although in this run, with only 100 bootstrap lists, no cell type reaches q < 0.05.

2.3.3 Test different CTD levels

Both analyses shown above were run on level 1 annotations. To test the level 2 cell-type annotations instead, set annotLevel = 2.

# Bootstrap significance test controlling for transcript length and GC content
#if running offline pass localhub = TRUE
cont_results <- EWCE::bootstrap_enrichment_test(sct_data = ctd,
                                                hits = hits, 
                                                sctSpecies = "mouse",
                                                genelistSpecies = "human",
                                                reps = reps,
                                                annotLevel = 2,
                                                geneSizeControl = TRUE)
## 1 core(s) assigned as workers (71 reserved).
## Generating gene background for mouse x human ==> human
## Gathering ortholog reports.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Gene table with 21,207 rows retrieved.
## Returning all 21,207 genes from mouse.
## --
## --
## Preparing gene_df.
## data.frame format detected.
## Extracting genes from Gene.Symbol.
## 21,207 genes extracted.
## Converting mouse ==> human orthologs using: homologene
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Checking for genes without orthologs in human.
## Extracting genes from input_gene.
## 17,355 genes extracted.
## Extracting genes from ortholog_gene.
## 17,355 genes extracted.
## Checking for genes without 1:1 orthologs.
## Dropping 131 genes that have multiple input_gene per ortholog_gene (many:1).
## Dropping 498 genes that have multiple ortholog_gene per input_gene (1:many).
## Filtering gene_df with gene_map
## Adding input_gene col to gene_df.
## Adding ortholog_gene col to gene_df.
## 
## =========== REPORT SUMMARY ===========
## Total genes dropped after convert_orthologs :
##    4,725 / 21,207 (22%)
## Total genes remaining after convert_orthologs :
##    16,482 / 21,207 (78%)
## --
## 
## =========== REPORT SUMMARY ===========
## 16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
## 16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
## Gathering ortholog reports.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## Retrieving all genes using: homologene.
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Gene table with 19,129 rows retrieved.
## Returning all 19,129 genes from human.
## --
## 
## =========== REPORT SUMMARY ===========
## 19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
## 19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
## 16,482 intersect background genes used.
## Standardising CellTypeDataset
## Checking gene list inputs.
## Running with gene size control.
## Warning: sctSpecies_origin not provided. Setting to 'mouse' by default.
## Retrieving all genes using: gprofiler
## Retrieving all organisms available in gprofiler.
## Using stored `gprofiler_orgs`.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: hsapiens
## Gene table with 63,150 rows retrieved.
## Returning all 63,150 genes from human.
## 63,150 human Ensembl IDs and 40,977 human Gene Symbols imported.
## see ?ewceData and browseVignettes('ewceData') for documentation
## loading from cache
## Controlled bootstrapping network generated.
## 17 hit gene(s) remain after filtering.
## Computing gene scores.
## Using previously sampled genes.
## Computing gene counts.
## Testing for enrichment in 48 cell types...
## Sorting results by p-value.
## Computing BH-corrected q-values.
## 0 significant cell type enrichment results @ q<0.05 :
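
Testing more cell types raises the multiple-testing burden. The sketch below assumes that empirical p-values move in steps of 1/reps and looks at the most favourable non-zero case: one cell type at the smallest non-zero p-value and all others at p = 1. The variable names are hypothetical. The cell-type counts (7 and 48) are taken from the log output in this vignette.

```r
reps_used <- 100
smallest_nonzero_p <- 1 / reps_used

# BH-adjusted q-value of the single best cell type among 7 (level 1) tests
q_best_level1 <- p.adjust(c(smallest_nonzero_p, rep(1, 6)), method = "BH")[1]

# The same for 48 (level 2) tests
q_best_level2 <- p.adjust(c(smallest_nonzero_p, rep(1, 47)), method = "BH")[1]

print(c(level1 = q_best_level1, level2 = q_best_level2))
```

In this scenario the best q-value is 0.07 at level 1 and 0.48 at level 2, so with reps = 100 only a p-value of exactly 0 could pass q < 0.05. This is a further reason to increase reps for a publishable analysis, particularly at level 2.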

2.3.3.1 Plot results

try({
  plot_list <- EWCE::ewce_plot(total_res = cont_results$results,
                               mtc_method = "BH")
  print(plot_list$plain)
})