Contents

1 Introduction


Why do we need group markers

In biological and clinical research, the identification of biomarkers specific to a disease, tissue, or cell type is a critical step in advancing our understanding and application of this knowledge.

Specific biomarkers can be identified through various methods, such as transcriptomics, proteomics, or metabolomics, which can provide a global view of the molecular landscape of the system being studied.

To denote these biomarkers, we use the term “group markers” in a generic sense, referring to any molecular feature that is specific to a particular group or condition of interest.

Such markers can provide insight into disease diagnosis, prognosis, and treatment options, and can help to differentiate between diseased and healthy states. Thus the identification of the group markers is crucial in various biological and medical applications, as it allows us to distinguish between different cell types or disease states.

For example, in cancer research, identifying marker genes that are differentially expressed between cancer cells and normal cells can help diagnose and monitor the progression of the disease, as well as identify potential therapeutic targets. Similarly, in developmental biology, identifying marker genes that are specific to certain cell types or stages can help us understand the underlying mechanisms of differentiation and development.

Furthermore, marker genes can also be used in diagnostic assays to detect specific diseases or monitor treatment responses. For instance, the presence of certain marker genes in a patient’s blood or tissue sample can indicate the presence or progression of a disease.

Importance of immune cells signature in tumor micro-environment (TME)

In this demo, we will be using the example of immune cell signature in the context of the TME as a practical case study to showcase the application of our approach.

TME is made up of a diverse range of cell types (fibroblasts, epithelial cells, endothelial cells, and immune cells) as well as various extracellular components (collagens, growth factors, hormones, cytokines, etc.). Tumor immune micro-environment (TIME) is reported to be highly associated with prognosis and various treatment response to many kinds of cancers.

Recent studies have highlighted the role of immune components in the TME in modulating tumor progression, making them attractive therapeutic targets. These components make up TIME, which is a subset of the TME. Subsequently, it’s proved that tumor infiltrating lymphocytes (TILs) play an essential role in tumor progression, metastasis and treatment response.

This drives TILs to be a strong prognostic indicator for better precision therapy of cancer patients. The identification of these TILs are the subject of extensive research interest due to the roles of specific subset of immune cells acting on different tissues types.

Identification of these cell types are based on flow cytometry in the past, which has limited granularity. But now with the advance of single cell RNA-seq and spatial transcriptomics technologies, more detailed and novel cell types or subtypes can be identified. One of the key advantages of scRNA-seq and spatial technologies is the ability to investigate cell types in the TIME and understand cellular heterogeneity in the TME.

To quantify TILs infiltration (in bulk), estimate cellular composition (in bulk), annotate single cell types (in scRNA) or identify sample states/activities, many computational methods like cell deconvolution (CIBERSORTx), marker based annotation (CelliD) and sample scoring (singscore) are developed. But to distinguish between closely related cell types, estimate cell composition and states, a refined signature is required which typically requires manual curation by domain expert. With the increasing large amount of data and cell types, this is gradually getting to be untenable and will require a different approach to automatically screen such signatures.

mastR is a software package specifically designed to automatically screen signatures of interest for particular research questions. This automated process saves significant time and effort that would otherwise be required for manual labor.

What this package does

mastR, Markers Automated Screening Tool in R, is a package to automatically derive signature for specific group of interest in specific tissue.

With mastR, users can simply input their expression data containing the groups of interest, along with group labels, to obtain a list of marker genes (signature) for the target group.

While there are many tools available for generating cell type-specific markers, they are primarily designed for scRNA-seq data and often rely on machine learning algorithms to select relevant features for classification. However, these methods may lack robustness and consistency across datasets when compared to using statistical methods like empirical Bayesian in limma. Furthermore, some of these tools may return a signature even when the data does not contain any, leading to potential inaccuracies.

Although differential expression (DE) analysis can also be done on scRNA data, like Seurat::FindMarkers(), it’s reported that DE analysis done on pseudo-bulk scRNA data is more robust and reliable than directly done on scRNA data.

Thus, mastR is designed to generate a more refined list of signature genes from multiple group comparisons based on the results from edgeR and limma DE analysis workflow. The final signature is selected by rank product score test for picking up genes with high ranks in the most of comparisons. The rank can be ordered by any gene statistic generated by limma analysis. Signature can be further refined by keeping the top n DEGs in the specified comparison(s), which can help to improve the discrimination between fairly similar cell types.

Another unique advantage of mastR is that it takes into account the background expression of tissue-specificity, which is often ignored by other marker generation tools. Unlike most tools that only consider the genes’ contribution to the classification, mastR also offers a background expression removal function. This function is designed to remove parts of the marker genes that are highly expressed in specific tissues or cancer cells. These signals from the background, regarded as background expression, can cause potential ambiguity when applying the markers to specific tissues due to background or sample purity issues. This feature makes mastR apart and enhances the accuracy and specificity of the generated markers.

Furthermore, mastR allows users to build a markers pool before signature screening, which might contain the potential markers of interest to the users. The final signature will be integrated with this pool (by intersection), which can help to constrain the final signature within the interested pathway-related genes or functional gene-sets. People can borrow the published knowledge to build this.

The motivation of this package arises from the importance and necessity of identifying specific genes that are differentially expressed in different groups or tissues, as these genes can serve as biomarkers for diagnosis, prognosis, and therapeutic targeting. However, identifying marker genes can be a challenging and time-consuming task, and the presence of background expression can lead to erroneous results. Our package simplifies and streamlines this process, allowing researchers to focus on their analyses and interpretations without the burden of manual marker gene selection and background expression removal.

This report demonstrates the main functions and applications of mastR 1.7.0.

This report demonstrates the signature screening workflow of NK cells in colorectal cancer (CRC), assessing the results by using in-built visualization functions.

mastR screen the signature using the following 3 key steps:

step 1. build markers pool
step 2. identify signature from the pool
step 3. refine signature by background expression
step 4. visualize signature performance

Applications

2 Installation


mastR R package can be installed from Bioconductor or GitHub.

The most updated version of mastR is hosted on GitHub and can be installed using devtools::install_github() function provided by devtools.

# if (!requireNamespace("devtools", quietly = TRUE)) {
#   install.packages("devtools")
# }
# if (!requireNamespace("mastR", quietly = TRUE)) {
#   devtools::install_github("DavisLaboratory/mastR")
# }

if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("mastR", quietly = TRUE)) {
  BiocManager::install("mastR")
}
packages <- c(
  "BiocStyle",
  "clusterProfiler",
  "ComplexHeatmap",
  "depmap",
  "enrichplot",
  "ggrepel",
  "gridExtra",
  "jsonlite",
  "knitr",
  "rmarkdown",
  "RobustRankAggreg",
  "rvest",
  "singscore",
  "UpSetR"
)
for (i in packages) {
  if (!requireNamespace(i, quietly = TRUE)) {
    install.packages(i)
  }
  if (!requireNamespace(i, quietly = TRUE)) {
    BiocManager::install(i)
  }
}
library(mastR)
library(edgeR)
library(ggplot2)
library(GSEABase)

3 Step 1. Build Markers Pool


The first step is to define the original markers pool this analysis will be based on.

The final signature will only be the intersected genes with this markers pool. The whole gene list of the data will be regarded as the markers pool if no preliminary result or interested genes are provided.

Note: markers = NULL won’t keep any special genes if they fail the filtration by edgeR.

If users have any preliminary knowledge about the target group type (cell type), they can build the markers pool of interest from the available datasets or build from MSigDB, PanglaoDB or LM7/LM22 using our in-built functions. All genes in the pool will be reserved for DE analysis even even if they are filtered out during the filtration of edgeR.

The standard pool building process involves the following:

  1. generate from sources
    1. LM7/22 signature matrix for CIBERSORT
    2. MSigDB
    3. PanglaoDB
    4. Customized gene list
  2. merge gene-sets together

mastR allows the following markers to be conveniently loaded:

All markers generation functions will return GeneSet or GeneSetCollection object.

3.1 Generate Markers from Sources

To generate the immune cell signatures, firstly users need to define a pool of markers of interest. mastR allows users to build markers pool from multiple resources.

In this demo, we will load some some publicly available example datasets.

3.1.1 i) Leukocyte gene signature Matrix (LM)

lm7/lm22 are immune cells signature matrices from CIBERSORT, contains 7 or 22 immune cell types.

Users can use function get_lm_sig() to generate immune cells markers from LM7 and/or lm22.

data("lm7", "lm22")

## collect both LM7 and LM22
LM <- get_lm_sig(lm7.pattern = "^NK", lm22.pattern = "NK cells")
LM
#> GeneSetCollection
#>   names: LM7, LM22 (2 total)
#>   unique identifiers: CD244, FASLG, ..., ZNF135 (92 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: NullCollection (1 total)

Function gsc_plot() allows visualization of an UpSetR plot across all gene-sets.

## show upset diagram
gsc_plot(LM)

3.1.2 ii) MSigDB

The Molecular Signatures Database (MSigDB) is a database of annotated gene sets, typically used for pathway analysis.

Users can use get_gsc_sig() to generate a collection of gene sets for the biological subjects of interest.

In this case, NK cell relevant gene-sets from msigdb_gobp_nk are collected.

data("msigdb_gobp_nk")
MSig <- get_gsc_sig(
  gsc = msigdb_gobp_nk,
  pattern = "NATURAL_KILLER_CELL_MEDIATED"
)
MSig
#> GeneSetCollection
#>   names: GOBP_NATURAL_KILLER_CELL_MEDIATED_IMMUNITY, GOBP_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY_DIRECTED_AGAINST_TUMOR_CELL_TARGET, ..., GOBP_POSITIVE_REGULATION_OF_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY (18 total)
#>   unique identifiers: AP1G1, ARRB2, ..., KLRC4-KLRK1 (67 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)

Note: data = "msigdb" allows searching of MsigDB without loading into the environment. species or version are optional parameters.

Similarly, using gsc_plot() users can visualize the overlapping gene sets.

As the gene-set names are too long, for better visualization the gene-set names are replaced with letters and shown below.

## cut gene set name within 11 characters
gsn <- setNames(names(MSig), LETTERS[seq_along(MSig)])
for (i in seq_along(MSig)) {
  setName(MSig[[i]]) <- LETTERS[i]
}

## show upset diagram of collected gene-sets
gsc_plot(MSig)

gsn[c("A", "M", "D")] ## show gene-set names of top 3
#>                                                          A 
#>               "GOBP_NATURAL_KILLER_CELL_MEDIATED_IMMUNITY" 
#>                                                          M 
#>           "GOBP_NATURAL_KILLER_CELL_MEDIATED_CYTOTOXICITY" 
#>                                                          D 
#> "GOBP_REGULATION_OF_NATURAL_KILLER_CELL_MEDIATED_IMMUNITY"

There are 18 gene-sets in MSig, which is too many for visualization. Thus, we use function merge_markers() to merge all gene-sets into one GeneSet object. (Note: Users can directly use the un-merged object for subsequent analyses.)

## merge all gene sets into one
MSig <- merge_markers(MSig)
setName(MSig) <- "MSigDB"

3.1.3 iii) PanglaoDB

PanglaoDB is a database of single cell RNA sequencing experiments with cell type markers.

Users can use get_panglao_sig() to generate markers based on the required organs and cell types.

Functions list_panglao_organs() and list_panglao_types() show all available organs and cell types in PanglaoDB.

Note: This requires real-time connection to PanglaoDB, thus not run in this demo.

## show available organs on PanglaoDB
list_panglao_organs()

## show available cell types of interest organ on PanglaoDB
list_panglao_types(organ = "Immune system")

## collect all "NK cells" markers from PanglaoDB website
Panglao <- get_panglao_sig(type = "NK cells")

Panglao

3.1.4 iv) Customized gene list

Customized markers can be imported as a GeneSet object.

Here we demonstrate this by loading the in-built nk_markers markers dataset and convert it into GeneSet object.

## show what nk_markers looks like:
data("nk_markers")
nk_markers
#> # A tibble: 114 × 4
#>    HGNC_Symbol LM22  LM7   Huntington
#>    <chr>       <chr> <chr> <chr>     
#>  1 APOBEC3G    TRUE  -     -         
#>  2 APOL6       TRUE  -     -         
#>  3 AZU1        TRUE  -     -         
#>  4 BPI         TRUE  -     -         
#>  5 CAMP        TRUE  -     -         
#>  6 CCL4        TRUE  -     -         
#>  7 CCL5        TRUE  -     TRUE      
#>  8 CCND2       TRUE  -     -         
#>  9 CD160       TRUE  -     -         
#> 10 CD2         TRUE  -     -         
#> # ℹ 104 more rows

## convert NK markers into `GeneSet` object
nk_m <- GeneSet(nk_markers$HGNC_Symbol,
  geneIdType = SymbolIdentifier(),
  setName = "NK_markers"
)

3.2 Pool Markers from Sources

With multiple lists of markers from different sources, use merge_markers() to merge them into one GeneSet object.

gsc <- GeneSetCollection(c(nk_m, LM, MSig)) ## add Panglao if you run it
Markers <- merge_markers(gsc)

## upset plot
gsc_plot(gsc)


Markers
#> setName: merged_markers_pool 
#> geneIds: APOBEC3G, APOL6, ..., KLRC4-KLRK1 (total: 167)
#> geneIdType: Symbol
#> collectionType: Computed 
#> details: use 'details(object)'

The summary of the merged list is saved as longDescription in the output.

## to show the table summary of merged list
head(jsonlite::fromJSON(GSEABase::longDescription(Markers)))
#>       Gene NK_markers LM7 LM22 MSigDB
#> 1    AP1G1          -   -    -   TRUE
#> 2 APOBEC3G       TRUE   - TRUE      -
#> 3    APOL6       TRUE   - TRUE      -
#> 4    ARRB2          -   -    -   TRUE
#> 5     AZU1       TRUE   - TRUE      -
#> 6      BPI       TRUE   - TRUE      -

Now with a merged markers pool, we refine the signature genes for group specificity and tissue background removal.

4 Step 2. Signature Identification for Target Group


For group specificity, there are 3 main steps:

  1. differential expression (DE) analysis: filtration, normalization, sample weighting and linear model fit;
  2. feature selection: select differentially expressed genes based on rank product score;
  3. constrain selected genes within the markers pool.

Note: External data is accepted in different formats (e.g. DGEList, eSet, matrix). Input data must be raw counts or log-transformed expression data.

In this demo, we use the imported example data im_data_6 from GSE60424 (Download using GEOquery::getGEO()), consisting of immune cells from healthy individuals.

im_data_6 is a eSet object, containing RNA-seq TMM normalized counts data of 6 sorted immune cell types each with 4 samples. More details in ?mastR::im_data_6.

data("im_data_6")
im_data_6
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 50045 features, 24 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: GSM1479438 GSM1479439 ... GSM1479525 (24 total)
#>   varLabels: title geo_accession ... years since diagnosis:ch1 (66
#>     total)
#>   varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#>   pubMedIds: 25314013 
#> Annotation: GPL15456

4.1 a) data processing

To screen signature for the group specificity, the data needs to be pre-processed.

process_data() in mastR does the ‘End-to-End’ differential expression analysis using edgeR and limma pipeline as a single function call.

Processes under the hood:

  • Filter data by the given cutoff, genes with low expression will be removed by edgeR.

  • If data is raw counts (normalize = TRUE), normalize data by ‘TMM’ and fit it using limma::voom(). Otherwise apply trend of limma on normalized data.

  • Fit linear model.

  • Compute gene statistic for differential expression analysis using limma::treat().

im_data_6 has 6 immune cell types:

table(im_data_6$`celltype:ch1`)
#> 
#>     B-cells         CD4         CD8   Monocytes          NK Neutrophils 
#>           4           4           4           4           4           4

process_data() requires an expression matrix and group labels of the samples. This returns a DGEList object with processed data.

To keep consistent with Markers, use param gene_id to convert ENSEMBL IDs of im_data_6 to SYMBOLs for intersection.

Refer to help(proc_data) for optional parameters.

proc_data <- process_data(
  data = im_data_6,
  group_col = "celltype:ch1",
  target_group = "NK",
  markers = geneIds(Markers),
  gene_id = "ENSEMBL" ## rownames of im_data_6 is ENSEMBL ID
)
#> 'select()' returned 1:many mapping between keys and columns
#>        NK-Neutrophils NK-Monocytes NK-B.cells NK-CD4 NK-CD8
#> Down             4012         3949       3146   2698   2155
#> NotSig           1492         2694       4429   5001   6201
#> Up               4945         3806       2874   2750   2093

attributes(proc_data)
#> $names
#> [1] "counts"          "samples"         "original_counts" "vfit"           
#> [5] "tfit"           
#> 
#> $class
#> [1] "DGEList"
#> attr(,"package")
#> [1] "edgeR"

For ease of process later in this demo, add the expression matrix E fitted by limma::voom() as ‘voomE’ into proc_data.

## add voom fitted expression as a new list of proc_data for use
proc_data$voomE <- proc_data$vfit$E

Note:

mastR provides visualization functions to compare the data before and after process_data() and assess the quality control (QC) by using plot_diagnostics() and plot_mean_var().

To assess the removal of the low quality genes, use plot_diagnostics() to show the expression distribution, RLE and MDS plots.

4.2 b) signature selection based on differential expression

For selection of group specific signature, pass proc_data to select_sig() function. Genes with high rank product score in DE results are selected.

mastR automatically determines if feature selection is required, the default approach (feature_selection = "auto") performs rank product scoring to select genes. But if the numbers of the result are < 5, no feature selection will be conducted (switch to feature_selection = "none").

## get the same result as there's permutation test for rank product
set.seed(123)
sig_ct <- select_sig(proc_data, feature_selection = "auto")

head(sig_ct)
#> GeneSetCollection
#>   names: UP, DOWN (2 total)
#>   unique identifiers: ENSG00000149294, ENSG00000173068, ..., ENSG00000162591 (98 total)
#>   types in collection:
#>     geneIdType: NullIdentifier (1 total)
#>     collectionType: NullCollection (1 total)

Note:

mastR also implements a feature to further optimize the discriminative power of the signature between fairly similar groups by using params keep.top and keep.group. More details in help(select_sig).

Tips:

All above steps from 4.1 to 4.2 for refining group signature can be done using an integrated function get_degs().

It will return a list of a processed data proc_data and a GeneSetCollection DEGs with UP and DOWN regulated genes.

QC plots can also be shown by setting optional param plot = TRUE.

4.3 c) constrain signature within markers pool

To constrain the final signature within the interested gene list, intersect the signature genes sig_ct with the markers pool Markers. In this demo, only the UP regulated genes are kept.

For consistency with Markers type, convert ENSEMBL IDs of im_data_6 to gene SYMBOLs.

## convert ensembl IDs into symbols to match markers pool
deg_up <- mapIds(
  org.Hs.eg.db::org.Hs.eg.db,
  geneIds(sig_ct[["UP"]]),
  "SYMBOL", "ENSEMBL"
)
#> 'select()' returned 1:1 mapping between keys and columns
deg_up <- na.omit(deg_up)
## markers specific for NK cells
m_ct <- intersect(geneIds(Markers), deg_up)
names(m_ct) <- names(deg_up)[match(m_ct, deg_up)] ## set ensembl ID as names for downstream visualization

head(m_ct)
#> ENSG00000122223 ENSG00000198821 ENSG00000172543 ENSG00000145649 ENSG00000197540 
#>         "CD244"         "CD247"          "CTSW"          "GZMA"          "GZMM" 
#> ENSG00000100385 
#>         "IL2RB"

Users can use pca_matrix_plot() to assess the group separation.

Looking at the variance of PC1 shown below, it’s clear the intersected genes explain more variance in PC1 compared with all UP DEGs.

## PCA shows clear separation of NK cells
## after intersection
pca_matrix_plot(proc_data,
  features = m_ct,
  group_by = "celltype.ch1",
  slot = "voomE",
  n = 3,
  gene_id = "ENSEMBL"
) +
  patchwork::plot_annotation("Intersected UP DEGs")
#> 'select()' returned 1:many mapping between keys and columns

## before intersection
pca_matrix_plot(proc_data,
  features = as.vector(deg_up),
  group_by = "celltype.ch1",
  slot = "voomE",
  n = 3,
  gene_id = "ENSEMBL"
) +
  patchwork::plot_annotation("All UP DEGs")
#> 'select()' returned 1:many mapping between keys and columns