gemma.R 2.99.6
library(gemma.R)
library(dplyr)
library(pheatmap)
The data in Gemma are manually annotated by curators with terms, often using an ontology term on both dataset and sample level. In Gemma.R three primary functions allow access to these annotations for a given dataset.
get_dataset_annotations
: This function returns annotations associated with a dataset. These try to serve as tags describing the dataset as a whole and they characteristics that samples within the datasets have while also including some additional terms.
get_dataset_samples
: This function returns samples and associated annotations related to their experimental groups for an experiment
get_dataset_differential_expression_analyses
: This function returns information about differential expression analyses automatically performed by Gemma for a given experiment. Each row of the output is a contrast where a specific property or an interaction of properties are described.
In the examples below we will be referring to GSE18728 experiment, where drug resistant cancer patients and
Samples and differential expression contrasts in Gemma are annotated with factor values. These values contain statements that describe these samples and which samples belong to which experimental in a differential expression analysis respectively.
In gemma.R these values are stored in nested data.table
s and can be
found by accessing the relevant columns of the outputs. Annotations for samples
can be accessed using get_dataset_samples
. sample.factorValues
column contains
the relevant information
samples <- get_dataset_samples('GSE18728')
samples$sample.factorValues[[1]] %>%
gemma_kable()
category | category.URI | value | value.URI | predicate | predicate.URI | object | object.URI | summary | ID | factor.ID | factor.category | factor.category.URI |
---|---|---|---|---|---|---|---|---|---|---|---|---|
collection of material | http://www.ebi…/EFO_0005066 | DE_Include | http://gemma.msl…/TGEMO_00013 | NA | NA | NA | NA | DE_Include | 134952 | 24474 | collection of material | http://www.ebi…/EFO_0005066 |
block | http://www.ebi…/EFO_0005067 | Batch_09_5/20/05 | NA | NA | NA | NA | NA | Batch_09_5/20/05 | 184673 | 39147 | block | http://www.ebi…/EFO_0005067 |
phenotype | http://www.ebi…/EFO_0000651 | resistant to | http://purl.obolibrary…/PATO_0001178 | toward | http://purl.obolibrary…/RO_0002503 | capecitabine | http://purl.obolibrary…/CHEBI_31348 | resistant to toward docetax… | 76119 | 11430 | phenotype | http://www.ebi…/EFO_0000651 |
phenotype | http://www.ebi…/EFO_0000651 | resistant to | http://purl.obolibrary…/PATO_0001178 | toward | http://purl.obolibrary…/RO_0002503 | docetaxel trihydrate | http://purl.obolibrary…/CHEBI_59809 | resistant to toward docetax… | 76119 | 11430 | phenotype | http://www.ebi…/EFO_0000651 |
treatment | http://www.ebi…/EFO_0000727 | reference substance role | http://purl.obolibrary…/OBI_0000025 | NA | NA | NA | NA | reference substance role | 134950 | 24473 | treatment | http://www.ebi…/EFO_0000727 |
The example above shows a single factor value object for one sample. The rows of this
data.table
are statements that belong to a factor value. Below each column of this
nested table is described. If a given field is filled by an ontology term, the corresponding
URI column will contain the ontology URI for the field.
category
/category.URI
: Category of the individual statement, such as treatment,
phenotype or strainvalue
/value.URI
: The subject of the statement.predicate
/predicate.URI
: When a subject alone is not enough to describe all
details, a statement can contain a predicate and an object. The predicate describes
the relationship between the subject of the statement and the object. In the example
above, these are used to represent resistance to cancer drugs.object
/object.URI
: The object of a statement is a property further describing
it’s value such as the dosage of a drug or a timepoint.summary
: A plain text summary of the factorValue. Different statements will
have the same summary if they are part of the same factor valueID
: An integer identifier for the specific factor value. In the example above,
phenotype of the patient is defined as a single factor value made up of two statements
stating that the patient is resistant to both capecitabine and docetaxel trihydrate. This
factor value has the ID of 76119
which is shared by both rows containing the statements describing it.
This ID will repeat for every other patient that has the same drug resistance
or differential expression results using that factor as a part of their contrast. For
instance we can see which samples that was subjected to this condition using this ID
instead of trying to match the other columns describing the statementsid <- samples$sample.factorValues[[1]] %>%
filter(value == 'resistant to') %>% {.$ID} %>% unique
# count how many patients has this phenotype
samples$sample.factorValues %>% sapply(\(x){
id %in% x$ID
}) %>% sum
## [1] 38
factor.ID
: An integer identifier for the factor. A factor holds specific factor
values. For the example above whether or not a patient has drug resistance is
stored under the idWe can use this to fetch all distinct phenotypes
id <- samples$sample.factorValues[[1]] %>%
filter(value == 'resistant to') %>% {.$factor.ID} %>% unique
samples$sample.factorValues %>% lapply(\(x){
x %>% filter(factor.ID == id) %>% {.$summary}
}) %>% unlist %>% unique
## [1] "resistant to toward docetaxel trihydrate, resistant to toward capecitabine"
## [2] "reference subject role"
This shows us the dataset has control patients and patients who are resistant to
capecitabine and trihydrate.
. This ID can be used to match the factor between samples and between samples
and differential expression experiments
- factor.category
/factor.category.URI
: The category of the whole factor. Usually
this is the same with the category
of the statements making up the factor value.
However in cases like the example above, where the value describes a treatment while
the factor overall represents a phenotype, they can differ.
gemma.R includes a convenience function to create a simplified design matrix out of these factor values for a given experiment. This will unpack the nested data.frames and provide a more human readable output, giving each available factor it’s own column.
design <- make_design(samples)
design[,-1] %>% head %>% # first column is just a copy of the original factor values
gemma_kable()
collection of material | block | phenotype | treatment | |
---|---|---|---|---|
pt15.bl | DE_Include | Batch_09_5/20/05 | resistant to toward docetax… | reference substance role |
pt3.bl | DE_Include | Batch_07_5/18/05 | reference subject role | reference substance role |
pt11.or | DE_Exclude | Batch_05_4/19/05 | resistant to toward docetax… | |
pt24.bl | DE_Include | Batch_06_4/20/05 | resistant to toward docetax… | reference substance role |
pt29.or | DE_Exclude | Batch_01_3/8/05 | resistant to toward docetax… | |
pt20.or | DE_Exclude | Batch_04_4/7/05 | resistant to toward docetax… |
Using this output, here we look at the sample sizes for different experimental groups.
design %>%
group_by(treatment,phenotype,`collection of material`) %>%
summarize(n= n()) %>%
arrange(desc(n)) %>%
gemma_kable()
## `summarise()` has grouped output by 'treatment', 'phenotype'. You can override
## using the `.groups` argument.
treatment | phenotype | collection of material | n |
---|---|---|---|
resistant to toward docetax… | DE_Exclude | 16 | |
reference substance role | resistant to toward docetax… | DE_Include | 13 |
docetaxel trihydrate, capec… | reference subject role | DE_Include | 9 |
docetaxel trihydrate, capec… | resistant to toward docetax… | DE_Include | 9 |
reference substance role | reference subject role | DE_Include | 8 |
reference subject role | DE_Exclude | 6 |
For most experiments it contains, Gemma performs automated differential expression analyses. The kinds of analyses that will be performed is informed by the factor values belonging to the samples.
# removing columns containing factor values and URIs for brevity
remove_columns <- c('baseline.factors','experimental.factors','subsetFactor','factor.category.URI')
dea <- get_dataset_differential_expression_analyses("GSE18728")
dea[,.SD,.SDcols = !remove_columns] %>%
gemma_kable()
result.ID | contrast.ID | experiment.ID | factor.category | factor.ID | isSubset | probes.analyzed | genes.analyzed |
---|---|---|---|---|---|---|---|
573720 | 134951 | 2914 | treatment | 24473 | TRUE | 54674 | 21270 |
573721 | 76119_134951 | 2914 | phenotype,treatment | 11430,24473 | TRUE | 54672 | 21269 |
573722 | 76119 | 2914 | phenotype | 11430 | TRUE | 54674 | 21270 |
The example above shows the differential expression analyses results. Each row of this data.table
represents a differential expression contrast connected to a fold change and a p value in the output of
get_differential_expression_values
function.
If we look at the contrast.ID
we will see the factor value identifiers returned in the ID
column of our
sample.factorValues
. These represent which factor value is used as the
experimental factor. Note that some rows will have two IDs appended together. These
represent the interaction effects of multiple factors. For simplicty, we will start
from a contrast without an interaction.
contrast <- dea[1,]
# removing URIs for brevity
uri_columns = c('category.URI',
'object.URI',
'value.URI',
'predicate.URI',
'factor.category.URI')
contrast$baseline.factors[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
treatment | reference substance role | NA | NA | reference substance role | 134950 | 24473 | treatment |
contrast$experimental.factors[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
treatment | capecitabine | NA | NA | docetaxel trihydrate, capec… | 134951 | 24473 | treatment |
treatment | docetaxel trihydrate | NA | NA | docetaxel trihydrate, capec… | 134951 | 24473 | treatment |
Here, we can see the baseline is the non-resistant controls, being compared to drug resistant patients.
If we examine a factor with interaction, both baseline and experimental factor value columns will contain two factor values.
contrast <- dea[2,]
contrast$baseline.factors[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
NA | NA | NA | NA | NA | NA | NA | NA |
:——– | :—– | :——— | :—— | :——- | –: | ———: | :————— |
contrast$experimental.factors[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
phenotype | resistant to | toward | capecitabine | resistant to toward docetax… | 76119 | 11430 | phenotype |
phenotype | resistant to | toward | docetaxel trihydrate | resistant to toward docetax… | 76119 | 11430 | phenotype |
treatment | capecitabine | NA | NA | docetaxel trihydrate, capec… | 134951 | 24473 | treatment |
treatment | docetaxel trihydrate | NA | NA | docetaxel trihydrate, capec… | 134951 | 24473 | treatment |
A third place that can contain factorValues is
the subsetFactor
. Certain differential expression analyses disclude certain samples
based on a given factor. In this example we can see the curators added a specific
factor to remove some samples from differential expression analyses
contrast$subsetFactor[[1]][,.SD,.SDcols = !uri_columns] %>%
gemma_kable()
category | value | predicate | object | summary | ID | factor.ID | factor.category |
---|---|---|---|---|---|---|---|
collection of material | DE_Include | NA | NA | DE_Include | 134952 | 24474 | NA |
Looking at the design table we made above, you can see that these were samples where treatment was not specified.
The ids of the factor values included in baseline.factors
and experimental.factors
along
with subsetFactor
can be used to determine which samples represent a given contrast.
For convenience, get_dataset_object
function which is used to compile metadata
and expression data of an experiment in a single object, includes resultSets
and contrasts
argument which will return the data already composed of samples representing a particular contrast.
obj <- get_dataset_object("GSE18728",resultSets = contrast$result.ID,contrasts = contrast$contrast.ID,type = 'list')
obj[[1]]$design[,-1] %>%
head %>% gemma_kable()
collection of material | block | phenotype | treatment | |
---|---|---|---|---|
pt15.bl | DE_Include | Batch_09_5/20/05 | resistant to toward docetax… | reference substance role |
pt3.bl | DE_Include | Batch_07_5/18/05 | reference subject role | reference substance role |
pt24.bl | DE_Include | Batch_06_4/20/05 | resistant to toward docetax… | reference substance role |
pt17.bl | DE_Include | Batch_06_4/20/05 | resistant to toward docetax… | reference substance role |
pt26.bl | DE_Include | Batch_09_5/20/05 | resistant to toward docetax… | reference substance role |
pt17.c2 | DE_Include | Batch_06_4/20/05 | resistant to toward docetax… | docetaxel trihydrate, capec… |
We suggested that the contrast.ID
of a contrast also corresponded to a column
in the differential expression results, acquired by get_differential_expression_values
.
We can use what we have learned to take a look at the expression of genes at the top of the
phenotype, treatment interaction. Each result.ID returns its separate table when accessing diferential expression values.
dif_vals <- get_differential_expression_values('GSE18728')
dif_vals[[as.character(contrast$result.ID)]] %>% head %>%
gemma_kable()
Probe | NCBIid | GeneSymbol | GeneName | pvalue | corrected_pvalue | rank | contrast_76119_134951_log2fc | contrast_76119_134951_tstat | contrast_76119_134951_pvalue |
---|---|---|---|---|---|---|---|---|---|
205278_at | 2571 | GAD1 | glutamate decarboxylase 1 | 0.8487 | 0.9780 | 0.8677 | 0.0950 | 0.1922 | 0.8487 |
220185_at | 57731 | SPTBN4 | spectrin beta, non-erythroc… | 0.1935 | 0.7691 | 0.2515 | 0.1804 | 1.3250 | 0.1935 |
206956_at | 632|100527963 | BGLAP|PMF1-BGLAP | bone gamma-carboxyglutamate… | 0.2722 | 0.8151 | 0.3339 | -0.2307 | -1.1150 | 0.2722 |
1554700_at | 1005 | CDH7 | cadherin 7 | 0.7024 | 0.9529 | 0.7371 | 0.0325 | 0.3851 | 0.7024 |
1570225_at | 0.4709 | 0.8919 | 0.5279 | -0.0714 | -0.7287 | 0.4709 | |||
208930_s_at | 3609 | ILF3 | interleukin enhancer bindin… | 0.1652 | 0.7500 | 0.2203 | 0.3637 | 1.4164 | 0.1652 |
To get the top genes found associated with this interaction we access the columns with
the correct contrast.ID
.
# getting the top 10 genes
top_genes <- dif_vals[[as.character(contrast$result.ID)]] %>%
arrange(across(paste0('contrast_',contrast$contrast.ID,'_pvalue'))) %>%
filter(GeneSymbol!='' | grepl("|",GeneSymbol,fixed = TRUE)) %>% # remove blank genes or probes with multiple genes
{.[1:10,]}
top_genes %>% select(Probe,NCBIid,GeneSymbol) %>%
gemma_kable()
Probe | NCBIid | GeneSymbol |
---|---|---|
224643_at | 133619 | PRRC1 |
217791_s_at | 5832 | ALDH18A1 |
207409_at | 3950 | LECT2 |
204802_at | 6236 | RRAD |
215423_at | 10299 | MARCHF6 |
202721_s_at | 2673 | GFPT1 |
221734_at | 133619 | PRRC1 |
218897_at | 80775 | TMEM177 |
222416_at | 5832 | ALDH18A1 |
214618_at | 8837 | CFLAR |
exp_subset<- obj[[1]]$exp %>%
filter(Probe %in% top_genes$Probe)
genes <- top_genes$GeneSymbol
# ordering design file
design <- obj[[1]]$design %>% arrange(treatment,phenotype)
# shorten the resistance label a bit
design$phenotype[design$phenotype == "resistant to toward capecitabine, resistant to toward docetaxel trihydrate"] = "drug resistant"
exp_subset[,.SD,.SDcols = rownames(design)] %>% t %>% scale %>% t %>%
pheatmap(cluster_rows = FALSE,cluster_cols = FALSE,labels_row = genes,
annotation_col =design %>% select(phenotype,treatment))
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] listviewer_4.0.0 viridis_0.6.5
## [3] viridisLite_0.4.2 pheatmap_1.0.12
## [5] SummarizedExperiment_1.33.3 Biobase_2.63.1
## [7] GenomicRanges_1.55.4 GenomeInfoDb_1.39.14
## [9] IRanges_2.37.1 S4Vectors_0.41.7
## [11] BiocGenerics_0.49.1 MatrixGenerics_1.15.1
## [13] matrixStats_1.3.0 ggrepel_0.9.5
## [15] ggplot2_3.5.1 dplyr_1.1.4
## [17] data.table_1.15.4 gemma.R_2.99.6
## [19] BiocStyle_2.31.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.1 fastmap_1.1.1
## [4] digest_0.6.35 timechange_0.3.0 lifecycle_1.0.4
## [7] magrittr_2.0.3 compiler_4.4.0 rlang_1.1.3
## [10] sass_0.4.9 tools_4.4.0 utf8_1.2.4
## [13] yaml_2.3.8 knitr_1.46 labeling_0.4.3
## [16] S4Arrays_1.3.7 htmlwidgets_1.6.4 bit_4.0.5
## [19] curl_5.2.1 DelayedArray_0.29.9 xml2_1.3.6
## [22] RColorBrewer_1.1-3 abind_1.4-5 withr_3.0.0
## [25] purrr_1.0.2 grid_4.4.0 fansi_1.0.6
## [28] colorspace_2.1-0 scales_1.3.0 tinytex_0.50
## [31] cli_3.6.2 rmarkdown_2.26 crayon_1.5.2
## [34] generics_0.1.3 rstudioapi_0.16.0 httr_1.4.7
## [37] cachem_1.0.8 stringr_1.5.1 zlibbioc_1.49.3
## [40] assertthat_0.2.1 BiocManager_1.30.22 XVector_0.43.1
## [43] vctrs_0.6.5 Matrix_1.7-0 jsonlite_1.8.8
## [46] bookdown_0.39 bit64_4.0.5 magick_2.8.3
## [49] systemfonts_1.0.6 jquerylib_0.1.4 glue_1.7.0
## [52] lubridate_1.9.3 stringi_1.8.3 gtable_0.3.5
## [55] UCSC.utils_0.99.7 munsell_0.5.1 tibble_3.2.1
## [58] pillar_1.9.0 rappdirs_0.3.3 htmltools_0.5.8.1
## [61] GenomeInfoDbData_1.2.12 R6_2.5.1 evaluate_0.23
## [64] kableExtra_1.4.0 lattice_0.22-6 highr_0.10
## [67] memoise_2.0.1 bslib_0.7.0 Rcpp_1.0.12
## [70] svglite_2.1.3 gridExtra_2.3 SparseArray_1.3.5
## [73] xfun_0.43 pkgconfig_2.0.3