Contents

library(gemma.R)
library(dplyr)
library(pheatmap)
library(purrr)

1 Introduction

The data in Gemma are manually annotated by curators with terms, often using an ontology term on both dataset and sample level. In Gemma.R three primary functions allow access to these annotations for a given dataset.

In the examples below we will be referring to GSE48962 experiment, where striatum and cerebral cortex samples from control mice and mice belonging to a Huntington model (R6/2) were taken from 8 week and 12 week old mice.

2 Dataset tags

Terms returned via get_dataset_annotations are tags used to describe a dataset in general terms.

get_dataset_annotations('GSE48962') %>%
    gemma_kable
class.name class.URI term.name term.URI object.class
genotype http://www.ebi…/EFO_0000513 Overexpression http://gemma.msl…/TGEMO_00004 FactorValue
study design http://www.ebi…/EFO_0001426 SUBSET http://gemma.msl…/TGEMO_00022 ExperimentTag
strain http://www.ebi…/EFO_0005135 C57BL/6 http://www.ebi…/EFO_0022397 ExperimentTag
organism part http://www.ebi…/EFO_0000635 striatum http://purl.obolibrary…/UBERON_0002435 FactorValue
genotype http://www.ebi…/EFO_0000513 HTT [human] huntingtin http://purl.org/../3064 FactorValue
strain http://www.ebi…/EFO_0005135 CBA http://www.ebi…/EFO_0022463 ExperimentTag
organism part http://www.ebi…/EFO_0000635 cerebral cortex http://purl.obolibrary…/UBERON_0000956 FactorValue

These tags come as a class/term pairs and inherit any terms that is assigned to any of the samples. Therefore we can see all chemicals and cell types used in the experiment.

3 Factor values

Samples and differential expression contrasts in Gemma are annotated with factor values. These values contain statements that describe these samples and which samples belong to which experimental in a differential expression analysis respectively.

3.1 Sample factor values

In gemma.R these values are stored in nested data.tables and can be found by accessing the relevant columns of the outputs. Annotations for samples can be accessed using get_dataset_samples. sample.factorValues column contains the relevant information

samples <- get_dataset_samples('GSE48962')
samples$sample.factorValues[[
    which(samples$sample.name == "TSM490")
    ]] %>% 
    gemma_kable()
category category.URI value value.URI predicate predicate.URI object object.URI summary ID factor.ID factor.category factor.category.URI
block http://www.ebi…/EFO_0005067 Device=HWUSI-EAS1563_0073_F… NA NA NA NA NA Device=HWUSI-EAS1563_0073_F… 163476 32643 block http://www.ebi…/EFO_0005067
timepoint http://www.ebi…/EFO_0000724 12 weeks NA NA NA NA NA 12 weeks 120179 20543 timepoint http://www.ebi…/EFO_0000724
organism part http://www.ebi…/EFO_0000635 striatum http://purl.obolibrary…/UBERON_0002435 NA NA NA NA striatum 120172 20540 organism part http://www.ebi…/EFO_0000635
genotype http://www.ebi…/EFO_0000513 HTT [human] huntingtin http://purl.org/../3064 has_genotype http://purl.obolibrary…/GENO_0000222 CAG repeats NA HTT [human] huntingtin has_… 120175 20541 genotype http://www.ebi…/EFO_0000513
genotype http://www.ebi…/EFO_0000513 HTT [human] huntingtin http://purl.org/../3064 has_genotype http://purl.obolibrary…/GENO_0000222 Overexpression http://gemma.msl…/TGEMO_00004 HTT [human] huntingtin has_… 120175 20541 genotype http://www.ebi…/EFO_0000513

The example above shows a single factor value object for one sample. The rows of this data.table are statements that belong to a factor value. Below each column of this nested table is described. If a given field is filled by an ontology term, the corresponding URI column will contain the ontology URI for the field.

  • category/category.URI: Category of the individual statement, such as treatment, phenotype or strain
  • value/value.URI: The subject of the statement.
  • predicate/predicate.URI: When a subject alone is not enough to describe all details, a statement can contain a predicate and an object. The predicate describes the relationship between the subject of the statement and the object. In the example above, these are used to denote properties of the human HTT in the mouse models
  • object/object.URI: The object of a statement is a property further describing it’s value. In this example these describe the properties of the HTT gene in the mouse model, namely that it has CAG repeats and it is overexpressed. If the value was a drug this could be dosage or timepoint.
  • summary: A plain text summary of the factorValue. Different statements will have the same summary if they are part of the same factor value
  • ID: An integer identifier for the specific factor value. In the example above, the genotype of the mouse is defined as a single factor value made up of two statements stating the HTT gene has CAG repeats and that it is overexpressed. This factor value has the ID of 120175 which is shared by both rows containing the statements describing it. This ID will repeat for every other patient that has the same genotype or differential expression results using that factor as a part of their contrast. For instance we can see which samples that was subjected to this condition using this ID instead of trying to match the other columns describing the statements
id <- samples$sample.factorValues[[
    which(samples$sample.name == "TSM490")
]] %>% filter(value == "HTT [human] huntingtin") %>% {.$ID} %>% unique


# count how many patients has this phenotype
samples$sample.factorValues %>% sapply(\(x){
    id %in% x$ID
}) %>% sum
## [1] 12
  • factor.ID: An integer identifier for the factor. A factor holds specific factor values. For the example above whether or not the mouse is a wild type mouse or if it has a wild type genotype is stored under the id 20541

We can use this to fetch all distinct genotypes

id <- samples$sample.factorValues[[2]] %>% 
    filter(value == "HTT [human] huntingtin") %>% {.$factor.ID} %>% unique

samples$sample.factorValues %>% lapply(\(x){
    x %>% filter(factor.ID == id) %>% {.$summary}
}) %>% unlist %>% unique
## [1] "wild type genotype"                                                             
## [2] "HTT [human] huntingtin has_genotype CAG repeats and has_genotype Overexpression"

This shows us the dataset has control mice and Huntington Disease model mice.. This ID can be used to match the factor between samples and between samples and differential expression experiments - factor.category/factor.category.URI: The category of the whole factor. Usually this is the same with the category of the statements making up the factor value. However in cases like the example above, where the value describes a treatment while the factor overall represents a phenotype, they can differ.

gemma.R includes a convenience function to create a simplified design matrix out of these factor values for a given experiment. This will unpack the nested data.frames and provide a more human readable output, giving each available factor it’s own column.

design <- make_design(samples)
design[,-1] %>% head %>%  # first column is just a copy of the original factor values
    gemma_kable()
organism part block timepoint genotype
ESW183 striatum Device=HWUSI-EAS1563_0072_F… 8 weeks wild type genotype
TSM490 striatum Device=HWUSI-EAS1563_0073_F… 12 weeks HTT [human] huntingtin has_…
TCW9469 cerebral cortex Device=HWUSI-EAS1563_0053:L… 12 weeks wild type genotype
ESM184 striatum Device=HWUSI-EAS1563_0072_F… 8 weeks HTT [human] huntingtin has_…
ESW180 striatum Device=HWUSI-EAS1563_0072_F… 8 weeks wild type genotype
TSW478 striatum Device=HWUSI-EAS1563_0073_F… 12 weeks wild type genotype

Using this output, here we look at the sample sizes for different experimental groups.

design %>%
    group_by(`organism part`,timepoint,genotype) %>% 
    summarize(n= n()) %>% 
    arrange(desc(n)) %>% 
    gemma_kable()
## `summarise()` has grouped output by 'organism part', 'timepoint'. You can
## override using the `.groups` argument.
organism part timepoint genotype n
cerebral cortex 12 weeks HTT [human] huntingtin has_… 3
cerebral cortex 12 weeks wild type genotype 3
cerebral cortex 8 weeks HTT [human] huntingtin has_… 3
cerebral cortex 8 weeks wild type genotype 3
striatum 12 weeks HTT [human] huntingtin has_… 3
striatum 12 weeks wild type genotype 3
striatum 8 weeks HTT [human] huntingtin has_… 3
striatum 8 weeks wild type genotype 3

3.2 Differential expression analysis factor values

For most experiments it contains, Gemma performs automated differential expression analyses. The kinds of analyses that will be performed is informed by the factor values belonging to the samples.

# removing columns containing factor values and URIs for brevity
remove_columns <- c('baseline.factors','experimental.factors','subsetFactor','factor.category.URI')

dea <- get_dataset_differential_expression_analyses("GSE48962")

dea[,.SD,.SDcols = !remove_columns] %>% 
    gemma_kable()
result.ID contrast.ID experiment.ID factor.category factor.ID isSubset probes.analyzed genes.analyzed
492856 120175_120178 8972 genotype,timepoint 20541,20543 TRUE 18969 18972
492855 120175 8972 genotype 20541 TRUE 18968 18971
492854 120178 8972 timepoint 20543 TRUE 18968 18971
492853 120175_120178 8972 genotype,timepoint 20541,20543 TRUE 20621 20623
492852 120175 8972 genotype 20541 TRUE 20621 20623
492851 120178 8972 timepoint 20543 TRUE 20621 20623

The example above shows the differential expression analyses results. Each row of this data.table represents a differential expression contrast connected to a fold change and a p value in the output of get_differential_expression_values function. If we look at the contrast.ID we will see the factor value identifiers returned in the ID column of our sample.factorValues. These represent which factor value is used as the experimental factor. Note that some rows will have two IDs appended together. These represent the interaction effects of multiple factors. For simplicity, we will start from a contrast without an interaction.

contrast <- dea %>% 
    filter(
        factor.category == "genotype" & 
            subsetFactor %>% map_chr('value') %>% {.=='cerebral cortex'} # we will talk about subsets in a moment
        )
# removing URIs for brevity
uri_columns = c('category.URI',
                'object.URI',
                'value.URI',
                'predicate.URI',
                'factor.category.URI')

contrast$baseline.factors[[1]][,.SD,.SDcols = !uri_columns] %>% 
     gemma_kable()
category value predicate object summary ID factor.ID factor.category
genotype wild type genotype NA NA wild type genotype 120174 20541 genotype
contrast$experimental.factors[[1]][,.SD,.SDcols = !uri_columns] %>% 
     gemma_kable()
category value predicate object summary ID factor.ID factor.category
genotype HTT [human] huntingtin has_genotype CAG repeats HTT [human] huntingtin has_… 120175 20541 genotype
genotype HTT [human] huntingtin has_genotype Overexpression HTT [human] huntingtin has_… 120175 20541 genotype

Here, we can see the baseline is the wild type mouse, being compared to the Huntington Disease models

If we examine a factor with interaction, both baseline and experimental factor value columns will contain two factor values.

contrast <- dea %>% 
    filter(
        factor.category == "genotype,timepoint" & 
            subsetFactor %>% map_chr('value') %>% {.=='cerebral cortex'} # we're almost there!
        )
contrast$baseline.factors[[1]][,.SD,.SDcols = !uri_columns] %>% 
     gemma_kable()
category value predicate object summary ID factor.ID factor.category
genotype wild type genotype NA NA wild type genotype 120174 20541 genotype
timepoint 12 weeks NA NA 12 weeks 120179 20543 timepoint
contrast$experimental.factors[[1]][,.SD,.SDcols = !uri_columns] %>% 
     gemma_kable()
category value predicate object summary ID factor.ID factor.category
genotype HTT [human] huntingtin has_genotype CAG repeats HTT [human] huntingtin has_… 120175 20541 genotype
genotype HTT [human] huntingtin has_genotype Overexpression HTT [human] huntingtin has_… 120175 20541 genotype
timepoint 8 weeks NA NA 8 weeks 120178 20543 timepoint

A third place that can contain factorValues is the subsetFactor. Certain differential expression analyses exclude certain samples based on a given factor. In this example we can see that this analysis were only performed on samples from the cerebral cortex.

contrast$subsetFactor[[1]][,.SD,.SDcols = !uri_columns] %>%
     gemma_kable()
category value predicate object summary ID factor.ID factor.category
organism part cerebral cortex NA NA cerebral cortex 120173 20540 NA

The ids of the factor values included in baseline.factors and experimental.factors along with subsetFactor can be used to determine which samples represent a given contrast. For convenience, get_dataset_object function which is used to compile metadata and expression data of an experiment in a single object, includes resultSets and contrasts argument which will return the data already composed of samples representing a particular contrast.

obj <-  get_dataset_object("GSE48962",resultSets = contrast$result.ID,contrasts = contrast$contrast.ID,type = 'list')
obj[[1]]$design[,-1] %>% 
    head %>% gemma_kable()
organism part block timepoint genotype
TCW9469 cerebral cortex Device=HWUSI-EAS1563_0053:L… 12 weeks wild type genotype
TCM9450 cerebral cortex Device=HWI-EAS413_0047:Lane=1 12 weeks HTT [human] huntingtin has_…
ECW176 cerebral cortex Device=HWUSI-EAS1563_0071_F… 8 weeks wild type genotype
ECW178 cerebral cortex Device=HWUSI-EAS1563_0071_F… 8 weeks wild type genotype
ECW180 cerebral cortex Device=HWUSI-EAS1563_0072_F… 8 weeks wild type genotype
ECM175 cerebral cortex Device=HWUSI-EAS1563_0071_F… 8 weeks HTT [human] huntingtin has_…

We suggested that the contrast.ID of a contrast also corresponded to a column in the differential expression results, acquired by get_differential_expression_values. We can use what we have learned to take a look at the expression of genes at the top of the phenotype, treatment interaction. Each result.ID returns its separate table when accessing differential expression values.

dif_vals <- get_differential_expression_values('GSE48962')
dif_vals[[as.character(contrast$result.ID)]] %>% head %>%  
     gemma_kable()
Probe NCBIid GeneSymbol GeneName pvalue corrected_pvalue rank contrast_120175_120178_log2fc contrast_120175_120178_tstat contrast_120175_120178_pvalue
21975 21975 Top3a topoisomerase (DNA) III alpha 0.5748 0.8126 0.7072 -0.2205 -0.5848 0.5748
64580 64580 Ndst4 N-deacetylase/N-sulfotransf… 0.1378 0.5046 0.2730 -3.0116 -1.6487 0.1378
208994 208994 Fam83b family with sequence simila… 0.4930 0.7655 0.6440 -1.4014 -0.7184 0.4930
22414 22414 Wnt2b wingless-type MMTV integrat… 0.4189 0.7179 0.5836 -0.6944 -0.8521 0.4189
102640604 102640604 Gm36629 predicted gene, 36629 0.4834 0.7592 0.6366 1.3084 0.7349 0.4834
74237 74237 Tubgcp2 tubulin, gamma complex comp… 0.7909 0.9210 0.8587 -0.0923 -0.2742 0.7909

To get the top genes found associated with this interaction we access the columns with the correct contrast.ID.

# getting the top 10 genes
top_genes <- dif_vals[[as.character(contrast$result.ID)]] %>% 
    arrange(across(paste0('contrast_',contrast$contrast.ID,'_pvalue'))) %>% 
    filter(GeneSymbol!='' | grepl("|",GeneSymbol,fixed = TRUE)) %>% # remove blank genes or probes with multiple genes
    {.[1:10,]}
top_genes %>% select(Probe,NCBIid,GeneSymbol) %>% 
     gemma_kable()
Probe NCBIid GeneSymbol
67993 67993 Nudt12
100502959 100502959 AV051173
12153 12153 Bmp1
58212 58212 Srrm3
16969 16969 Zbtb7a
19732 19732 Rgl2
382051 382051 Pdp2
319278 319278 Shfl
11501 11501 Adam8
101883 101883 Igflr1

We can then use the expression data returned by get_dataset_object to examine the expression values for these genes.

exp_subset<- obj[[1]]$exp %>% 
    filter(Probe %in% top_genes$Probe)
genes <- top_genes$GeneSymbol

# ordering design file
design <- obj[[1]]$design %>% arrange(genotype,timepoint)

# shorten the resistance label a bit
design$genotype[grepl('HTT',design$genotype)] = "Huntington Model"

exp_subset[,.SD,.SDcols = rownames(design)] %>% t  %>% scale %>% t %>%
    pheatmap(cluster_rows = FALSE,cluster_cols = FALSE,labels_row = genes,
             annotation_col =design %>% select(genotype,timepoint))

Session info

## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] purrr_1.0.2                 listviewer_4.0.0           
##  [3] viridis_0.6.5               viridisLite_0.4.2          
##  [5] pheatmap_1.0.12             SummarizedExperiment_1.35.0
##  [7] Biobase_2.65.0              GenomicRanges_1.57.0       
##  [9] GenomeInfoDb_1.41.0         IRanges_2.39.0             
## [11] S4Vectors_0.43.0            BiocGenerics_0.51.0        
## [13] MatrixGenerics_1.17.0       matrixStats_1.3.0          
## [15] ggrepel_0.9.5               ggplot2_3.5.1              
## [17] dplyr_1.1.4                 data.table_1.15.4          
## [19] gemma.R_3.1.1               BiocStyle_2.33.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1        farver_2.1.1            fastmap_1.1.1          
##  [4] digest_0.6.35           timechange_0.3.0        lifecycle_1.0.4        
##  [7] magrittr_2.0.3          compiler_4.4.0          rlang_1.1.3            
## [10] sass_0.4.9              tools_4.4.0             utf8_1.2.4             
## [13] yaml_2.3.8              knitr_1.46              labeling_0.4.3         
## [16] S4Arrays_1.5.0          htmlwidgets_1.6.4       bit_4.0.5              
## [19] curl_5.2.1              DelayedArray_0.31.0     xml2_1.3.6             
## [22] RColorBrewer_1.1-3      abind_1.4-5             withr_3.0.0            
## [25] grid_4.4.0              fansi_1.0.6             colorspace_2.1-0       
## [28] scales_1.3.0            tinytex_0.50            cli_3.6.2              
## [31] rmarkdown_2.26          crayon_1.5.2            generics_0.1.3         
## [34] rstudioapi_0.16.0       httr_1.4.7              cachem_1.0.8           
## [37] stringr_1.5.1           zlibbioc_1.51.0         assertthat_0.2.1       
## [40] BiocManager_1.30.22     XVector_0.45.0          vctrs_0.6.5            
## [43] Matrix_1.7-0            jsonlite_1.8.8          bookdown_0.39          
## [46] bit64_4.0.5             magick_2.8.3            systemfonts_1.0.6      
## [49] jquerylib_0.1.4         glue_1.7.0              lubridate_1.9.3        
## [52] stringi_1.8.3           gtable_0.3.5            UCSC.utils_1.1.0       
## [55] munsell_0.5.1           tibble_3.2.1            pillar_1.9.0           
## [58] rappdirs_0.3.3          htmltools_0.5.8.1       GenomeInfoDbData_1.2.12
## [61] R6_2.5.1                evaluate_0.23           kableExtra_1.4.0       
## [64] lattice_0.22-6          highr_0.10              memoise_2.0.1          
## [67] bslib_0.7.0             Rcpp_1.0.12             svglite_2.1.3          
## [70] gridExtra_2.3           SparseArray_1.5.0       xfun_0.43              
## [73] pkgconfig_2.0.3