1 Introduction

Gene lists derived from the results of genomic analyses are rich in biological information. For instance, differentially expressed genes (DEGs) from a microarray or RNA-Seq analysis are related functionally in terms of their response to a treatment or condition. Gene lists can vary in size, up to several thousand genes, depending on the robustness of the perturbations or how widely different the conditions are biologically. Having a way to associate biological relatedness between hundreds and thousands of genes systematically is impractical by manually curating the annotation and function of each gene. Over-representation analysis (ORA) of genes was developed to identify biological themes. Given a Gene Ontology (GO) and an annotation of genes that indicate the categories each one fits into, significance of the over-representation of the genes within the ontological categories is determined by a Fisher’s exact test or modeling according to a hypergeometric distribution. Comparing a small number of enriched biological categories for a few samples is manageable using Venn diagrams or other means for assessing overlaps. However, with hundreds of enriched categories and many samples, the comparisons are laborious. Furthermore, if there are enriched categories that are shared between samples, trying to represent a common theme across them is highly subjective. goSTAG uses GO subtrees to tag and annotate genes within a set. goSTAG visualizes the similarities between the over-representation of DEGs by clustering the p-values from the enrichment statistical tests and labels clusters with the GO term that has the most paths to the root within the subtree generated from all the GO terms in the cluster.

2 Installation

The goSTAG package is available through Bioconductor and you can install it by running:

## try http:// if https:// URLs are not supported

Once the library is installed, you can load it by running:


3 Preparing Data for Analysis

This section describes how to prepare your data to perform a goSTAG analysis.

3.1 Gene Lists

The key input data that goSTAG requires is a set of gene lists. These gene lists can be derived from any type of gene-based analysis, such as differential expression analysis or ChIP-seq peak targets. They can be generated within R using another package, or using an external tool and then loaded into R.

Once you have generated your gene lists and loaded them into R, you must convert them into the format that goSTAG requires, which is a list of vectors. Each element of the list corresponds to a gene list, and the vectors contain the genes in the gene list.

We provide an example set of gene lists in the goSTAG package.

data( goSTAG_example_gene_lists )
head( lapply( goSTAG_example_gene_lists, head ) )
## $Day1
## [1] "CNN2"    "GHDC"    "GALNT10" "LIPT1"   "TPP1"    "TBPL2"  
## $Day2
## [1] "GPR37L1"  "TNFSF13B" "OLAH"     "WWC1"     "NYNRIN"   "MRGPRX4" 
## $Day3
## [1] "MAD2L1BP"      "MRPL47"        "SLIRP"         "MEF2BNB-MEF2B"
## [5] "TMEM245"       "UBA6"

For convienience, we have provided a function for loading external gene lists.

The first option is to put your gene lists into a single GMT file. This GMT file is a tab-delimited file containing one gene list per line. The first entry in the line is the name of the gene list, the second entry in the line is a description of the gene list (which goSTAG ignores), and the rest of the entries are the genes. For more information on the GMT format, see here.

Once you have created your GMT file, you can load it using the loadGeneLists function.

gene_lists <- loadGeneLists( "gene_lists.gmt" )

The second option is to have your gene lists in separate tab-delimited text files within a directory.

You can load the gene lists within this directory using the loadGeneLists function with the type argument set to "DIR".

gene_lists <- loadGeneLists( "/gene/lists/directory", type = "DIR" )

By default, the loadGeneLists function expects the files in this directory to have no header and to have the genes in the first column. You can modify this behavior using the header argument and the col argument.

In this example, the files contain a header that needs to be ignored, and the genes are in the 7th column.

gene_lists <- loadGeneLists( "/gene/lists/directory", type = "DIR", header = TRUE, col = 7 )

3.2 GO Terms

The second input data required by goSTAG is a set of GO terms and the genes associated with those GO terms. The GO terms should all belong to the same GO domain (either biological process, cellular component, or molecular function). The expected format is a list of vectors, where each list corresponds to a GO term, and the vectors contain the genes associated with the GO terms. The names of the lists should be the GO IDs of the GO terms. The list must also have an additional element named “ALL”, which is a vector that contains all annotated genes.

The goSTAG package provides the loadGOTerms function for loading the GO terms and gene associations from BioMart using the biomaRt package.

go_terms <- loadGOTerms()
head( lapply( go_terms, head ) )
## $`GO:0006810`
## [1] "SLC25A26" "GJA3"     "GJB2"     "HERC1"    "SLC35A2"  "BAG6"    
## $`GO:0048011`
## [1] "BTC"   "FGF7"  "FURIN" "APH1B" "NDN"   "AGO2" 
## $`GO:0007173`
## [1] "BTC"     "FGF7"    "AGO2"    "FGF9"    "MAPKAP1" "FGF2"   
## $`GO:0008543`
## [1] "BTC"     "FGF7"    "AGO2"    "FGF9"    "MAPKAP1" "FGF2"   
## $`GO:0048015`
## [1] "BTC"     "FGF7"    "UBE2C"   "AGO2"    "FGF9"    "MAPKAP1"
## $`GO:0008284`
## [1] "BTC"   "FGF7"  "EDN3"  "RPS4X" "AVP"   "FGF9"
head( go_terms[["ALL"]] )
## [1] "SLC25A26" "BTC"      "GAST"     "HAP1"     "JUP"      "FKBP10"

Since loading the GO terms from BioMart can take several minutes, the default behavior of the loadGOTerms function is to use a previously generated version that has been archived in the goSTAG package, which is much faster.

You can load the most recent version of the GO terms and gene associations directly from BioMart by setting the use_archived argument to FALSE.

go_terms <- loadGOTerms( use_archived = FALSE )

By default, the loadGOTerms function will associate genes based on the annotation for human. It will also load GO terms for the biological process domain and remove any GO terms that has fewer than 5 genes associated with it. To modify this behavior, use the species, domain, and min_num_genes arguments. Currently, the supported options for the species argument are “human”, “mouse”, or “rat”. Acceptable options for the domain argument are “BP” (biological process), “CC” (cellular component), or “MF” (molecular function).

This example will load GO terms for mouse, using the molecular function domain, and remove any GO terms with fewer than 10 genes.

go_terms_mouse <- loadGOTerms( species = "mouse", domain = "MF", min_num_genes = 10 )

4 Running goSTAG

Once your gene lists and GO terms are loaded, you are ready to perform a goSTAG analysis. This section will detail the four steps in the goSTAG analysis.

4.1 Generating the Enrichment Matrix

The first step is to create a matrix of GO enrichment scores. The goSTAG package calculates an enrichment score for each gene list and GO term using the hypergeometric test. It will remove any GO term that has no samples with significant enrichment. The enrichment score is calculated as the -log10 p-value of the hypergeometric test. For tests without significant enrichment, the enrichment score is set to 0.

To generate an enrichment matrix from your gene lists and GO terms, use the performGOEnrichment function.

enrichment_matrix <- performGOEnrichment( goSTAG_example_gene_lists, go_terms )
##                Day1     Day2     Day3
## GO:0007165 0.000000 0.000000 1.343007
## GO:0007369 0.000000 0.000000 1.389698
## GO:0042127 0.000000 0.000000 1.729184
## GO:0007249 1.862509 1.766861 1.654304
## GO:0008380 1.345807 0.000000 1.306134
## GO:0016049 0.000000 1.436532 0.000000

This function produces a matrix of enrichment scores, where columns are gene lists and rows are GO terms.

By default, this function will remove any GO term where none of the samples have a p-value of less than 0.05. You can specify whether to use a nominal p-value or an adjusted p-value using the filter_method argument, and you can change the significance threshold using the significance_threshold argument. This function uses the p.adjust method to adjust the p-value, and you can specify the method used to adjust the p-values with the p.adjust_method argument. See the method argument for the p.adjust function for available options.

This example generates the enrichment matrix and uses a Benjamini-Hochberg FDR of 0.3 to filter the GO terms.

enrichment_matrix_FDR <- performGOEnrichment( goSTAG_example_gene_lists, go_terms, filter_method = "p.adjust", significance_threshold = 0.3, p.adjust_method = "BH" )

4.2 Hierarchical Clustering

After generating the enrichment matrix, the next step is to cluster the GO terms. Before clustering the GO terms into groups, you must first perform hierarchical clustering.

You can perform the hierarchical clustering using the performHierarchicalClustering function.

hclust_results <- performHierarchicalClustering( enrichment_matrix )
## Call:
## hclust(d = as.dist(1 - abs(cor(t(final_matrix)))), method = clustering_method)
## Cluster method   : average 
## Number of objects: 156

This function uses the hclust function and returns an object of class hclust containing the hierarchical clustering tree.

By default, this function uses one minus the absolute value of the correlation to measure distance. You can change this behavior using the distance_method argument. Specifying anything other than correlation will cause this function to use the dist function to measure distance. The method specified in the distance_method argument is the method that the dist function will use. See the method argument for the dist function for available options.

This function uses the hclust function to perform the hierarchical clustering, with a default agglomeration method of average. You can change this by specifying the clustering_method argument.

This example performs hierarchical clustering on the GO terms using Euclidean distance and complete agglomeration.