# Citation

Please cite the following article when using GOSemSim:

G Yu, F Li, Y Qin, X Bo, Y Wu, S Wang. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 2010, 26(7):976-978. doi: 10.1093/bioinformatics/btq064.

# Introduction

Functional similarity of gene products can be estimated by controlled biological vocabularies, such as Gene Ontology (GO) and Disease Ontology (DO). GO comprises three orthogonal ontologies: molecular function (MF), biological process (BP), and cellular component (CC).

Four methods, proposed by Resnik [1], Jiang [2], Lin [3] and Schlicker [4], determine the semantic similarity of two GO terms based on the annotation statistics of their common ancestor terms. Wang [5] proposed a method that measures similarity based on the graph structure of GO. Each of these methods has its own advantages and weaknesses.

The GOSemSim package [6] was developed to compute semantic similarity among GO terms, sets of GO terms, gene products, and gene clusters, providing the five methods mentioned above. We have developed another package, DOSE [7], for measuring semantic similarity among Disease Ontology (DO) terms and gene products from a disease perspective.

# Semantic Similarity Measurement Based on GO

## Information content-based methods

The four methods proposed by Resnik [1], Jiang [2], Lin [3] and Schlicker [4] are information content (IC) based: they depend on the frequencies of the two GO terms involved and on that of their closest common ancestor term in a specific corpus of GO annotations. The information content of a GO term is computed as the negative log probability of the term occurring in the GO corpus, so a rarely used term carries a greater amount of information.

The frequency of a term $$t$$ is defined as: $$p(t) = \frac{\sum_{t'} n_{t'}}{N}, \quad t' \in \left\{t, \; children\: of\: t \right\}$$

where $$n_{t'}$$ is the number of annotations to term $$t'$$, and $$N$$ is the total number of annotations in the GO corpus.

Thus the information content is defined as: $$IC(t) = -\log(p(t))$$
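
As a rough illustration of the two definitions above, the following base R sketch computes $$p(t)$$ and $$IC(t)$$ for a three-term toy ontology. The term names, the annotation counts, and the `children` map are invented for this example; it is not meant to reproduce how GOSemSim computes IC internally.

```r
# Toy corpus: number of annotations made directly to each term (hypothetical).
direct <- c(root = 2, A = 6, B = 2)

# Hypothetical parent -> children map; A and B are leaf terms.
children <- list(root = c("A", "B"), A = character(0), B = character(0))

# n_t: annotations to the term itself plus annotations to its children.
n <- sapply(names(direct), function(t) sum(direct[c(t, children[[t]])]))

N  <- sum(direct)   # total number of annotations in the corpus
p  <- n / N         # term frequency p(t)
ic <- -log(p)       # information content IC(t)

round(ic, 3)
```

The root term covers every annotation, so its IC is 0, while the rarely used term `B` receives the highest IC.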

As GO allows multiple parents for each concept, two terms can share parents by multiple paths. IC-based methods calculate the similarity of two GO terms based on the information content of their closest common ancestor term, which is also called the most informative common ancestor (MICA).

### Resnik method

The Resnik method is defined as: $$sim_{Resnik}(t_1,t_2) = IC(MICA)$$

### Lin method

The Lin method is defined as: $$sim_{Lin}(t_1,t_2) = \frac{2IC(MICA)}{IC(t_1)+IC(t_2)}$$

### Rel method

The Relevance method, which was proposed by Schlicker, combines Resnik’s and Lin’s methods and is defined as: $$sim_{Rel}(t_1,t_2) = \frac{2IC(MICA)(1-p(MICA))}{IC(t_1)+IC(t_2)}$$

### Jiang method

Jiang and Conrath’s method is defined as: $$sim_{Jiang}(t_1,t_2) = 1-\min(1, IC(t_1) + IC(t_2) - 2IC(MICA))$$
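
To compare the four IC-based measures side by side, the sketch below evaluates each formula for one pair of terms. The IC values are hypothetical and do not come from any real annotation corpus; $$p(MICA)$$ is recovered from $$IC(MICA)$$ through the definition $$IC(t) = -\log(p(t))$$. The code follows the formulas as written in this section and makes no claim about GOSemSim's internal implementation.

```r
# Hypothetical information content values (natural log scale).
ic_t1   <- 1.8   # IC of term t1
ic_t2   <- 2.0   # IC of term t2
ic_mica <- 1.5   # IC of their most informative common ancestor

# Invert IC = -log(p) to obtain the frequency of the MICA.
p_mica <- exp(-ic_mica)

sim_resnik <- ic_mica
sim_lin    <- 2 * ic_mica / (ic_t1 + ic_t2)
sim_rel    <- sim_lin * (1 - p_mica)
sim_jiang  <- 1 - min(1, ic_t1 + ic_t2 - 2 * ic_mica)

round(c(Resnik = sim_resnik, Lin = sim_lin, Rel = sim_rel, Jiang = sim_jiang), 3)
```

Under these definitions the Resnik score is unbounded above, whereas the Lin, Rel, and Jiang scores stay within [0, 1]; the Rel score is the Lin score scaled down by $$1-p(MICA)$$.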

## Graph-based method

Graph-based methods use the topology of the GO graph structure to compute semantic similarity. Formally, a GO term $$A$$ can be represented as $$DAG_{A}=(A,T_{A},E_{A})$$ where $$T_{A}$$ is the set of GO terms in $$DAG_{A}$$, including term $$A$$ and all of its ancestor terms in the GO graph, and $$E_{A}$$ is the set of edges connecting the GO terms in $$DAG_{A}$$.

### Wang method

To encode the semantics of a GO term in a measurable format that enables quantitative comparison, Wang [5] first defined the semantic value of term $$A$$ as the aggregate contribution of all terms in $$DAG_{A}$$ to the semantics of term $$A$$; terms closer to term $$A$$ in $$DAG_{A}$$ contribute more to its semantics. The contribution of a GO term $$t$$ to the semantics of GO term $$A$$ is defined as the S-value of GO term $$t$$ related to term $$A$$. For any term $$t$$ in $$DAG_{A}$$, its S-value related to term $$A$$, $$S_{A}(\textit{t})$$, is defined as:

$$\left\{\begin{array}{l} S_{A}(A)=1 \\ S_{A}(\textit{t})=\max\{w_{e} \times S_{A}(\textit{t}') | \textit{t}' \in children \: of(\textit{t}) \} \; if \: \textit{t} \ne A \end{array} \right.$$

where $$w_{e}$$ is the semantic contribution factor for edge $$e \in E_{A}$$ linking term $$t$$ with its child term $$t'$$. The contribution of term $$A$$ to its own semantics is defined as 1. After obtaining the S-values for all terms in $$DAG_{A}$$, the semantic value of GO term $$A$$, $$SV(A)$$, is calculated as:

$$SV(A)=\displaystyle\sum_{t \in T_{A}} S_{A}(t)$$

Thus given two GO terms A and B, the semantic similarity between these two terms is defined as:

$$sim_{Wang}(A, B) = \frac{\displaystyle\sum_{t \in T_{A} \cap T_{B}}{\left(S_{A}(t) + S_{B}(t)\right)}}{SV(A) + SV(B)}$$

where $$S_{A}(\textit{t})$$ is the S-value of GO term $$t$$ related to term $$A$$ and $$S_{B}(\textit{t})$$ is the S-value of GO term $$t$$ related to term $$B$$.

The method proposed by Wang [5] determines the semantic similarity of two GO terms based on both the locations of these terms in the GO graph and their relations with their ancestor terms.
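
The S-value recursion and the resulting similarity can be traced on a small hand-made graph. In the sketch below, the DAG, the term names, and the edge weights (0.8 and 0.6) are assumptions chosen purely for illustration, since this section does not fix the value of $$w_{e}$$; the helper names `s_values` and `sim_wang` are likewise hypothetical and are not part of GOSemSim.

```r
# Hypothetical toy DAG: for each term, its parents and the weight w_e of the
# edge linking the term to that parent.
parents <- list(
  A  = c(P1 = 0.8, P2 = 0.6),
  B  = c(P2 = 0.8),
  P1 = c(R = 0.8),
  P2 = c(R = 0.8),
  R  = numeric(0)
)

# S-values of every term in DAG_x relative to term x.
# Values are pushed from each term to its parents, keeping the maximum.
s_values <- function(x) {
  s <- setNames(1, x)              # S_x(x) = 1
  queue <- x
  while (length(queue) > 0) {
    child <- queue[1]
    queue <- queue[-1]
    for (p in names(parents[[child]])) {
      cand <- parents[[child]][[p]] * s[[child]]
      if (is.na(s[p]) || cand > s[[p]]) {
        s[p] <- cand               # better path found: update and revisit ancestors
        queue <- c(queue, p)
      }
    }
  }
  s
}

# Wang similarity: shared S-values over the sum of both semantic values.
sim_wang <- function(a, b) {
  sa <- s_values(a)
  sb <- s_values(b)
  common <- intersect(names(sa), names(sb))
  sum(sa[common] + sb[common]) / (sum(sa) + sum(sb))
}

sim_wang("A", "B")
```

Here the root `R` is reachable from `A` through two paths, and the recursion keeps the larger contribution (0.8 × 0.8 rather than 0.8 × 0.6), which is the role of the $$\max$$ in the definition.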

## Supported organisms

For IC-based methods, the information content of GO terms is species-specific. We need to calculate IC for all GO terms of a species before we measure semantic similarity. GOSemSim supports all organisms that have an OrgDb object available.

Bioconductor already provides OrgDb packages for about 20 species; see http://bioconductor.org/packages/release/BiocViews.html#___OrgDb.

We can also query OrgDb objects online via AnnotationHub. For example:

library(AnnotationHub)
hub <- AnnotationHub()
q <- query(hub, "Cricetulus")
id <- q$ah_id[length(q)]
Cgriseus <- hub[[id]]

If an organism is not supported by AnnotationHub, users can use AnnotationForge to build an OrgDb.

Once we have an OrgDb, we can build the annotation data needed by GOSemSim via the godata function.

library(GOSemSim)
hsGO <- godata('org.Hs.eg.db', ont="MF")

Users can set computeIC=FALSE if they only want to use Wang’s method.

## goSim and mgoSim function

In GOSemSim, we implemented all these IC-based and graph-based methods. The goSim function calculates semantic similarity between two GO terms, while the mgoSim function calculates semantic similarity between two sets of GO terms.

goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Jiang")
## [1] 0.171
goSim("GO:0004022", "GO:0005515", semData=hsGO, measure="Wang")
## [1] 0.158
go1 = c("GO:0004022","GO:0004024","GO:0004174")
go2 = c("GO:0009055","GO:0005515")
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine=NULL)
##            GO:0009055 GO:0005515
## GO:0004022      0.205      0.158
## GO:0004024      0.185      0.141
## GO:0004174      0.205      0.158
mgoSim(go1, go2, semData=hsGO, measure="Wang", combine="BMA")
## [1] 0.192

# Gene Semantic Similarity Measurement

On the basis of semantic similarity between GO terms, GOSemSim can also compute semantic similarity among sets of GO terms, gene products, and gene clusters.

Suppose we have gene $$g_1$$ annotated by the GO term set $$GO_{1}=\{go_{11},go_{12} \cdots go_{1m}\}$$ and $$g_2$$ annotated by $$GO_{2}=\{go_{21},go_{22} \cdots go_{2n}\}$$. GOSemSim implements four methods, called max, avg, rcmax, and BMA, to combine the semantic similarity scores of multiple GO terms. The similarities among gene products and gene clusters that are annotated by multiple GO terms are calculated by the same combine methods.

## Combine methods

### max

The max method calculates the maximum semantic similarity score over all pairs of GO terms between these two GO term sets.

$$sim_{max}(g_1, g_2) = \displaystyle\max_{1 \le i \le m, 1 \le j \le n} sim(go_{1i}, go_{2j})$$

### avg

The avg method calculates the average semantic similarity score over all pairs of GO terms.

$$sim_{avg}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m\sum_{j=1}^nsim(go_{1i}, go_{2j})}{m \times n}$$

### rcmax

Similarities between two sets of GO terms form a matrix; the rcmax method uses the maximum of RowScore and ColumnScore, where RowScore (or ColumnScore) is the average of the maximum similarity on each row (or column).

$$sim_{rcmax}(g_1, g_2) = \max(\frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n} sim(go_{1i}, go_{2j})}{m},\frac{\displaystyle\sum_{j=1}^n \max_{1 \le i \le m} sim(go_{1i},go_{2j})}{n})$$

### BMA

The BMA method uses the best-match average strategy: it calculates the average of all maximum similarities on each row and column, and is defined as:

$$sim_{BMA}(g_1, g_2) = \frac{\displaystyle\sum_{i=1}^m \max_{1 \le j \le n}sim(go_{1i}, go_{2j}) + \displaystyle\sum_{j=1}^n \max_{1 \le i \le m}sim(go_{1i}, go_{2j})} {m+n}$$

## geneSim and mgeneSim

In GOSemSim, we implemented geneSim to calculate semantic similarity between two gene products, and mgeneSim to calculate semantic similarity among multiple gene products.

geneSim("241", "251", semData=hsGO, measure="Wang", combine="BMA")
## $geneSim
## [1] 0.141
##
## $GO1
## [1] "GO:0005515" "GO:0047485" "GO:0050544"
##
## $GO2
## [1] "GO:0004035"
mgeneSim(genes=c("835", "5261","241", "994"),
         semData=hsGO, measure="Wang",verbose=FALSE)
##        835  5261   241   994
## 835  1.000 0.505 0.634 0.602
## 5261 0.505 1.000 0.485 0.408
## 241  0.634 0.485 1.000 0.557
## 994  0.602 0.408 0.557 1.000
mgeneSim(genes=c("835", "5261","241", "994"),
         semData=hsGO, measure="Rel",verbose=FALSE)
##        835  5261   241   994
## 835  0.924 0.308 0.252 0.481
## 5261 0.308 0.904 0.208 0.328
## 241  0.252 0.208 0.876 0.240
## 994  0.481 0.328 0.240 0.903
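
The four combine methods from the "Combine methods" section can be sketched in base R on a small similarity matrix. The matrix below reuses the term-by-term Wang scores printed by the `mgoSim(go1, go2, ..., combine=NULL)` call earlier in this document; the variable names are hypothetical, and the code mirrors the formulas rather than GOSemSim's own implementation.

```r
# Pairwise similarities between the terms in go1 (rows) and go2 (columns),
# copied from the mgoSim(..., combine=NULL) output shown earlier.
sim <- matrix(c(0.205, 0.158,
                0.185, 0.141,
                0.205, 0.158),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("GO:0004022", "GO:0004024", "GO:0004174"),
                              c("GO:0009055", "GO:0005515")))

row_max <- apply(sim, 1, max)   # best match for each term in go1
col_max <- apply(sim, 2, max)   # best match for each term in go2

combine_scores <- c(
  max   = max(sim),
  avg   = mean(sim),
  rcmax = max(mean(row_max), mean(col_max)),
  BMA   = (sum(row_max) + sum(col_max)) / (nrow(sim) + ncol(sim))
)

round(combine_scores, 3)
```

The BMA entry rounds to 0.192, which agrees with the `mgoSim(go1, go2, ..., combine="BMA")` result reported earlier.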

By default, the godata function uses ENTREZID as the keytype, so input IDs are expected to be ENTREZID. Users can use other ID types such as ENSEMBL, UNIPROT, REFSEQ, ACCNUM, and SYMBOL.

As an example, we use SYMBOL as the keytype and calculate semantic similarities among several genes using their gene symbols as input.

hsGO2 <- godata('org.Hs.eg.db', keytype = "SYMBOL", ont="MF", computeIC=FALSE)
genes <- c("CDC45", "MCM10", "CDC20", "NMU", "MMP1")
mgeneSim(genes, semData=hsGO2, measure="Wang", combine="BMA", verbose=FALSE)
##       CDC45 MCM10 CDC20   NMU  MMP1
## CDC45 1.000 0.823 0.600 0.570 0.150
## MCM10 0.823 1.000 0.718 0.702 0.119
## CDC20 0.600 0.718 1.000 0.889 0.136
## NMU   0.570 0.702 0.889 1.000 0.137
## MMP1  0.150 0.119 0.136 0.137 1.000

Users can also use clusterProfiler::bitr to translate biological IDs.

## clusterSim and mclusterSim

We also implemented clusterSim for calculating semantic similarity between two gene clusters and mclusterSim for calculating semantic similarities among multiple gene clusters.

gs1 <- c("835", "5261","241", "994", "514", "533")
gs2 <- c("578","582", "400", "409", "411")
clusterSim(gs1, gs2, semData=hsGO, measure="Wang", combine="BMA")
## [1] 0.668
library(org.Hs.eg.db)
x <- org.Hs.egGO
hsEG <- mappedkeys(x)
set.seed(123)
clusters <- list(a=sample(hsEG, 20), b=sample(hsEG, 20), c=sample(hsEG, 20))
mclusterSim(clusters, semData=hsGO, measure="Wang", combine="BMA")
##       a     b     c
## a 1.000 0.689 0.699
## b 0.689 1.000 0.703
## c 0.699 0.703 1.000

# Applications

GOSemSim has been cited by more than 200 papers and has been applied in many research domains.

Find out more at https://guangchuangyu.github.io/GOSemSim/featuredArticles/.

# GO enrichment analysis

GO enrichment analysis is supported by our package clusterProfiler [8], which provides the hypergeometric test and Gene Set Enrichment Analysis (GSEA). Enrichment results across different gene clusters can be compared using the compareCluster function.

# Disease Ontology Semantic and Enrichment analysis

Disease Ontology (DO) annotates human genes in the context of disease and is an important annotation for translating molecular findings from high-throughput data into clinical relevance. DOSE [7] supports semantic similarity computation among DO terms and genes. Enrichment analyses, including the hypergeometric model and GSEA, are also implemented to support the discovery of disease associations in high-throughput biological data.

# MeSH enrichment and semantic analyses

MeSH (Medical Subject Headings) is the NLM controlled vocabulary used to manually index articles for MEDLINE/PubMed. meshes supports enrichment (hypergeometric test and GSEA) and semantic similarity analyses for more than 70 species.

# Session Information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
## [1] GOSemSim_2.4.1       GO.db_3.5.0          org.Hs.eg.db_3.5.0
## [4] AnnotationDbi_1.40.0 IRanges_2.12.0       S4Vectors_0.16.0
## [7] Biobase_2.38.0       BiocGenerics_0.24.0
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15    knitr_1.19      magrittr_1.5    bit_1.1-12
##  [5] rlang_0.1.6     stringr_1.2.0   blob_1.1.0      tools_3.4.3
##  [9] DBI_0.7         htmltools_0.3.6 yaml_2.1.16     bit64_0.9-7
## [13] rprojroot_1.3-2 digest_0.6.15   tibble_1.4.2    prettydoc_0.2.1
## [17] memoise_1.1.0   evaluate_0.10.1 RSQLite_2.0     rmarkdown_1.8
## [21] stringi_1.1.6   pillar_1.1.0    compiler_3.4.3  backports_1.1.2
## [25] pkgconfig_2.0.1

# References

1. Resnik, P. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research 11, 95–130 (1999).

2. Jiang, J. J. & Conrath, D. W. Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of 10th International Conference on Research In Computational Linguistics (1997). at <http://www.citebase.org/abstract?id=oai:arXiv.org:cmp-lg/9709008>

3. Lin, D. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning 296–304 (1998). doi:10.1.1.55.1832

4. Schlicker, A., Domingues, F. S., Rahnenführer, J. & Lengauer, T. A new measure for functional similarity of gene products based on gene ontology. BMC Bioinformatics 7, 302 (2006).

5. Wang, J. Z., Du, Z., Payattakool, R., Yu, P. S. & Chen, C.-F. A new method to measure the semantic similarity of GO terms. Bioinformatics (Oxford, England) 23, 1274–81 (2007).

6. Yu, G. et al. GOSemSim: An R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978 (2010).

7. Yu, G., Wang, L.-G., Yan, G.-R. & He, Q.-Y. DOSE: An R/Bioconductor package for disease ontology semantic and enrichment analysis. Bioinformatics 31, 608–609 (2015).

8. Yu, G., Wang, L.-G., Han, Y. & He, Q.-Y. clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16, 284–287 (2012).