Contents

1 Introduction to OnASSis

Public repositories of biological omics data contain thousands of experiments. While these resources are extrenely useful, those data are difficult to mine. The annotation of the associated metadata with controlled vocabularies or ontology terms can facilitate the retrieval of the datasets of interest (Galeota and Pelizzola 2016). OnASSiS (Ontology Annotations and Semantic Similarity software) is a package aimed at matching metadata associated with biological experiments with concepts from ontologies, allowing the construction of semantically structured omics datasets, possibly representing various data types from independent studies. The recognition of entities specific for a domain allows the retrieval of samples related to a given cell type or experimental condition, but also allows unravelling previously unanticipated relationships between experiments. Onassis applies Natural Language Processing tools to annotate sample’s and experiments’ descriptions, recognizing concepts from a multitude of biomedical ontologies and quantifying the similarities between pairs or groups of query studies. Moreover, it assists the semantically-driven analysis of the corresponding omics data. In particular the software includes modules to enable:

Onassis relies on Conceptmapper, an Apache UIMA (Unstructured Information Management Architecture) dictionary lookup tool to retrieve dictionary terms in a given text. https://uima.apache.org/downloads/sandbox/ConceptMapperAnnotatorUserGuide/ConceptMapperAnnotatorUserGuide.html
In particular, the ccp-nlp Conceptmapper wrapper, specific for the biomedical domain, implements a pipeline through which it is possible to retrieve concepts from OBO ontologies in any given text with different adjustable options (Verspoor et al. 2009).

Onassis features can be easily accessed through a main class named Onassis, having as slots ‘dictionary’, ‘entities’, ‘similarity’ and ‘scores’. In the following sections we first show details on the usage of classes and methods constituting the building blocks of a semantically-driven integrative analysis workflow. Next, in Section 6 we show how the Onassis class wraps all these functions for a simplified access and usage. Regarding the input data, Onassis can handle any type of text, but is particularly well suited for the analysis of the metadata from Gene Expression Omnibus (GEO). Indeed, it allows associating concepts from any OBO ontology to GEO metadata retrieved using GEOmetadb. In general, any table or database (such as Sequence Read Archive (SRA) (Zhu et al. 2013) or Cistrome (Mei et al. 2017)) containing textual descriptions that can be easily imported in R as a data frame can be used as input for Onassis. Regarding the dictionary module, gene/protein symbols or epigenetic modifications can also be recognized in the text, in addition to ontology concepts. This can be particularly important, especially when dealing with experiments directed to specific factors or marks (such as ChIP-seq experiments). The similarity module uses different semantic similarity measures to determine the semantic similarity of concepts in a given ontology. This module has been developed on the basis of the Java slib http://www.semantic-measures-library.org/sml. The score module applies statistical tests to determine if omics data from samples annotated with different concepts, belonging to one or more ontologies, are significantly different.

2 Installation

To run Onassis Java (>= 1.8) is needed. To install the package please run the following code

source("https://bioconductor.org/biocLite.R")
biocLite("Onassis")

Onassis can be loaded with the following code

library(Onassis)

Some of the optional functions, which will be described in the following parts of the vignette, require additional libraries. These include:

3 Retrieving metadata from public repositories

One of the most straightforward ways to retrieve metadata of samples provided in GEO is through the GEOmetadb package. In order to use GEOmetadb through Onassis, the corresponding SQLite database should be available. This can be downloaded by Onassis (see below), and this step has to be performed only once. As described below, Onassis provides functions to facilitate the retrieval of specific GEO metadata without the need of explicitly making SQL queries to the database. Additionally, an example on how to access SRAdb metadata, is also provided.

3.1 Handling GEO (Gene Expression Omnibus) metadata

First, it is necessary to obtain a connection to the GEOmetadb SQLite database. If this were already downloaded, connectToGEODB returns a connection to the database given the full path to the SQLite database file. Alternatively, by setting download to TRUE the database is downloaded. The getGEOmetadata function can be used to retrieve the metadata related to specific GEO samples, taking as minimal parameters the connection to the database and one of the experiment types available. Optionally it is possible to specify the organism and the platform. The following code illustrates how to download the metadata corresponding to expression arrays, or DNA methylation sequencing experiments. The meth_metadata object, containing the results for the latter, was stored within Onassis. Therefore, the queries illustrated here can be skipped.

require('GEOmetadb')

## Running this function might take some time if the database (6.8GB) has to be downloaded.
geo_con <- connectToGEODB(download=TRUE)

#Showing the experiment types available in GEO
experiments <- experiment_types(geo_con)

#Showing the organism types available in GEO
species <- organism_types(geo_con)

#Retrieving Human gene expression metadata, knowing the GEO platform identifier, e.g. the Affymetrix Human Genome U133 Plus 2.0 Array
expression <- getGEOMetadata(geo_con, experiment_type='Expression profiling by array', gpl='GPL570')

#Retrieving the metadata associated to experiment type "Methylation profiling by high througput sequencing"
meth_metadata <- getGEOMetadata(geo_con, experiment_type='Methylation profiling by high throughput sequencing', organism = 'Homo sapiens')

Some of the experiment types available are the following:

experiments
Expression profiling by MPSS
Expression profiling by RT-PCR
Expression profiling by SAGE
Expression profiling by SNP array
Expression profiling by array
Expression profiling by genome tiling array
Expression profiling by high throughput sequencing
Genome binding/occupancy profiling by SNP array
Genome binding/occupancy profiling by array
Genome binding/occupancy profiling by genome tiling array

Some of the organisms available are the following:

species
Homo sapiens
Drosophila melanogaster
Mus musculus
Zea mays
Arabidopsis thaliana
Caenorhabditis elegans
Helicobacter pylori
Escherichia coli
Rattus norvegicus
Saccharomyces cerevisiae

As specified above, meth_metadata was previously saved and can be loaded from the Onassis package external data (hover on the table to view additional rows and columns):

meth_metadata <- readRDS(system.file('extdata', 'vignette_data', 'GEOmethylation.rds', package='Onassis'))
Table 1: GEOmetadb metadata for Methylation profiling by high throughput sequencing (only the first 10 entries are shown).
series_id gsm title gpl source_name_ch1 organism_ch1 characteristics_ch1 description experiment_title experiment_summary
1251 GSE42590 GSM1045538 2316_DLPFC_Control GPL10999 Brain (dorsolateral prefrontal cortex) Homo sapiens tissue: Heterogeneous brain tissue NA Genome-wide DNA methylation profiling of human dorsolateral prefrontal cortex Reduced representation bisulfite sequencing (RRBS)
511 GSE27432 GSM678217 hEB16d_H9_p65_RRBS GPL9115 embryoid body from hES H9 p65 Homo sapiens cell type: hEB16d_H9_p65 reduced representation bisulfite sequencing Genomic distribution and inter-sample variation of non-CG methylation across human cell types DNA methylation plays an important role in develop
2731 GSE58889 GSM1421876 Normal_CD19_11 GPL11154 Normal CD19+ cells Homo sapiens cell type: Normal CD19+ cells; disease status: healthy NA Methylation disorder in CLL We performed RRBS and WGBS on primary human chroni
1984 GSE50761 GSM1228607 Time Course Off-target Day 7 1 HBB133 GPL15520 K562 cells Homo sapiens cell line: K562 cells; target loci: Time Course Off-target Day 7 1 2013.03.16._MM364_analysis.csv Targeted DNA demethylation using TALE-TET1 fusion proteins Recent large-scale studies have defined genomewide
851 GSE36173 GSM882245 H1 human ES cells GPL10999 H1 human ES cells Homo sapiens cell line: H1 5-hmC whole genome bisulfite sequencing Base Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome The study of 5-hydroxylmethylcytosines (5hmC), the
1966 GSE50761 GSM1228589 Time Course HB-6 Day 4 1 HBB115 GPL15520 K562 cells Homo sapiens cell line: K562 cells; target loci: Time Course HB-6 Day 4 1 2013.03.16._MM364_analysis.csv Targeted DNA demethylation using TALE-TET1 fusion proteins Recent large-scale studies have defined genomewide
1827 GSE50761 GSM1228450 Off target -650 to -850 3 RHOX117 GPL15520 293 cells Homo sapiens cell line: 293 cells; target loci: Off target -650 to -850 3 2013-07-23-MM195-288-394_analysis.csv Targeted DNA demethylation using TALE-TET1 fusion proteins Recent large-scale studies have defined genomewide
378 GSE26592 GSM655200 Endometrial Recurrent 5 GPL9052 Human endometrial specimen Homo sapiens tissue: Human endometrial specimen; cell type: primary tissues; disease status: Recurrent; chromatin selection: MBD protein MBDCap using MethylMiner Methylated DNA Enrichment Kit (Invitrogen, ME 10025); library strategy: Endometrial samples: MBDCao-seq. Breast cells: MBDCap-seq.; library selection: Endometrial samples: MBDCap. Breast cells: MBDCap-seq. Neighboring genomic regions influence differential methylation patterns of CpG islands in endometrial and breast cancers We report the global methylation patterns by MBDCa
1754 GSE50761 GSM1228377 Initial Screen RH-3 -250-+1 2 RHOX44 GPL15520 HeLa cells Homo sapiens cell line: HeLa cells; target loci: Initial Screen RH-3 -250-+1 2 2013-07-12-MM564_analysis.csv Targeted DNA demethylation using TALE-TET1 fusion proteins Recent large-scale studies have defined genomewide
2371 GSE54961 GSM1327281 Healthy Control GPL9052 Healthy Control Homo sapiens etiology: Healthy Control; tissue: Peripheral venous blood; molecule subtype: serum cell-free DNA Sample 1 Epigenome analysis of serum cell-free circulating DNA in progression of HBV-related Hepatocellular carcinoma Purpose: Aberrantly methylated DNA are hallmarks

3.2 Handling SRA (Sequence Read Archive) metadata

In this section we provide an example showing how it is possible to retrieve metadata from other sources such as SRA. This database is not directly supported by Onassis, since it is not available for Windows platforms. Hence, the code reported below is slightly more complicated, and exemplifies how to query the SRA database provided by the SRAdb package and store metadata of human ChIP-seq experiments within a data frame. Due to the size of the SRA database (2.4 GB for the compressed file, 36 GB for the .sqlite file), a sample of the results of the query is available within Onassis as external data (see below), and the example code illustrated here can be skipped.

# Optional download of SRAdb and connection to the corresponding sqlite file
require(SRAdb)
sqliteFileName <- '/pathto/SRAdb.sqlite'
sra_con <- dbConnect(SQLite(), sqliteFileName)

# Query for the ChIP-Seq experiments contained in GEO for human samples 
library_strategy <- 'ChIP-Seq' #ChIP-Seq data
library_source='GENOMIC' 
taxon_id=9606 #Human samples
center_name='GEO' #Data from GEO
 
# Query to the sample table 
samples_query <- paste0("select sample_accession, description, sample_attribute, sample_url_link from sample where taxon_id='", taxon_id, "' and sample_accession IS NOT NULL", " and center_name='", center_name, "'"  )

samples_df <- dbGetQuery(sra_con, samples_query)
samples <- unique(as.character(as.vector(samples_df[, 1])))

experiment_query <- paste0("select experiment_accession, center_name, title, sample_accession, sample_name, experiment_alias, library_strategy, library_layout, experiment_url_link, experiment_attribute from experiment where library_strategy='", library_strategy, "'" , " and library_source ='", library_source,"' ", " and center_name='", center_name, "'" )
experiment_df <- dbGetQuery(sra_con, experiment_query)

#Merging the columns from the sample and the experiment table
experiment_df <- merge(experiment_df, samples_df, by = "sample_accession")

# Replacing the '_' character with white spaces
experiment_df$sample_name <- sapply(experiment_df$sample_name, function(value) {gsub("_", " ", value)})
experiment_df$experiment_alias <- sapply(experiment_df$experiment_alias, function(value) {gsub("_", " ", value)})

sra_chip_seq <- experiment_df

The query returns a table with thousands of samples. Alternatively, as described above, a sample of this table useful for the subsequent examples can be retrieved in Onassis:

sra_chip_seq <- readRDS(system.file('extdata', 'vignette_data', 'GEO_human_chip.rds',  package='Onassis'))
Table 2: Metadata of ChIP-seq human samples obtained from SRAdb (first 10 entries)
sample_accession experiment_accession center_name title library_strategy library_layout experiment_url_link experiment_attribute description sample_attribute sample_url_link
5904 SRS421364 SRX278504 GEO GSM1142700: p53 ChIP LCL nutlin-3 treated; Homo sapiens; ChIP-Seq ChIP-Seq SINGLE - GEO Sample: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1142700 GEO Accession: GSM1142700 NA source_name: lymphoblastoid cells || cell type: nutlin-3 treated lymphoblastoid cells || coriell id: GM12878 || chip antibody: mouse monoclonal anti-human p53 (BD Pharmingen, cat# 554294) || BioSampleModel: Generic GEO Sample GSM1142700: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1142700
4981 SRS371783 SRX199902 GEO GSM1022674: UW_ChipSeq_A549_InputRep1 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1022674 GEO Accession: GSM1022674 NA source_name: A549 || biomaterial_provider: ATCC || lab: UW || lab description: Stamatoyannopoulous - University of Washington || datatype: ChipSeq || datatype description: Chromatin IP Sequencing || cell: A549 || cell organism: human || cell description: epithelial cell line derived from a lung carcinoma tissue. (PMID: 175022), “This line was initiated in 1972 by D.J. Giard, et al. through explant culture of lung carcinomatous tissue from a 58-year-old caucasian male.” - ATCC, newly promoted to tier 2: not in 2011 analysis || cell karyotype: cancer || cell lineage: endoderm || cell sex: M || antibody: Input || antibody description: Control signal which may be subtracted from experimental raw signal before peaks are called. || treatment: None || treatment description: No special treatment or protocol applies || control: std || control description: Standard input signal for most experiments. || controlid: wgEncodeEH001904 || labexpid: DS18301 || labversion: WindowDensity-bin20-win+/-75 || replicate: 1 || BioSampleModel: Generic GEO Sample GSM1022674: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1022674
4619 SRS365824 SRX190055 GEO GSM945272: UW_ChipSeq_HRPEpiC_Input ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM945272 GEO Accession: GSM945272 NA source_name: HRPEpiC || biomaterial_provider: ScienCell || lab: UW || lab description: Stamatoyannopoulous - University of Washington || datatype: ChipSeq || datatype description: Chromatin IP Sequencing || cell: HRPEpiC || cell organism: human || cell description: retinal pigment epithelial cells || cell karyotype: normal || cell lineage: ectoderm || cell sex: U || antibody: Input || antibody description: Control signal which may be subtracted from experimental raw signal before peaks are called. || treatment: None || treatment description: No special treatment or protocol applies || control: std || control description: Standard input signal for most experiments. || controlid: wgEncodeEH000962 || labexpid: DS16014 || labversion: Bowtie 0.12.7 || replicate: 1 || BioSampleModel: Generic GEO Sample GSM945272: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM945272
911 SRS117344 SRX028649 GEO GSM608166: H3K27me3_K562_ChIP-seq_rep1 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM608166 GEO Accession: GSM608166 NA source_name: chronic myeloid leukemia cell line || cell line: K562 || harvest date: 2008-06-12 || chip antibody: CST monoclonal rabbit rabbit anti-H3K27me3 || BioSampleModel: Generic GEO Sample GSM608166: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM608166
4244 SRS362733 SRX186665 GEO GSM1003469: Broad_ChipSeq_Dnd41_H3K79me2 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1003469 GEO Accession: GSM1003469 NA source_name: Dnd41 || biomaterial_provider: DSMZ || datatype: ChipSeq || datatype description: Chromatin IP Sequencing || antibody antibodydescription: Rabbit polyclonal antibody raised against a peptide containing K79 di-methylation. Antibody Target: H3K79me2 || antibody targetdescription: H3K79me2 is a mark of the transcriptional transition region - the region between the initiation marks (K4me3, etc) and the elongation marks (K36me3). || antibody vendorname: Active Motif || antibody vendorid: 39143 || controlid: wgEncodeEH002434 || replicate: 1,2 || softwareversion: ScriptureVPaperR3 || cell sex: M || antibody: H3K79me2 || antibody antibodydescription: Rabbit polyclonal antibody raised against a peptide containing K79 di-methylation. Antibody Target: H3K79me2 || antibody targetdescription: H3K79me2 is a mark of the transcriptional transition region - the region between the initiation marks (K4me3, etc) and the elongation marks (K36me3). || antibody vendorname: Active Motif || antibody vendorid: 39143 || treatment: None || treatment description: No special treatment or protocol applies || control: std || control description: Standard input signal for most experiments. || controlid: Dnd41/Input/std || softwareversion: ScriptureVPaperR3 || BioSampleModel: Generic GEO Sample GSM1003469: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1003469
7502 SRS494656 SRX369112 GEO GSM1252315: CHG092; Homo sapiens; ChIP-Seq ChIP-Seq SINGLE - GEO Sample GSM1252315: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1252315 GEO Accession: GSM1252315 NA source_name: Gastric Primary Sample || tissuetype: Tumor || chip antibody: H3K4me1 || reads length: 101 || BioSampleModel: Generic GEO Sample GSM1252315: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1252315
2127 SRS266173 SRX099863 GEO GSM808752: MCF7_CTCF_REP1 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM808752 GEO Accession: GSM808752: NA source_name: breast adenocarcinoma cells || cell type: breast adenocarcinoma cells || cell line: MCF7 || antibody: CTCF || BioSampleModel: Generic GEO Sample GSM808752: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM808752
6299 SRS468164 SRX332680 GEO GSM1204476: Input DNA for ChIP; Homo sapiens; ChIP-Seq ChIP-Seq SINGLE - GEO Sample GSM1204476: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1204476 GEO Accession: GSM1204476 NA source_name: MDAMB231 || cell line: MDAMB231 || chip antibody: input || BioSampleModel: Generic GEO Sample GSM1204476: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1204476
832 SRS115184 SRX027300 GEO GSM593367: H3K4me3_H3 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM593367 GEO Accession: GSM593367 NA source_name: LCL || chip antibody: H3K4me3 || cell type: lymphoblastoid cell line || BioSampleModel: Generic GEO Sample GSM593367: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM593367
8638 SRS598154 SRX528309 GEO GSM1375207: H3_ChIPSeq_Human; Homo sapiens; ChIP-Seq ChIP-Seq SINGLE - GEO Sample GSM1375207: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1375207 GEO Accession: GSM1375207 NA source_name: H3_ChIPSeq_Human || donor age: adult || cell type: sperm || chip antibody: H3F3B || chip antibody vendor: Abnova || BioSampleModel: Generic GEO Sample GSM1375207: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1375207

4 Annotating text with ontology concepts

The Onassis EntityFinder class has methods for annotating any text with dictionary terms. More specifically, Onassis can take advantage of the OBO dictionaries (http://www.obofoundry.org/).

4.1 Data preparation

The findEntities method supports input text in the form of:

  • The path of a directory containing named documents.
    • The path of a single file containing multiple documents. In this case each row contains the name/identifier of the document followed by a ‘|’ separator and the text to annotate.

Alternatively, the annotateDF method supports input text in the form of a data frame. In this case each row represents a document; the first column has to be the document identifier; the remaining columns will be combined and contain the text to analyze. This option can be conveniently used with the metadata retrieved from GEOmetadb and SRAdb, possibly selecting a subset of the available columns.

4.2 Creation of a Conceptmapper Dictionary

Onassis handles the convertion of OBO dictionaries into a format suitable to Conceptmapper: XML files with a set of entries specified by the xml tag <token> with a canonical name (the name of the entry) and one or more variants (synonyms).

The constructor CMdictionary creates an instance of the class CMdictionary.

  • If an XML file containing the Conceptmapper dictionary is already available, it can be uploaded into Onassis indicating its path and setting the dictType option to “CMDICT”.
    • If the dictionary has to be built from an OBO ontology (OBO or OWL formats are supported), the path or URL to the corresponding file has to be provided and dictType has to be set to “OBO”. The synonymType argument can be set to EXACT_ONLY or ALL to consider only canonical concept names or also to include any synonym. The resulting XML file is written in the indicated outputdir.
    • Additionally, to facilitate the named entity recognition of specific targets, such in the case of ChIP-seq experiments, these can be included within a specific dictionary, and dictType has to be set to ENTREZ. If a specific Org.xx.eg.db Bioconductor library is installed and loaded, it can be indicated in the inputFileOrDb parameter as a character string, and gene names will be derived from it. Instead, if inputFileOrDb is empty and a specific species is indicated in the taxID parameter, gene names will be derived from the corresponding gene_info.gz file downloaded from NCBI (300MB). Finally, if dictType is set to TARGET, known histone post-translational modifications and epigenetic marks are also included, in addition to gene names.
# If a Conceptmapper dictionary is already available the dictType CMDICT can be specified and the corresponding file loaded
sample_dict <- CMdictionary(inputFileOrDb=system.file('extdata', 'cmDict-sample.cs.xml', package = 'Onassis'), dictType = 'CMDICT')

#Creation of a dictionary from the file sample.cs.obo available in OnassisJavaLibs
obo <- system.file('extdata', 'sample.cs.obo', package='OnassisJavaLibs')

sample_dict <- CMdictionary(inputFileOrDb=obo, outputDir=getwd(), synonymType='ALL')

# Creation of a dictionary for human genes/proteins. This requires org.Hs.eg.db to be installed
require(org.Hs.eg.db)
targets <- CMdictionary(dictType='TARGET', inputFileOrDb = 'org.Hs.eg.db', synonymType='EXACT')

The following XML markup code illustrates a sample of the Conceptmapper dictionary corresponding to the Brenda tissue ontology.

   <?xml version="1.0" encoding="UTF-8" ?>
   <synonym>
      <token id="http://purl.obolibrary.org/obo/BTO_0005205" canonical="cerebral artery">
        <variant base="cerebral artery"/>
      </token>
      <token id="http://purl.obolibrary.org/obo/BTO_0002179" canonical="184A1N4 cell">
        <variant base="184A1N4 cell"/>
        <variant base="A1N4 cell"/>
      </token>
      <token id="http://purl.obolibrary.org/obo/BTO_0003871" canonical="uterine endometrial cancer cell">
        <variant base="uterine endometrial cancer cell"/>
        <variant base="endometrial cancer cell"/>
        <variant base="uterine endometrial carcinoma cell"/>
        <variant base="endometrial carcinoma cell"/>
      </token>
  </synonym>

4.3 Setting the options for the annotator

Conceptmapper includes 7 different options controlling the annotation step. These are documented in detail in the documentation of the CMoptions function. They can be listed through the listCMOptions function. The CMoptions constructor instantiates an object of class CMoptions with the different parameters that will be required for the subsequent step of annotation. We also provided getter and setter methods for each of the 7 parameters.

#Creating a CMoptions object and showing hte default parameters 
opts <- CMoptions()  
show(opts)
## CMoptions object to set ConceptMapper Options
## SearchStrategy: CONTIGUOUS_MATCH
## CaseMatch: CASE_INSENSITIVE
## Stemmer: NONE
## StopWords: NONE
## OrderIndependentLookup: ON
## FindAllMatches: YES
## SynonymType: ALL

To obtain the list of all the possible combinations:

combinations <- listCMOptions()

To create a CMoptions object having has SynonymType ‘EXACT_ONLY’, that considers only exact synonyms, rather than ‘ALL’ other types included in OBO (RELATED, NARROW, BROAD)

myopts <- CMoptions(SynonymType='EXACT_ONLY')
myopts
## CMoptions object to set ConceptMapper Options
## SearchStrategy: CONTIGUOUS_MATCH
## CaseMatch: CASE_INSENSITIVE
## Stemmer: NONE
## StopWords: NONE
## OrderIndependentLookup: ON
## FindAllMatches: YES
## SynonymType: EXACT_ONLY

To change a given parameter, for example to use a search strategy based on the Longest match of not-necessarily contiguous tokens where overlapping matches are allowed:

#Changing the SearchStrategy parameter
SearchStrategy(myopts) <- 'SKIP_ANY_MATCH_ALLOW_OVERLAP'
myopts
## CMoptions object to set ConceptMapper Options
## SearchStrategy: SKIP_ANY_MATCH_ALLOW_OVERLAP
## CaseMatch: CASE_INSENSITIVE
## Stemmer: NONE
## StopWords: NONE
## OrderIndependentLookup: ON
## FindAllMatches: YES
## SynonymType: EXACT_ONLY

4.4 Running the entity finder

The class EntityFinder defines a type system and runs the Conceptmapper pipeline. It can search for concepts of any OBO ontology in a given text. The findEntities and annotateDF methods accept text within files or data.frame, respectively, as described in Section 4.1. The function EntityFinder automatically adapts to the provided input type, creates an instance of the EntityFinder class to initialize the type system and runs Conceptmapper with the provided options and dictionary. For example, to annotate the metadata derived from ChIP-seq experiments obtained from SRA with tissue and cell type concepts belonging to the sample ontology available in Onassis and containing tissues and cell names, the following code can be used:

sra_chip_seq <- readRDS(system.file('extdata', 'vignette_data', 'GEO_human_chip.rds',  package='Onassis'))
chipseq_dict_annot <- EntityFinder(sra_chip_seq[1:50, c('experiment_accession', 'title', 'experiment_attribute', 'sample_attribute', 'description')], dictionary=sample_dict, options=myopts)

The resulting data.frame contains, for each row, a match to the provided dictionary for the document/sample indicated in the first column. The annotation is reported with the id of the concept (term_id), its canonical name (term name), its URL in the obo format, and the matching sentence of the document.

Table 3: Annotating the metadata of DNA methylation sequencing experiments with a dictionary including CL (Cell line) and UBERON ontologies
sample_id term_id term_name term_url matched_sentence
SRX027300 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRX028649 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRX033328 CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000084 T lymphoma cells with 10 ug/ml DRB treatment, pol II ChIP || cell
SRX033328 CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000084 cell line: Jurkat || cell type: T
SRX033328 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRX033328 CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000084 cell type: T
SRX047080 UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/UBERON_0002367 Prostate
SRX047080 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRX080398 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell: HCPEpiC || cell organism: Human || cell description: Human Choroid Plexus Epithelial
SRX080398 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRX080398 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell organism: Human || cell description: Human Choroid Plexus Epithelial
SRX080398 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell description: Human Choroid Plexus Epithelial
SRX080398 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 Epithelial Cells || cell
SRX084599 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRX092417 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B cells || cell
SRX092417 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRX092417 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 Cell
SRX096365 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRX099863 UBERON_0000310 breast http://purl.obolibrary.org/obo/UBERON_0000310 breast
SRX099863 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell

The function filterTerms can be used to remove all the occurrences of unwanted terms, for example very generic terms.

chipseq_dict_annot <- filterTerms(chipseq_dict_annot, c('cell', 'tissue'))
(#tab:showchipresults_filtered)Filtered Annotations
sample_id term_id term_name term_url matched_sentence
3 SRX033328 CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000084 T lymphoma cells with 10 ug/ml DRB treatment, pol II ChIP || cell
4 SRX033328 CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000084 cell line: Jurkat || cell type: T
6 SRX033328 CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000084 cell type: T
7 SRX047080 UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/UBERON_0002367 Prostate
9 SRX080398 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell: HCPEpiC || cell organism: Human || cell description: Human Choroid Plexus Epithelial
11 SRX080398 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell organism: Human || cell description: Human Choroid Plexus Epithelial
12 SRX080398 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell description: Human Choroid Plexus Epithelial
13 SRX080398 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 Epithelial Cells || cell
15 SRX092417 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B cells || cell
19 SRX099863 UBERON_0000310 breast http://purl.obolibrary.org/obo/UBERON_0000310 breast
21 SRX101132 UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/UBERON_0002367 prostate
28 SRX129103 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-cell
30 SRX150687 CL_0000222 mesodermal cell http://purl.obolibrary.org/obo/CL_0000222 cell: GM12878 || cell organism: human || cell description: B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Virus || cell karyotype: normal || cell lineage: mesoderm
31 SRX150687 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 cell: GM12878 || cell organism: human || cell description: B
33 SRX150687 CL_0000222 mesodermal cell http://purl.obolibrary.org/obo/CL_0000222 cell organism: human || cell description: B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Virus || cell karyotype: normal || cell lineage: mesoderm
34 SRX150687 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 cell organism: human || cell description: B
35 SRX150687 CL_0000222 mesodermal cell http://purl.obolibrary.org/obo/CL_0000222 cell description: B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Virus || cell karyotype: normal || cell lineage: mesoderm
36 SRX150687 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 cell description: B
37 SRX150687 CL_0000945 lymphocyte of B lineage http://purl.obolibrary.org/obo/CL_0000945 B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Virus || cell karyotype: normal || cell lineage: mesoderm || cell sex: F || treatment: None || treatment description: No special treatment or protocol applies || antibody: Pol2(phosphoS2) || antibody antibodydescription: Rabbit polyclonal against peptide conjugated to KLH derived from within residues 1600 - 1700 of
38 SRX150687 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Virus || cell

The function EntityFinder can also be used to identify the targeted entity of each ChIP-seq experiment, by retrieving gene names and epigenetic marks in the ChIP-seq metadata.

#Finding the TARGET entities
target_entities <- EntityFinder(input=sra_chip_seq[1:50, c('experiment_accession', 'title', 'experiment_attribute', 'sample_attribute', 'description')], options = myopts, dictionary=targets)
Table 4: Annotations of ChIP-seq test metadata obtained from SRAdb and stored into files with the TARGETs (genes and histone variants)
sample_id term_id term_name term_url matched_sentence
SRX027300 H3K4me3 H3K4me3 NA H3K4me3
SRX028649 H3K27me3 H3K27me3 NA H3K27me3
SRX080398 10664 CTCF NA CTCF
SRX084599 604 BCL6 NA BCL6
SRX096365 H3K4me2 H3K4me2 NA H3K4me2
SRX099863 10664 CTCF NA CTCF
SRX109450 H3K27me3 H3K27me3 NA H3K27me3
SRX113180 H3K4me2 H3K4me2 NA H3K4me2
SRX114958 23133 PHF8 NA PHF8
SRX114963 23512 SUZ12 NA SUZ12
SRX116426 10013 HDAC6 NA HDAC6
SRX150687 1283 CTD NA CTD
SRX155719 3297 HSF1 NA HSF1
SRX185917 2305 FOXM1 NA FOXM1
SRX186621 7975 MAFK NA MAFK
SRX186621 4778 NFE2 NA NFE2
SRX186665 H3K79me2 H3K79me2 NA H3K79me2
SRX186733 929 CD14 NA CD14
SRX186733 H3K79me2 H3K79me2 NA H3K79me2
SRX190202 6938 TCF12 NA TCF12

5 Semantic similarity

Once a set of samples is annotated, i.e. associated to a set of ontology concepts, Onassis allows the quantification of the similarity among these samples based on the semantic similarity between the corresponding concepts. Similarity is an Onassis class applying methods of the Java library slib (Harispe et al. 2014), which builds a semantic graph starting from OBO ontology concepts and their hierarchical relationships. The following methods are available and are automatically chosen depending on the settings of the Similarity function. The sim and groupsim methods allow the computation of semantic similarity between single terms (pairwise measures) and between group of terms (groupwise measures), respectively. Pairwise measures can be edge based, if they rely only on the structure of the ontology, or information-content based if they also consider the information that each term in the ontology carries. Rather, groupwise measures can be indirect, if they compute the pairwise similarity between each couple of terms, or direct if they consider each set of concepts as a whole. The samplesim method allows to determine the semantic similarity between two documents, each possibly associated to multiple concepts. Finally, the multisim method allows to determine the semantic similarity between documents annotated with two or more ontologies: first samplesim is run for each ontology, then a user defined function can be used to aggregate the resulting semantic similarities for each pair of documents.

The function listSimilarities shows all the measures supported by Onassis. For details about the measures run {?Similarity}.

#Instantiating the Similarity
similarities <- listSimilarities()

5.1 Semantic similarity between ontology terms

The following example shows pairwise similarities between the individual concepts of previously annotated ChIP-seq experiments metadata. The lin similarity measure is used by default, which relies on a ratio between the Information content (IC) of the terms most specific common ancestor, and the sum of their IC (based on the information content of their most informative common ancestor). In particular, the seco information content is used by default, which determines the specificity of each concept based on the number of concepts it subsumes.

#Retrieving URLS of concepts
found_terms <- as.character(unique(chipseq_dict_annot$term_url))

# Creating a dataframe with all possible couples of terms and adding a column to store the similarity
pairwise_results <- t(combn(found_terms, 2))
pairwise_results <- cbind(pairwise_results, rep(0, nrow(pairwise_results)))

# Similarity computation for each couple of terms
for(i in 1:nrow(pairwise_results)){
    pairwise_results[i, 3] <- Similarity(obo, pairwise_results[i,1], pairwise_results[i, 2])
}
colnames(pairwise_results) <- c('term1', 'term2', 'value')  

# Adding the term names from the annotation table to the comparison results 
pairwise_results <- merge(pairwise_results, chipseq_dict_annot[, c('term_url', 'term_name')], by.x='term2', by.y='term_url')
colnames(pairwise_results)[length(colnames(pairwise_results))] <- 'term2_name'
pairwise_results <- merge(pairwise_results, chipseq_dict_annot[, c('term_url', 'term_name')], by.x='term1', by.y='term_url')
colnames(pairwise_results)[length(colnames(pairwise_results))] <- 'term1_name'
pairwise_results <- unique(pairwise_results)
# Reordering the columns
pairwise_results <- pairwise_results[, c('term1', 'term1_name', 'term2', 'term2_name', "value")]
Table 5: Pairwise similarities of cell type terms annotating the ChIP-seq metadata
term1 term1_name term2 term2_name value
1 http://purl.obolibrary.org/obo/CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.729933133609861
16 http://purl.obolibrary.org/obo/CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000236 B cell 0.391845067198581
121 http://purl.obolibrary.org/obo/CL_0000066 epithelial cell http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm 0.0519216647312627
166 http://purl.obolibrary.org/obo/CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell 0.391845067198581
211 http://purl.obolibrary.org/obo/CL_0000066 epithelial cell http://purl.obolibrary.org/obo/UBERON_0000310 breast 0.0548979704654821
226 http://purl.obolibrary.org/obo/CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell 0.830359471632107
391 http://purl.obolibrary.org/obo/CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage 0.428161870137102
631 http://purl.obolibrary.org/obo/CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000542 lymphocyte 0.45270539422977
676 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000066 epithelial cell 0.0475176554118867
691 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000542 lymphocyte 0.0459017490429164
711 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell 0.0406258626128426
716 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell 0.0406258626128426
721 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.0475176554118867
761 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm 0.0438023301930405
806 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000236 B cell 0.0406258626128426
871 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland 0.0406258626128426
916 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/UBERON_0000310 breast 0.0459017490429164
931 http://purl.obolibrary.org/obo/CL_0000084 T cell http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage 0.0438023301930405
1021 http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell 0.407168835846172
1039 http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm 0.921811689010297
1075 http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell http://purl.obolibrary.org/obo/CL_0000542 lymphocyte 0.58549535261452
1093 http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.391845067198581
1165 http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage 0.558716417052732
1219 http://purl.obolibrary.org/obo/CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell 0.518199289825233
1231 http://purl.obolibrary.org/obo/CL_0000236 B cell http://purl.obolibrary.org/obo/UBERON_0000310 breast 0.0459017490429164
1255 http://purl.obolibrary.org/obo/CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.391845067198581
1303 http://purl.obolibrary.org/obo/CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell 0.335013243552152
1411 http://purl.obolibrary.org/obo/CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000542 lymphocyte 0.870134783835804
1519 http://purl.obolibrary.org/obo/CL_0000236 B cell http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm 0.0438023301930405
1579 http://purl.obolibrary.org/obo/CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage 0.921811689010297
1615 http://purl.obolibrary.org/obo/CL_0000542 lymphocyte http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm 0.0499984102249816
1617 http://purl.obolibrary.org/obo/CL_0000542 lymphocyte http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.45270539422977
1619 http://purl.obolibrary.org/obo/CL_0000542 lymphocyte http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell 0.378519810843921
1631 http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.428161870137102
1633 http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm 0.0475176554118867
1635 http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage http://purl.obolibrary.org/obo/CL_0000542 lymphocyte 0.947792987857007
1636 http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell 0.361207363224677
1641 http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.606106891151053
1647 http://purl.obolibrary.org/obo/UBERON_0000310 breast http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell 0.460045904728659
1651 http://purl.obolibrary.org/obo/UBERON_0000310 breast http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage 0.0499984102249816
1659 http://purl.obolibrary.org/obo/UBERON_0000310 breast http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.0548979704654821
1671 http://purl.obolibrary.org/obo/UBERON_0000310 breast http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm 0.501104300959006
1707 http://purl.obolibrary.org/obo/UBERON_0000310 breast http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell 0.870134783835804
1727 http://purl.obolibrary.org/obo/UBERON_0000310 breast http://purl.obolibrary.org/obo/CL_0000542 lymphocyte 0.0527524584646166
1763 http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell 0.439004679408628
1772 http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.0519216647312627
1778 http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/CL_0000055 non-terminally differentiated cell 0.0475176554118867
1796 http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/CL_0000066 epithelial cell 0.0475176554118867
1868 http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/CL_0000222 mesodermal cell 0.407168835846172
2030 http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/CL_0000236 B cell 0.0406258626128426
2066 http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/UBERON_0000926 mesoderm 0.439004679408628
2084 http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/CL_0000945 lymphocyte of B lineage 0.0438023301930405
2156 http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/CL_0002327 mammary gland epithelial cell 0.407168835846172
2255 http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/CL_0000542 lymphocyte 0.0459017490429164
2273 http://purl.obolibrary.org/obo/UBERON_0002367 prostate gland http://purl.obolibrary.org/obo/UBERON_0000310 breast 0.460045904728659

Noteworthy, the terms ‘B-cell’ and ‘lymphocyte’ are closer (similarity 0.83) than ‘B cell’ and ‘epithelial cell’ (similarity 0.26). It is also possible to compute the semantic similarity between two groups of terms. For example, to determine a value of similarity for the combination of (‘non-terminally differentiated cell’, ‘epithelial cell’) and the combination of (‘lymphocyte’ , ‘B cell’) we can use the ui measure (set as default measure in Onassis), a groupwise direct measure combining the intersection and the union of the set of ancestors of the two groups of concepts.

oboprefix <- 'http://purl.obolibrary.org/obo/'
Similarity(obo, paste0(oboprefix, c('CL_0000055', 'CL_0000066')) , paste0(oboprefix, c('CL_0000542', 'CL_0000236')))
## [1] 0.1764706

The similarity between these two groups of terms is low (in the interval [0, 1]), while the addition of the term ‘lymphocyte of B lineage’ to the first group the group similarity increases.

Similarity(obo, paste0(oboprefix, c('CL_0000055', 'CL_0000236' ,'CL_0000236')), paste0(oboprefix, c('CL_0000542', 'CL_0000066')))
## [1] 0.6470588

5.2 Semantic similarity between annotated samples

Similarity also supports the computation of the similarity between annotated samples. Since each sample is typically associated tu multiple terms, Similarity runs in the groupwise mode when applied to samples. To this end, samples identifiers and a data frame with previously annotated concepts returned by EntityFinder are required.

# Extracting all the couples of samples 
annotated_samples <- as.character(as.vector(unique(chipseq_dict_annot$sample_id)))

samples_couples <- t(combn(annotated_samples, 2))

# Computing the samples semantic similarity 
similarity_results <- apply(samples_couples, 1, function(couple_of_samples){
    Similarity(obo, couple_of_samples[1], couple_of_samples[2], chipseq_dict_annot)
})


#Creating a matrix to store the results of the similarity between samples
similarity_matrix <- matrix(0, nrow=length(annotated_samples), ncol=length(annotated_samples))
rownames(similarity_matrix) <- colnames(similarity_matrix) <- annotated_samples

# Filling the matrix with similarity values 
similarity_matrix[lower.tri(similarity_matrix, diag=FALSE)] <- similarity_results
similarity_matrix <- t(similarity_matrix)
similarity_matrix[lower.tri(similarity_matrix, diag=FALSE)] <- similarity_results 
# Setting the diagonal to 1
diag(similarity_matrix) <- 1

# Pasting the annotations to the sample identifiers
samples_legend <- aggregate(term_name ~ sample_id, chipseq_dict_annot, function(aggregation) paste(unique(aggregation), collapse=',' ))
rownames(similarity_matrix) <- paste0(rownames(similarity_matrix), ' (', samples_legend[match(rownames(similarity_matrix), samples_legend$sample_id), c('term_name')], ')')

# Showing a heatmap of the similarity between samples
heatmap.2(similarity_matrix, density.info = "none", trace="none", main='Samples\n semantic\n similarity', margins=c(12,50), cexRow = 2, cexCol = 2, keysize = 0.5)