1 Introduction to OnASSis

Public repositories contain thousands of experiments and samples that are difficult to mine. Annotating the description of this data with controlled vocabularies or ontology terms could improve the retrieval of data of interest, both programmatically and manually (Galeota and Pelizzola 2016). OnASSis (Ontology Annotations and Semantic Similarity software) is a package that matches metadata associated with biological experiments with concepts from ontologies, with the aim of obtaining semantically coherent omics datasets, possibly representing various data types and derived from independent studies. The recognition of domain-specific entities not only allows users to retrieve samples related to a given cell type or experimental condition, but also to discover different and not immediately obvious relationships between experiments. Onassis applies Natural Language Processing techniques to analyze sample and experiment descriptions, to recognize concepts from a multitude of biomedical ontologies, and to quantify the similarities/divergences between pairs or groups of query studies. The modules included in the software are introduced in the following paragraphs.

Onassis uses Conceptmapper, an Apache UIMA (Unstructured Information Management Architecture) dictionary lookup tool, to retrieve dictionary terms in a given text (https://uima.apache.org/downloads/sandbox/ConceptMapperAnnotatorUserGuide/ConceptMapperAnnotatorUserGuide.html).
In particular, the ccp-nlp Conceptmapper wrapper, specific for the biomedical domain, implements a pipeline through which it is possible to retrieve concepts from OBO ontologies in any given text with different adjustable options (Verspoor et al. 2009).

Onassis features can be easily accessed through a main class named Onassis and its slots. In the following sections we first show details on the usage of the classes and methods that constitute the building blocks of typical metadata integration workflows, and then we show how their usage is further simplified by the Onassis class.

Onassis can handle any type of text as input, but is particularly well suited to the analysis of metadata from the Gene Expression Omnibus (GEO). This represents a fundamental first step in the integrative analysis of the data from those repositories (Galeota and Pelizzola 2016). Indeed, it makes it possible to associate concepts from any OBO ontology with GEO metadata retrieved using GEOmetadb, and also with SRA metadata. In addition to the ontological concepts, the recognition of gene/protein symbols or epigenetic modifications can be highly relevant, especially for experiments directed to those specific factors or marks (such as ChIP-seq experiments).

The semantic similarity module uses different measures to determine the semantic similarity of concepts in a given ontology. This module has been developed on the basis of the Java library slib (http://www.semantic-measures-library.org/sml).

2 Installing suggested libraries to run the examples

Onassis requires Java (>= 1.8). To run the following examples correctly, please install the following libraries:

source("https://bioconductor.org/biocLite.R")
biocLite('org.Hs.eg.db')
biocLite("GenomicRanges")
install.packages('data.table')
install.packages('DT')
install.packages('gplots')

3 Retrieving public repositories metadata

One of the most straightforward ways to retrieve GEO metadata is through the GEOmetadb package. In order to use these metadata through Onassis, the user should download the corresponding SQLite database following the instructions provided in the package's vignette. While the SQLite source databases are required, Onassis gives access to those data without any knowledge of SQL programming, by providing functions that retrieve the metadata without the need to explicitly write queries to the database.

3.1 Handling GEO (Gene Expression Omnibus) Metadata

First, it is necessary to obtain a connection to the SQLite database. connectToGEODB returns a connection to the database given the path of the SQLite database file. If the latter is missing, it will be automatically downloaded into the current working directory. Because of the size of these files (0.5-4GB), the results of the queries illustrated below are already available in Onassis for the subsequent analyses illustrated in this document. Then, the getGEOMetadata function can be used to retrieve the metadata of specific GEO samples, taking as minimal parameters the connection to the database and one of the available experiment types. Optionally, it is possible to specify the organism and the platform.

## Running this function might take a long time if the database has to be downloaded.
geo_con <- connectToGEODB(download=TRUE)

#Showing the experiment types available in GEO
experiments <- experiment_types(geo_con)

#Showing the organism types available in GEO
species <- organism_types(geo_con)

#Retrieving the metadata associated with the experiment type "Methylation profiling by high throughput sequencing"
meth_metadata <- getGEOMetadata(geo_con, experiment_type='Methylation profiling by high throughput sequencing', organism = 'Homo sapiens')

#Retrieving Human gene expression metadata, knowing the GEO platform identifier, e.g. the Affymetrix Human Genome U133 Plus 2.0 Array
expression <- getGEOMetadata(geo_con, experiment_type='Expression profiling by array', gpl='GPL570')

Some of the experiment types available are the following:

Experiment
Expression profiling by MPSS
Expression profiling by RT-PCR
Expression profiling by SAGE
Expression profiling by SNP array
Expression profiling by array
Expression profiling by genome tiling array
Expression profiling by high throughput sequencing
Genome binding/occupancy profiling by SNP array
Genome binding/occupancy profiling by array
Genome binding/occupancy profiling by genome tiling array

Some of the organisms available are the following:

Species
Homo sapiens
Drosophila melanogaster
Mus musculus
Zea mays
Arabidopsis thaliana
Caenorhabditis elegans
Helicobacter pylori
Escherichia coli
Rattus norvegicus
Saccharomyces cerevisiae

As specified earlier in this document, to correctly query GEOmetadb it is necessary to download the SQLite file, which occupies several GB of disk space. Only for this vignette, meth_metadata was previously saved from the getGEOMetadata function and can be loaded from the Onassis external data:

meth_metadata <- readRDS(system.file('extdata', 'vignette_data', 'GEOmethylation.rds', package='Onassis'))
Table 1: Methylation profiling by high throughput sequencing metadata from GEOmetadb.
series_id gsm title gpl source_name_ch1 organism_ch1 characteristics_ch1 description experiment_title experiment_summary
1251 GSE42590 GSM1045538 2316_DLPFC_Control GPL10999 Brain (dorsolateral prefrontal cortex) Homo sapiens tissue: Heterogeneous brain tissue NA Genome-wide DNA methylation profiling of human dorsolateral prefrontal cortex Reduced representation bisulfite sequencing (RRBS)
511 GSE27432 GSM678217 hEB16d_H9_p65_RRBS GPL9115 embryoid body from hES H9 p65 Homo sapiens cell type: hEB16d_H9_p65 reduced representation bisulfite sequencing Genomic distribution and inter-sample variation of non-CG methylation across human cell types DNA methylation plays an important role in develop
2731 GSE58889 GSM1421876 Normal_CD19_11 GPL11154 Normal CD19+ cells Homo sapiens cell type: Normal CD19+ cells; disease status: healthy NA Methylation disorder in CLL We performed RRBS and WGBS on primary human chroni
1984 GSE50761 GSM1228607 Time Course Off-target Day 7 1 HBB133 GPL15520 K562 cells Homo sapiens cell line: K562 cells; target loci: Time Course Off-target Day 7 1 2013.03.16._MM364_analysis.csv Targeted DNA demethylation using TALE-TET1 fusion proteins Recent large-scale studies have defined genomewide
851 GSE36173 GSM882245 H1 human ES cells GPL10999 H1 human ES cells Homo sapiens cell line: H1 5-hmC whole genome bisulfite sequencing Base Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome The study of 5-hydroxylmethylcytosines (5hmC), the
1966 GSE50761 GSM1228589 Time Course HB-6 Day 4 1 HBB115 GPL15520 K562 cells Homo sapiens cell line: K562 cells; target loci: Time Course HB-6 Day 4 1 2013.03.16._MM364_analysis.csv Targeted DNA demethylation using TALE-TET1 fusion proteins Recent large-scale studies have defined genomewide
1827 GSE50761 GSM1228450 Off target -650 to -850 3 RHOX117 GPL15520 293 cells Homo sapiens cell line: 293 cells; target loci: Off target -650 to -850 3 2013-07-23-MM195-288-394_analysis.csv Targeted DNA demethylation using TALE-TET1 fusion proteins Recent large-scale studies have defined genomewide
378 GSE26592 GSM655200 Endometrial Recurrent 5 GPL9052 Human endometrial specimen Homo sapiens tissue: Human endometrial specimen; cell type: primary tissues; disease status: Recurrent; chromatin selection: MBD protein MBDCap using MethylMiner Methylated DNA Enrichment Kit (Invitrogen, ME 10025); library strategy: Endometrial samples: MBDCao-seq. Breast cells: MBDCap-seq.; library selection: Endometrial samples: MBDCap. Breast cells: MBDCap-seq. Neighboring genomic regions influence differential methylation patterns of CpG islands in endometrial and breast cancers We report the global methylation patterns by MBDCa
1754 GSE50761 GSM1228377 Initial Screen RH-3 -250-+1 2 RHOX44 GPL15520 HeLa cells Homo sapiens cell line: HeLa cells; target loci: Initial Screen RH-3 -250-+1 2 2013-07-12-MM564_analysis.csv Targeted DNA demethylation using TALE-TET1 fusion proteins Recent large-scale studies have defined genomewide
2371 GSE54961 GSM1327281 Healthy Control GPL9052 Healthy Control Homo sapiens etiology: Healthy Control; tissue: Peripheral venous blood; molecule subtype: serum cell-free DNA Sample 1 Epigenome analysis of serum cell-free circulating DNA in progression of HBV-related Hepatocellular carcinoma Purpose: Aberrantly methylated DNA are hallmarks

3.2 Handling SRA (Sequence Read Archive) Metadata

In this section we provide an example showing how it is possible to retrieve data from other sources such as SRA. In this case we only show how to query the database and store the metadata in a data frame. The following code requires the file SRAdb.sqlite, containing SRA metadata. Also in this case the database file occupies several GB of disk space, and running this part of the code is optional. The database file can be obtained and queried in R through the Bioconductor package SRAdb.

The following code shows how to obtain SRA metadata of human ChIP-Seq samples; samples from other library strategies, such as bisulfite sequencing, can be retrieved in the same way by changing the library_strategy value:

# Connection to the SRAmetadb SQLite file, which must already be available locally
# (see the SRAdb package documentation for how to download it)
library(RSQLite)
sqliteFileName <- './data/SRAdb.sqlite'
sra_con <- dbConnect(SQLite(), sqliteFileName)

# Query for the ChIP-Seq experiments contained in GEO for human samples 
library_strategy <- 'ChIP-Seq' #ChIP-Seq data
library_source <- 'GENOMIC'
taxon_id <- 9606 #Human samples
center_name <- 'GEO' #Data from GEO

# Query to the sample table 
samples_query <- paste0("select sample_accession, description, sample_attribute, sample_url_link from sample where taxon_id='", taxon_id, "' and sample_accession IS NOT NULL", " and center_name='", center_name, "'")

samples_df <- dbGetQuery(sra_con, samples_query)
samples <- unique(as.character(as.vector(samples_df[, 1])))

# Query to the experiment table
experiment_query <- paste0("select experiment_accession, center_name, title, sample_accession, sample_name, experiment_alias, library_strategy, library_layout, experiment_url_link, experiment_attribute from experiment where library_strategy='", 
                           library_strategy, "'" , " and library_source ='", library_source,
                           "' " )

experiment_df <- dbGetQuery(sra_con, experiment_query)

#Merging the columns from the sample and the experiment table
experiment_df <- merge(experiment_df, samples_df, by = "sample_accession")

# Replacing the '||' field separators with white spaces.
# fixed = TRUE is needed because '|' is the alternation operator in regular expressions
experiment_df$experiment_attribute <- gsub("||", "  ", experiment_df$experiment_attribute, fixed = TRUE)
experiment_df$sample_attribute <- gsub("||", "  ", experiment_df$sample_attribute, fixed = TRUE)
# Replacing the '_' character with whitespaces
experiment_df$sample_name <- gsub("_", " ", experiment_df$sample_name, fixed = TRUE)
experiment_df$experiment_alias <- gsub("_", " ", experiment_df$experiment_alias, fixed = TRUE)
sra_chip_seq <- experiment_df
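
The replacement of the '||' separator deserves a quick check on a toy string: gsub() interprets its pattern as a regular expression by default, where '|' is the alternation operator, so the literal separator is only matched when fixed = TRUE is set. The following minimal sketch uses a made-up attribute string, not real SRA content:

```r
# Made-up attribute string using the "||" field separator
attr_string <- "source_name: LCL || chip antibody: H3K4me3 || cell type: lymphoblastoid cell line"

# fixed = TRUE matches the two pipe characters literally
cleaned <- gsub("||", "  ", attr_string, fixed = TRUE)
print(cleaned)

# Without fixed = TRUE the pattern is a regex made of empty alternatives:
# it matches empty strings, so the pipe characters themselves are never removed
regex_version <- gsub("||", "  ", attr_string)
print(grepl("|", regex_version, fixed = TRUE))
```

The same consideration applies to any separator containing regex metacharacters, such as '.', '+' or '('.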

To avoid downloading the SRAmetadb database, sra_chip_seq was previously saved and can be loaded from Onassis:

sra_chip_seq <- readRDS(system.file('extdata', 'vignette_data', 'GEO_human_chip.rds',  package='Onassis'))
Table 2: ChIP-Seq metadata obtained from SRAdb
sample_accession experiment_accession center_name title library_strategy library_layout experiment_url_link experiment_attribute description sample_attribute sample_url_link
5904 SRS421364 SRX278504 GEO GSM1142700: p53 ChIP LCL nutlin-3 treated; Homo sapiens; ChIP-Seq ChIP-Seq SINGLE - GEO Sample: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1142700 GEO Accession: GSM1142700 NA source_name: lymphoblastoid cells || cell type: nutlin-3 treated lymphoblastoid cells || coriell id: GM12878 || chip antibody: mouse monoclonal anti-human p53 (BD Pharmingen, cat# 554294) || BioSampleModel: Generic GEO Sample GSM1142700: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1142700
4981 SRS371783 SRX199902 GEO GSM1022674: UW_ChipSeq_A549_InputRep1 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1022674 GEO Accession: GSM1022674 NA source_name: A549 || biomaterial_provider: ATCC || lab: UW || lab description: Stamatoyannopoulous - University of Washington || datatype: ChipSeq || datatype description: Chromatin IP Sequencing || cell: A549 || cell organism: human || cell description: epithelial cell line derived from a lung carcinoma tissue. (PMID: 175022), “This line was initiated in 1972 by D.J. Giard, et al. through explant culture of lung carcinomatous tissue from a 58-year-old caucasian male.” - ATCC, newly promoted to tier 2: not in 2011 analysis || cell karyotype: cancer || cell lineage: endoderm || cell sex: M || antibody: Input || antibody description: Control signal which may be subtracted from experimental raw signal before peaks are called. || treatment: None || treatment description: No special treatment or protocol applies || control: std || control description: Standard input signal for most experiments. || controlid: wgEncodeEH001904 || labexpid: DS18301 || labversion: WindowDensity-bin20-win+/-75 || replicate: 1 || BioSampleModel: Generic GEO Sample GSM1022674: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1022674
4619 SRS365824 SRX190055 GEO GSM945272: UW_ChipSeq_HRPEpiC_Input ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM945272 GEO Accession: GSM945272 NA source_name: HRPEpiC || biomaterial_provider: ScienCell || lab: UW || lab description: Stamatoyannopoulous - University of Washington || datatype: ChipSeq || datatype description: Chromatin IP Sequencing || cell: HRPEpiC || cell organism: human || cell description: retinal pigment epithelial cells || cell karyotype: normal || cell lineage: ectoderm || cell sex: U || antibody: Input || antibody description: Control signal which may be subtracted from experimental raw signal before peaks are called. || treatment: None || treatment description: No special treatment or protocol applies || control: std || control description: Standard input signal for most experiments. || controlid: wgEncodeEH000962 || labexpid: DS16014 || labversion: Bowtie 0.12.7 || replicate: 1 || BioSampleModel: Generic GEO Sample GSM945272: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM945272
911 SRS117344 SRX028649 GEO GSM608166: H3K27me3_K562_ChIP-seq_rep1 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM608166 GEO Accession: GSM608166 NA source_name: chronic myeloid leukemia cell line || cell line: K562 || harvest date: 2008-06-12 || chip antibody: CST monoclonal rabbit rabbit anti-H3K27me3 || BioSampleModel: Generic GEO Sample GSM608166: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM608166
4244 SRS362733 SRX186665 GEO GSM1003469: Broad_ChipSeq_Dnd41_H3K79me2 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1003469 GEO Accession: GSM1003469 NA source_name: Dnd41 || biomaterial_provider: DSMZ || datatype: ChipSeq || datatype description: Chromatin IP Sequencing || antibody antibodydescription: Rabbit polyclonal antibody raised against a peptide containing K79 di-methylation. Antibody Target: H3K79me2 || antibody targetdescription: H3K79me2 is a mark of the transcriptional transition region - the region between the initiation marks (K4me3, etc) and the elongation marks (K36me3). || antibody vendorname: Active Motif || antibody vendorid: 39143 || controlid: wgEncodeEH002434 || replicate: 1,2 || softwareversion: ScriptureVPaperR3 || cell sex: M || antibody: H3K79me2 || antibody antibodydescription: Rabbit polyclonal antibody raised against a peptide containing K79 di-methylation. Antibody Target: H3K79me2 || antibody targetdescription: H3K79me2 is a mark of the transcriptional transition region - the region between the initiation marks (K4me3, etc) and the elongation marks (K36me3). || antibody vendorname: Active Motif || antibody vendorid: 39143 || treatment: None || treatment description: No special treatment or protocol applies || control: std || control description: Standard input signal for most experiments. || controlid: Dnd41/Input/std || softwareversion: ScriptureVPaperR3 || BioSampleModel: Generic GEO Sample GSM1003469: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1003469
7502 SRS494656 SRX369112 GEO GSM1252315: CHG092; Homo sapiens; ChIP-Seq ChIP-Seq SINGLE - GEO Sample GSM1252315: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1252315 GEO Accession: GSM1252315 NA source_name: Gastric Primary Sample || tissuetype: Tumor || chip antibody: H3K4me1 || reads length: 101 || BioSampleModel: Generic GEO Sample GSM1252315: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1252315
2127 SRS266173 SRX099863 GEO GSM808752: MCF7_CTCF_REP1 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM808752 GEO Accession: GSM808752: NA source_name: breast adenocarcinoma cells || cell type: breast adenocarcinoma cells || cell line: MCF7 || antibody: CTCF || BioSampleModel: Generic GEO Sample GSM808752: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM808752
6299 SRS468164 SRX332680 GEO GSM1204476: Input DNA for ChIP; Homo sapiens; ChIP-Seq ChIP-Seq SINGLE - GEO Sample GSM1204476: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1204476 GEO Accession: GSM1204476 NA source_name: MDAMB231 || cell line: MDAMB231 || chip antibody: input || BioSampleModel: Generic GEO Sample GSM1204476: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1204476
832 SRS115184 SRX027300 GEO GSM593367: H3K4me3_H3 ChIP-Seq SINGLE - GEO Web Link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM593367 GEO Accession: GSM593367 NA source_name: LCL || chip antibody: H3K4me3 || cell type: lymphoblastoid cell line || BioSampleModel: Generic GEO Sample GSM593367: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM593367
8638 SRS598154 SRX528309 GEO GSM1375207: H3_ChIPSeq_Human; Homo sapiens; ChIP-Seq ChIP-Seq SINGLE - GEO Sample GSM1375207: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1375207 GEO Accession: GSM1375207 NA source_name: H3_ChIPSeq_Human || donor age: adult || cell type: sperm || chip antibody: H3F3B || chip antibody vendor: Abnova || BioSampleModel: Generic GEO Sample GSM1375207: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1375207

4 Annotating text with Ontology Concepts

The Onassis EntityFinder class has methods for annotating any text with dictionary terms. More specifically, Onassis can take advantage of the OBO dictionaries (http://www.obofoundry.org/), as shown in the next section where we will define relationships between different samples annotated with different ontology concepts thanks to the structure of the ontology.

4.1 Data preparation

Input text can be provided as:

  • The path of a directory containing named documents (findEntities method).
  • The path of a single file containing multiple documents. In this case each row contains the name/identifier of the document followed by a ‘|’ separator and the text to annotate (findEntities method).
  • A data frame. In this case each row represents a document, the first column has to be the document identifier, and the remaining columns will be combined and contain the text to analyze (annotateDF method). This option can be conveniently used with the metadata retrieved from GEOmetadb and SRAdb, possibly selecting a subset of columns.
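
The data frame and single-file layouts described above can be sketched in base R as follows. This is only an illustration with made-up values: the way the text columns are joined (a single space, skipping missing values) is an assumption, not necessarily what annotateDF does internally.

```r
# Toy metadata table: the first column is the document identifier,
# the remaining columns hold the text to annotate (values are made up)
metadata <- data.frame(
  sample_id   = c("S1", "S2"),
  title       = c("H3K4me3 ChIP in B cells", "Input DNA"),
  description = c("lymphoblastoid cell line", NA),
  stringsAsFactors = FALSE
)

# Combine the text columns into one string per document, skipping missing values
text_columns <- setdiff(colnames(metadata), "sample_id")
combined_text <- apply(metadata[, text_columns, drop = FALSE], 1, function(fields) {
  paste(fields[!is.na(fields)], collapse = " ")
})

# Single-file layout: one document per row, identifier and text separated by '|'
doc_lines <- paste(metadata$sample_id, combined_text, sep = "|")
input_file <- tempfile(fileext = ".txt")
writeLines(doc_lines, input_file)
print(readLines(input_file))
```

Selecting a subset of columns before combining them, as done here with text_columns, keeps irrelevant fields (for example URLs) out of the text that is annotated.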

4.2 Creation of a Conceptmapper Dictionary

Conceptmapper dictionaries are XML files with a set of entries specified by the xml tag <token> with a canonical name (the name of the entry) and one or more variants (synonyms). Additional properties are allowed. The following code represents a sample of the Conceptmapper dictionary obtained from the Brenda tissue ontology.

   <?xml version="1.0" encoding="UTF-8" ?>
   <synonym>
      <token id="http://purl.obolibrary.org/obo/BTO_0005205" canonical="cerebral artery">
          <variant base="cerebral artery"/>
      </token>
      <token id="http://purl.obolibrary.org/obo/BTO_0002179" canonical="184A1N4 cell">
          <variant base="184A1N4 cell"/>
          <variant base="A1N4 cell"/>
      </token>
      <token id="http://purl.obolibrary.org/obo/BTO_0003871" canonical="uterine endometrial cancer cell">
          <variant base="uterine endometrial cancer cell"/>
          <variant base="endometrial cancer cell"/>
          <variant base="uterine endometrial carcinoma cell"/>
          <variant base="endometrial carcinoma cell"/>
      </token>
  </synonym>
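
To make the structure of such a file concrete, the following sketch builds a dictionary with the same layout from an R list. The helper names xml_escape and build_dictionary_xml are hypothetical and are not part of Onassis, which generates these files through the CMdictionary constructor; the entry is taken from the sample above.

```r
# Escape the characters that are not allowed inside XML attribute values
xml_escape <- function(x) {
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  gsub('"', "&quot;", x, fixed = TRUE)
}

# Hypothetical helper: 'entries' is a list of lists with id, canonical and variants
build_dictionary_xml <- function(entries) {
  tokens <- vapply(entries, function(entry) {
    variants <- sprintf('      <variant base="%s"/>', xml_escape(entry$variants))
    paste(c(
      sprintf('   <token id="%s" canonical="%s">',
              xml_escape(entry$id), xml_escape(entry$canonical)),
      variants,
      '   </token>'
    ), collapse = "\n")
  }, character(1))
  paste(c('<?xml version="1.0" encoding="UTF-8" ?>', '<synonym>', tokens, '</synonym>'),
        collapse = "\n")
}

entries <- list(
  list(id = "http://purl.obolibrary.org/obo/BTO_0002179",
       canonical = "184A1N4 cell",
       variants = c("184A1N4 cell", "A1N4 cell"))
)
dictionary_xml <- build_dictionary_xml(entries)
cat(dictionary_xml, "\n")
```

Each token carries one variant element per synonym, with the canonical name repeated as the first variant, mirroring the sample shown above.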

The constructor CMdictionary creates an instance of the class CMdictionary.

  • If an XML file containing the Conceptmapper dictionary is already available, it can be loaded into Onassis by indicating its path and setting the dictType option to “CMDICT”.
  • If the dictionary has to be built from an ontology (as a file in the OBO or OWL format), its path has to be provided and dictType has to be set to “OBO”. The synonymType argument can be set to EXACT_ONLY, to consider only canonical concept names, or to ALL, to also include any synonym. The resulting XML file is written in the indicated outputDir. Alternatively, to automatically download the ontology, the URL where the OBO file is located can be provided.
  • To build a dictionary containing only gene/protein names, dictType has to be set to either TARGET or ENTREZ, to include histone types and marks or not, respectively. If a specific org.xx.eg.db Bioconductor library is indicated in the inputFileOrDb parameter as a character string, gene names will be derived from it. Instead, if inputFileOrDb is empty and a specific species is indicated in the taxID parameter, the gene_info.gz file hosted at NCBI will be downloaded and used to find gene names. If available, this file can be located with the inputFile parameter. Otherwise, it will be automatically downloaded (300MB).

# If a Conceptmapper dictionary is already available the dictType CMDICT can be specified and the corresponding file loaded
sample_dict <- CMdictionary(inputFileOrDb=system.file('extdata', 'cmDict-sample.cs.xml', package = 'Onassis'), dictType = 'CMDICT')

#Creation of a dictionary from the file sample.cs.obo available in OnassisJavaLibs
obo <- system.file('extdata', 'sample.cs.obo', package='OnassisJavaLibs')

sample_dict <- CMdictionary(inputFileOrDb=obo, dictType='OBO', outputDir=getwd(), synonymType='ALL')

# Creation of a dictionary for human genes/proteins
require(org.Hs.eg.db)
targets <- CMdictionary(dictType='TARGET', inputFileOrDb = 'org.Hs.eg.db')

4.3 Setting the options for the annotator

Conceptmapper includes 7 different options controlling the annotation step. These are documented in detail in the documentation of the CMoptions function. They can be listed through the listCMOptions function. The CMoptions constructor instantiates an object of class CMoptions with the different parameters that will be required for the subsequent step of annotation. We also provided getter and setter methods for each of the 7 parameters.

#Creating a CMoptions object and showing the default parameters 
opts <- CMoptions()  
show(opts)
## CMoptions object to set ConceptMapper Options
## SearchStrategy: CONTIGUOUS_MATCH
## CaseMatch: CASE_INSENSITIVE
## Stemmer: NONE
## StopWords: NONE
## OrderIndependentLookup: ON
## FindAllMatches: YES
## SynonymType: ALL

To list the possible combinations:

combinations <- listCMOptions()

To create a CMoptions object with SynonymType set to ‘EXACT_ONLY’:

myopts <- CMoptions(SynonymType = 'EXACT_ONLY')
myopts
## CMoptions object to set ConceptMapper Options
## SearchStrategy: CONTIGUOUS_MATCH
## CaseMatch: CASE_INSENSITIVE
## Stemmer: NONE
## StopWords: NONE
## OrderIndependentLookup: ON
## FindAllMatches: YES
## SynonymType: EXACT_ONLY

To change a given parameter:

#Changing the SearchStrategy parameter
SearchStrategy(myopts) <- 'SKIP_ANY_MATCH_ALLOW_OVERLAP'
myopts
## CMoptions object to set ConceptMapper Options
## SearchStrategy: SKIP_ANY_MATCH_ALLOW_OVERLAP
## CaseMatch: CASE_INSENSITIVE
## Stemmer: NONE
## StopWords: NONE
## OrderIndependentLookup: ON
## FindAllMatches: YES
## SynonymType: EXACT_ONLY

4.4 Running the entity finder

The class EntityFinder is used to define a type system and run the Conceptmapper pipeline. It can find concepts of any OBO ontology in a given text. The findEntities and annotateDF methods accept text within files or a data.frame, respectively, as described in Section 4.1. The function EntityFinder automatically adapts to the provided input type, creates an instance of the EntityFinder class to initialize the type system, and runs the pipeline with the provided options and dictionary. For example, to annotate the metadata derived from ChIP-seq experiments obtained from SRA with tissue and cell type concepts from the dictionary built in Section 4.2, the following code can be used:

chipseq_dict_annot <- EntityFinder(sra_chip_seq[1:20,c('sample_accession', 'title', 'experiment_attribute', 'sample_attribute', 'description')], dictionary=sample_dict, options=myopts)

The resulting data.frame contains, in each row, a match to the provided dictionary for a specific document/sample (indicated in the first column). The annotation is reported with the id of the concept (term_id), its canonical name (term_name), its URL in the OBO format (term_url), and the matching sentence of the document (matched_sentence).
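
The shape of this output can be illustrated with a toy dictionary lookup written in base R. This is not the Conceptmapper algorithm, only a simplified case-insensitive, contiguous substring search; the function find_matches and the documents are made up for illustration, while the dictionary entries come from the sample shown in Section 4.2.

```r
# Toy dictionary: one row per variant
dictionary <- data.frame(
  term_id   = c("BTO_0005205", "BTO_0002179", "BTO_0002179"),
  term_name = c("cerebral artery", "184A1N4 cell", "184A1N4 cell"),
  variant   = c("cerebral artery", "184A1N4 cell", "A1N4 cell"),
  stringsAsFactors = FALSE
)

# Made-up documents, named by their identifier
documents <- c(S1 = "Cerebral Artery endothelium",
               S2 = "A1N4 cell culture",
               S3 = "liver biopsy")

# Hypothetical helper returning one row per (document, variant) match
find_matches <- function(documents, dictionary) {
  rows <- list()
  for (doc_id in names(documents)) {
    text <- documents[[doc_id]]
    for (i in seq_len(nrow(dictionary))) {
      # Case-insensitive, contiguous lookup of the variant in the text
      pos <- regexpr(tolower(dictionary$variant[i]), tolower(text), fixed = TRUE)
      if (pos > 0) {
        rows[[length(rows) + 1]] <- data.frame(
          sample_id = doc_id,
          term_id = dictionary$term_id[i],
          term_name = dictionary$term_name[i],
          matched_sentence = substr(text, pos, pos + attr(pos, "match.length") - 1),
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, rows)
}

annotations <- find_matches(documents, dictionary)
print(annotations)
```

As in the real output, a document can appear in several rows (one per matched concept or variant) or in none, and the canonical name is reported even when the match was found through a synonym.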

Table 3: Annotations of the ChIP-seq metadata obtained from SRA with the concepts of the sample dictionary
sample_id term_id term_name term_url matched_sentence
SRS115184 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS117344 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS213443 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell: HCPEpiC || cell organism: Human || cell description: Human Choroid Plexus Epithelial
SRS213443 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS213443 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell organism: Human || cell description: Human Choroid Plexus Epithelial
SRS213443 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell description: Human Choroid Plexus Epithelial
SRS213443 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 Epithelial Cells || cell
SRS241934 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS266173 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS285318 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS336079 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 cell: GM12878 || cell organism: human || cell description: B
SRS336079 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS336079 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 cell organism: human || cell description: B
SRS336079 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 cell description: B
SRS336079 CL_0000945 lymphocyte of B lineage http://purl.obolibrary.org/obo/CL_0000945 B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Virus || cell karyotype: normal || cell lineage: mesoderm || cell sex: F || treatment: None || treatment description: No special treatment or protocol applies || antibody: Pol2(phosphoS2) || antibody antibodydescription: Rabbit polyclonal against peptide conjugated to KLH derived from within residues 1600 - 1700 of
SRS336079 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-lymphocyte, lymphoblastoid, International HapMap Project - CEPH/Utah - European Caucasion, Epstein-Barr Virus || cell
SRS336079 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-lymphocyte
SRS336079 CL_0000542 lymphocyte http://purl.obolibrary.org/obo/CL_0000542 lymphocyte
SRS346539 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS362733 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell

The function EntityFinder can also be used to identify the targeted entity of each ChIP-seq experiment, by retrieving gene names and histone types or modifications in the ChIP-seq metadata.

#Finding the TARGET entities
target_entities <- EntityFinder(input=sra_chip_seq[1:20,c('sample_accession', 'title', 'experiment_attribute', 'sample_attribute', 'description')], options = myopts, dictionary=targets) 
Table 4: Annotations of ChIP-seq test metadata obtained from SRAdb and stored into files with the TARGETs (genes and histone variants)
sample_id term_id term_name term_url matched_sentence
SRS115184 Reference T1 H3K4me3 H3K4me3 NA H3K4me3
SRS117344 Reference T1 H3K27me3 H3K27me3 NA H3K27me3
SRS362733 Reference T1 H3K79me2 H3K79me2 NA H3K79me2
SRS362733 Reference T2 H3K79me2 H3K79me2 NA H3K79me2
SRS362733 Reference T3 H3K79me2 H3K79me2 NA H3K79me2
SRS362733 Reference T4 H3K79me2 H3K79me2 NA H3K79me2
SRS362733 Reference T5 H3K79me2 H3K79me2 NA H3K79me2
SRS362801 Reference T1 H3K79me2 H3K79me2 NA H3K79me2
SRS362801 Reference T2 H3K79me2 H3K79me2 NA H3K79me2
SRS362801 Reference T3 H3K79me2 H3K79me2 NA H3K79me2
SRS362801 Reference T4 H3K79me2 H3K79me2 NA H3K79me2
SRS362801 Reference T5 H3K79me2 H3K79me2 NA H3K79me2
SRS410226 Reference T1 H3K27ac H3K27ac NA H3K27ac
SRS494656 Reference T1 H3K4me1 H3K4me1 NA H3K4me1

5 Semantic similarity

With Onassis it is possible to quantify the semantic similarity between the concepts of a given ontology using different semantic similarity measures. Similarity is an Onassis class applying methods of the Java library slib (Harispe et al. 2014), which builds a semantic graph starting from OBO ontology concepts and their hierarchical relationships. The following methods are available and are automatically chosen depending on the settings of the Similarity function. The sim and groupsim methods allow the computation of semantic similarity between single terms (pairwise measures) and between groups of terms (groupwise measures), respectively. Pairwise measures can be edge based, if they rely only on the structure of the ontology, or information-content based, if they also consider the information that each term in the ontology carries. Groupwise measures, instead, can be indirect, if they compute the pairwise similarity between each couple of terms, or direct, if they consider each set of concepts as a whole. The samplesim method determines the semantic similarity between two documents, each possibly associated with multiple concepts, using groupwise measures. Finally, the multisim method determines the semantic similarity between documents annotated with two or more ontologies: first samplesim is run for each ontology, then a user-defined function can be used to aggregate the resulting semantic similarities for each pair of documents.
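
The idea behind indirect groupwise measures and the aggregation across ontologies can be sketched on made-up numbers. The best-match average used below is only one possible indirect strategy, chosen here for illustration; it is not necessarily the measure that Onassis applies by default, and all similarity values are invented.

```r
# Made-up symmetric pairwise similarities between four concepts
terms <- c("A", "B", "C", "D")
pairwise <- matrix(c(1.0, 0.6, 0.2, 0.1,
                     0.6, 1.0, 0.4, 0.3,
                     0.2, 0.4, 1.0, 0.5,
                     0.1, 0.3, 0.5, 1.0),
                   nrow = 4, byrow = TRUE, dimnames = list(terms, terms))

# Indirect groupwise measure (best-match average): each concept is paired
# with its most similar concept in the other set, then the scores are averaged
best_match_average <- function(set1, set2, pairwise) {
  block <- pairwise[set1, set2, drop = FALSE]
  (mean(apply(block, 1, max)) + mean(apply(block, 2, max))) / 2
}

# Two documents, each annotated with two concepts of the same ontology
doc1 <- c("A", "B")
doc2 <- c("C", "D")
cell_similarity <- best_match_average(doc1, doc2, pairwise)

# Aggregating the similarities from two ontologies with a user-defined function
disease_similarity <- 0.8  # made-up value standing in for a second ontology
combined <- mean(c(cell_similarity, disease_similarity))
print(c(cell = cell_similarity, disease = disease_similarity, combined = combined))
```

Replacing mean with another function, such as min or a weighted average, changes how strongly each ontology contributes to the final document similarity.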

The function listSimilarities shows all the measures supported by Onassis. For details about the measures, run ?Similarity.

# Listing the similarity measures supported by Onassis
similarities <- listSimilarities()

The following example shows the pairwise similarities between the cell concepts found by annotating the ChIP-seq metadata. The resnik similarity measure is used by default; it is based on the information content of the most informative common ancestor of the two concepts. The information content is in turn computed with the seco method by default, which determines the specificity of each concept from the number of concepts it subsumes.

# Unique ontology terms found in the ChIP-seq annotations
found_terms <- unique(chipseq_dict_annot$term_url)
n <- length(found_terms)

ontologyfile <- obo
pairwise_results <- data.frame(term1 = character(0), term2 = character(0), value = double(0L))
# Compute the similarity of each unordered pair of terms once
for(i in seq_len(n - 1)){
  term1 <- as.character(found_terms[i])
  j <- i + 1
  for(k in j:n){
    term2 <- as.character(found_terms[k])
    two_term_similarity <- Similarity(ontologyfile, term1, term2)
    new_row <- cbind(term1, term2, two_term_similarity)
    pairwise_results <- rbind(pairwise_results, new_row)
  }
}
pairwise_results <- unique(pairwise_results)
# Attach the human-readable names of term2 and term1 to each pair
pairwise_results <- merge(pairwise_results, chipseq_dict_annot[, c('term_url', 'term_name')], by.x='term2', by.y='term_url', all.x=TRUE)
colnames(pairwise_results)[length(colnames(pairwise_results))] <- 'term2_name'
pairwise_results <- merge(pairwise_results, chipseq_dict_annot[, c('term_url', 'term_name')], by.x='term1', by.y='term_url', all.x=TRUE)
colnames(pairwise_results)[length(colnames(pairwise_results))] <- 'term1_name'
# The merges repeat rows for terms annotated in several samples; drop the duplicates
pairwise_results <- unique(pairwise_results)
Table 5: Pairwise similarities of cell line terms annotating the ChIP-seq metadata
term1 term2 two_term_similarity term2_name term1_name
1 http://purl.obolibrary.org/obo/CL_0000000 http://purl.obolibrary.org/obo/CL_0000066 0.226149129142891 epithelial cell cell
205 http://purl.obolibrary.org/obo/CL_0000000 http://purl.obolibrary.org/obo/CL_0000236 0.226149129142891 B cell cell
290 http://purl.obolibrary.org/obo/CL_0000000 http://purl.obolibrary.org/obo/CL_0000945 0.226149129142891 lymphocyte of B lineage cell
307 http://purl.obolibrary.org/obo/CL_0000000 http://purl.obolibrary.org/obo/CL_0000542 0.226149129142891 lymphocyte cell
324 http://purl.obolibrary.org/obo/CL_0000066 http://purl.obolibrary.org/obo/CL_0000236 0.268130656074674 B cell epithelial cell
384 http://purl.obolibrary.org/obo/CL_0000066 http://purl.obolibrary.org/obo/CL_0000945 0.268130656074674 lymphocyte of B lineage epithelial cell
396 http://purl.obolibrary.org/obo/CL_0000066 http://purl.obolibrary.org/obo/CL_0000542 0.268130656074674 lymphocyte epithelial cell
408 http://purl.obolibrary.org/obo/CL_0000236 http://purl.obolibrary.org/obo/CL_0000945 0.820947768248959 lymphocyte of B lineage B cell
413 http://purl.obolibrary.org/obo/CL_0000236 http://purl.obolibrary.org/obo/CL_0000542 0.716208927004165 lymphocyte B cell
418 http://purl.obolibrary.org/obo/CL_0000945 http://purl.obolibrary.org/obo/CL_0000542 0.716208927004165 lymphocyte lymphocyte of B lineage
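To make the default measure more concrete, the following sketch computes a Resnik similarity with a Seco-style information content on a small toy hierarchy, using base R only. The hierarchy, the term names and the helper functions are assumptions made for illustration; they do not reproduce the real Cell Ontology or the slib implementation, so the values will not match those in Table 5.

```r
# Toy is-a hierarchy (child -> parents); NOT the real Cell Ontology structure.
parents <- list(
  cell            = character(0),
  epithelial_cell = "cell",
  lymphocyte      = "cell",
  B_cell          = "lymphocyte",
  T_cell          = "lymphocyte"
)

# Inclusive ancestors of a term: the term itself plus everything above it.
ancestors <- function(term) {
  unique(c(term, unlist(lapply(parents[[term]], ancestors))))
}
anc <- lapply(setNames(names(parents), names(parents)), ancestors)

# Seco-style information content: 1 - log(descendants + 1) / log(total concepts).
n_concepts <- length(parents)
n_descendants <- sapply(names(parents), function(term) {
  sum(sapply(anc, function(a) term %in% a)) - 1
})
ic <- 1 - log(n_descendants + 1) / log(n_concepts)

# Resnik similarity: information content of the most informative common ancestor.
resnik <- function(term1, term2) {
  max(ic[intersect(anc[[term1]], anc[[term2]])])
}

sim_b_t   <- resnik("B_cell", "T_cell")           # share the ancestor "lymphocyte"
sim_b_epi <- resnik("B_cell", "epithelial_cell")  # share only the root "cell"
```

In this toy graph the two lymphocyte subtypes score higher than the B cell and epithelial cell pair, because their most informative common ancestor subsumes fewer concepts than the root.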

In the following code the semantic similarity between two groups of terms is computed using the ui measure, a direct groupwise measure that combines the intersection and the union of the sets of ancestors of the two groups of concepts.

Similarity(obo, found_terms[1:2], found_terms[3])
## [1] 0.1875
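The idea behind a union/intersection score can be sketched in a few lines of base R. The example below assumes that the score is the number of ancestors shared by the two groups divided by the number of ancestors in their union; the toy hierarchy and the function names are hypothetical and are not part of Onassis or slib.

```r
# Toy is-a hierarchy (child -> parents), made up for illustration.
parents <- list(
  root = character(0),
  a    = "root",
  b    = "root",
  a1   = "a",
  a2   = "a",
  b1   = "b"
)

# Inclusive ancestors of a single term.
ancestors <- function(term) {
  unique(c(term, unlist(lapply(parents[[term]], ancestors))))
}

# Inclusive ancestors of a whole group of terms.
group_ancestors <- function(terms) {
  unique(unlist(lapply(terms, ancestors)))
}

# ui-style score: shared ancestors divided by all ancestors of the two groups.
ui_similarity <- function(group1, group2) {
  anc1 <- group_ancestors(group1)
  anc2 <- group_ancestors(group2)
  length(intersect(anc1, anc2)) / length(union(anc1, anc2))
}

ui_far   <- ui_similarity(c("a1", "a2"), "b1")  # groups share only the root
ui_close <- ui_similarity(c("a1", "a2"), "a1")  # groups share a whole branch
```

Because the two sets of ancestors are compared as a whole, no pairwise similarity between individual terms is ever computed, which is what makes the measure direct.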

Lastly, the pairwise semantic similarity between ChIP-seq samples is illustrated.

annotated_samples <- as.character(as.vector(unique(chipseq_dict_annot$sample_id)))
n <- length(annotated_samples)

# Symmetric matrix holding the similarity of each pair of samples
samples_results <- matrix(0, nrow=n, ncol=n)
rownames(samples_results) <- colnames(samples_results) <- annotated_samples
for(i in seq_len(n - 1)){
  sample1 <- as.character(annotated_samples[i])
  j <- i + 1
  for(k in j:n){
    sample2 <- as.character(annotated_samples[k])
    two_samples_similarity <- Similarity(ontologyfile, sample1, sample2, chipseq_dict_annot)
    samples_results[i, k] <- samples_results[k, i] <- two_samples_similarity
  }
}
# Each sample is identical to itself
diag(samples_results) <- 1
# heatmap.2 is provided by the gplots package
heatmap.2(samples_results, density.info = "none", trace="none", main='Semantic similarity of annotated samples', margins=c(5,5))
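The nested loop above follows a general pattern: evaluate a similarity function once per unordered pair and mirror the result across the diagonal. The sketch below wraps that pattern in a reusable helper. Both pairwise_matrix and toy_sim are hypothetical names introduced here; toy_sim is a stand-in for a call to Similarity so that the example runs without an ontology file.

```r
# Fill a symmetric similarity matrix by evaluating each unordered pair once.
# `sim_fun` is a hypothetical stand-in for any function returning one number.
pairwise_matrix <- function(items, sim_fun) {
  n <- length(items)
  result <- diag(1, n)  # self-similarity set to 1, as in the loop above
  dimnames(result) <- list(items, items)
  if (n > 1) {
    pairs <- combn(n, 2)  # one column per unordered pair of indices
    for (p in seq_len(ncol(pairs))) {
      i <- pairs[1, p]
      k <- pairs[2, p]
      result[i, k] <- result[k, i] <- sim_fun(items[i], items[k])
    }
  }
  result
}

# Toy similarity: Jaccard index on the characters of two strings.
toy_sim <- function(x, y) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  length(intersect(a, b)) / length(union(a, b))
}

toy_matrix <- pairwise_matrix(c("abc", "abd", "xyz"), toy_sim)
```

The explicit check on n keeps the helper well defined when only one item is supplied.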

6 Onassis class

The class Onassis was built to wrap the functionalities of the package in a single class. It consists of 4 slots:

* dictionary: stores the source dictionary used to find entities
* entities: a table containing the annotations of documents (samples) in terms of semantic sets
* similarity: a matrix of the similarities between the unique semantic sets identified in the entities table
* scores: a dataset of quantitative measurements (e.g. gene expression) associated to the samples annotated in entities and separated in the different semantic sets identified in the annotation process.

In this section we illustrate the use of the Onassis class to annotate the previously retrieved metadata. The method annotate takes as input a data frame of metadata to annotate, the type of dictionary and the path of an ontology file and returns an instance of class Onassis.

onassis_annotations <- annotate(sra_chip_seq, 'OBO', obo)

To retrieve the annotations stored in an object of class Onassis, use the accessor method entities.

onassis_entities <- entities(onassis_annotations) 
Table: Entities in Onassis object
sample_id term_id term_name term_url matched_sentence
SRS259409 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS527935 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS359583 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS209011 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS412542 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS474132 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS140210 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS259379 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS396944 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell
SRS470330 CL_0000000 cell http://purl.obolibrary.org/obo/CL_0000000 cell

The filterconcepts method can be used to filter out unwanted annotations. It takes the Onassis object and removes from its entities the undesired concepts.

filtered_onassis <- filterconcepts(onassis_annotations, c('cell'))
Table: Entities in filtered Onassis object
sample_id term_id term_name term_url matched_sentence
26 SRS114716 CL_0000988 hematopoietic cell http://purl.obolibrary.org/obo/CL_0000988 cell, hematopoietic cell
38 SRS150662 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
55 SRS193335 CL_0000988 hematopoietic cell http://purl.obolibrary.org/obo/CL_0000988 cell, hematopoietic cell
74 SRS213465 CL_0000236,CL_0000542 B cell,lymphocyte http://purl.obolibrary.org/obo/CL_0000236,http://purl.obolibrary.org/obo/CL_0000542 cell, B-Lymphocyte, Lymphocyte
84 SRS259401 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-cell, cell
91 SRS266509 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
92 SRS266540 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
118 SRS300080 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-cell, cell
119 SRS300093 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-cell, cell
120 SRS300100 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-cell, cell
123 SRS309788 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
128 SRS334493 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-cell, cell
140 SRS335949 CL_0000236,CL_0000542 B cell,lymphocyte http://purl.obolibrary.org/obo/CL_0000236,http://purl.obolibrary.org/obo/CL_0000542 cell, B-lymphocyte, lymphocyte
146 SRS336079 CL_0000236,CL_0000542 B cell,lymphocyte http://purl.obolibrary.org/obo/CL_0000236,http://purl.obolibrary.org/obo/CL_0000542 cell, B-lymphocyte, lymphocyte
180 SRS362679 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
181 SRS362689 CL_0000236,CL_0000542 B cell,lymphocyte http://purl.obolibrary.org/obo/CL_0000236,http://purl.obolibrary.org/obo/CL_0000542 cell, B-lymphocyte, lymphocyte
206 SRS365917 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
207 SRS365919 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
211 SRS365957 CL_0000540 neuron http://purl.obolibrary.org/obo/CL_0000540 Neuron, cell
216 SRS366055 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
226 SRS371749 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
227 SRS371783 CL_0000066 epithelial cell http://purl.obolibrary.org/obo/CL_0000066 cell, epithelial cell
264 SRS430188 CL_0000081,CL_0000542 blood cell,lymphocyte http://purl.obolibrary.org/obo/CL_0000081,http://purl.obolibrary.org/obo/CL_0000542 Blood || cell, cell, lymphocyte
301 SRS478217 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B cell, cell
336 SRS494681 CL_0000542 lymphocyte http://purl.obolibrary.org/obo/CL_0000542 cell, lymphocyte
379 SRS580028 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-cell, cell
383 SRS606839 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B-cell, cell
427 SRS716511 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B cell, cell
428 SRS716512 CL_0000236 B cell http://purl.obolibrary.org/obo/CL_0000236 B cell, cell

The method sim creates a matrix of the semantic similarities between the annotations of each pair of samples annotated in the entities slot of an Onassis object.

filtered_onassis <- sim(filtered_onassis)

Annotations with semantic similarities above a given threshold can be unified using the method collapse. This method unifies similar annotations by concatenating their unique concepts, and the entities are replaced with the new concatenated annotations. For each concept in the concatenated annotations the number of associated samples is also reported, together with the total number of samples annotated with the new annotations. The similarity slot is updated accordingly.

collapsed_onassis <- Onassis::collapse(filtered_onassis, 0.8)
head(entities(collapsed_onassis))

heatmap.2(simil(collapsed_onassis), margins=c(15,15), cexRow = 1, cexCol = 1)
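One plausible way to picture the unification step is sketched below in base R. It is not the algorithm used by Onassis::collapse: the single-linkage grouping, the helper collapse_annotations and the similarity values are all assumptions chosen for illustration. Annotations whose similarity is at or above the threshold end up in the same group, and their unique concepts are concatenated.

```r
# Toy similarity matrix between four hypothetical annotations (made-up values).
annotations <- c("B cell", "B cell,lymphocyte", "epithelial cell", "neuron")
sim <- matrix(c(1.00, 0.85, 0.20, 0.10,
                0.85, 1.00, 0.25, 0.10,
                0.20, 0.25, 1.00, 0.15,
                0.10, 0.10, 0.15, 1.00),
              nrow = 4, byrow = TRUE,
              dimnames = list(annotations, annotations))

# Hypothetical helper: group annotations by single-linkage clustering on the
# distance 1 - similarity, then concatenate the unique concepts of each group.
collapse_annotations <- function(sim, threshold) {
  tree <- hclust(as.dist(1 - sim), method = "single")
  groups <- cutree(tree, h = 1 - threshold)
  merged <- vapply(split(names(groups), groups), function(members) {
    concepts <- unique(unlist(strsplit(members, ",", fixed = TRUE)))
    paste(sort(concepts), collapse = ",")
  }, character(1))
  # Map every original annotation to the label of its group
  setNames(merged[as.character(groups)], names(groups))
}

collapsed <- collapse_annotations(sim, 0.8)
```

With the threshold of 0.8 used above, only the two B cell annotations are close enough to be merged; the other two annotations keep their own labels.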

7 Session Info

Here is the output of sessionInfo() on the system on which this document was compiled through knitr:

## R version 3.5.1 Patched (2018-07-12 r74967)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
##  [3] LC_TIME=en_US.UTF-8           LC_COLLATE=C                 
##  [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
##  [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
##  [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
## [11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] kableExtra_0.9.0      org.Hs.eg.db_3.6.0    AnnotationDbi_1.42.1 
##  [4] IRanges_2.14.10       S4Vectors_0.18.3      Biobase_2.40.0       
##  [7] BiocGenerics_0.26.0   gplots_3.0.1          DT_0.4               
## [10] Onassis_1.2.3         OnassisJavaLibs_1.2.0 rJava_0.9-10         
## [13] BiocStyle_2.8.2      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18       tidyr_0.8.1        gtools_3.8.1      
##  [4] assertthat_0.2.0   rprojroot_1.3-2    digest_0.6.15     
##  [7] plyr_1.8.4         R6_2.2.2           backports_1.1.2   
## [10] RSQLite_2.1.1      evaluate_0.11      highr_0.7         
## [13] httr_1.3.1         pillar_1.3.0       rlang_0.2.1       
## [16] rstudioapi_0.7     data.table_1.11.4  gdata_2.18.0      
## [19] blob_1.1.1         rmarkdown_1.10     readr_1.1.1       
## [22] stringr_1.3.1      htmlwidgets_1.2    RCurl_1.95-4.11   
## [25] bit_1.1-14         munsell_0.5.0      compiler_3.5.1    
## [28] xfun_0.3           pkgconfig_2.0.1    htmltools_0.3.6   
## [31] tidyselect_0.2.4   tibble_1.4.2       GEOquery_2.48.0   
## [34] bookdown_0.7       viridisLite_0.3.0  crayon_1.3.4      
## [37] dplyr_0.7.6        bitops_1.0-6       DBI_1.0.0         
## [40] magrittr_1.5       scales_0.5.0       KernSmooth_2.23-15
## [43] stringi_1.2.4      bindrcpp_0.2.2     limma_3.36.2      
## [46] xml2_1.2.0         tools_3.5.1        bit64_0.9-7       
## [49] glue_1.3.0         purrr_0.2.5        hms_0.4.2         
## [52] yaml_2.2.0         colorspace_1.3-2   caTools_1.17.1.1  
## [55] rvest_0.3.2        memoise_1.1.0      GEOmetadb_1.42.0  
## [58] knitr_1.20         bindr_0.1.1

References

Galeota, Eugenia, and Mattia Pelizzola. 2016. “Ontology-Based Annotations and Semantic Relations in Large-Scale (Epi) Genomics Data.” Briefings in Bioinformatics. Oxford Univ Press, bbw036.

Harispe, Sébastien, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. 2014. “The Semantic Measures Library and Toolkit: Fast Computation of Semantic Similarity and Relatedness Using Biomedical Ontologies.” Bioinformatics 30 (5). Oxford Univ Press:740–42.

Verspoor, K., W. Baumgartner Jr, C. Roeder, and L. Hunter. 2009. “Abstracting the Types away from a UIMA Type System.” From Form to Meaning: Processing Texts Automatically. Tübingen:Narr, 249–56.