In recent years a wealth of biological data has become available in public data repositories. Easy access to these valuable data resources and firm integration with data analysis is needed for comprehensive bioinformatics data analysis. The biomaRt package, provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, Uniprot and HapMap. These major databases give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R.
Every analysis with biomaRt starts with selecting a BioMart database to use. A first step is to check which BioMart web services are available. The function
listMarts() will display all available BioMart web services
## biomart version ## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 100 ## 2 ENSEMBL_MART_MOUSE Mouse strains 100 ## 3 ENSEMBL_MART_SNP Ensembl Variation 100 ## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 100
useMart() function can now be used to connect to a specified BioMart database, this must be a valid name given by
listMarts(). In the next example we choose to query the Ensembl BioMart database.
ensembl <- useMart("ensembl")
BioMart databases can contain several datasets, for Ensembl every species is a different dataset. In a next step we look at which datasets are available in the selected BioMart by using the function
listDatasets(). Note: use the function
head() to display only the first 5 entries as the complete list has 203 entries.
datasets <- listDatasets(ensembl) head(datasets)
## dataset description version ## 1 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2) fAstCal1.2 ## 2 acarolinensis_gene_ensembl Anole lizard genes (AnoCar2.0) AnoCar2.0 ## 3 acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2) bAquChr1.2 ## 4 acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5) Midas_v5 ## 5 amelanoleuca_gene_ensembl Panda genes (ailMel1) ailMel1 ## 6 amexicanus_gene_ensembl Mexican tetra genes (Astyanax_mexicanus-2.0) Astyanax_mexicanus-2.0
To select a dataset we can update the
Mart object using the function
useDataset(). In the example below we choose to use the hsapiens dataset.
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
Or alternatively if the dataset one wants to use is known in advance, we can select a BioMart database and dataset in one step by:
ensembl = useMart("ensembl",dataset="hsapiens_gene_ensembl")
getBM() function has three arguments that need to be introduced: filters, attributes and values.
Filters define a restriction on the query. For example you want to restrict the output to all genes located on the human X chromosome then the filter chromosome_name can be used with value ‘X’. The
listFilters() function shows you all available filters in the selected dataset.
filters = listFilters(ensembl) filters[1:5,]
## name description ## 1 chromosome_name Chromosome/scaffold name ## 2 start Start ## 3 end End ## 4 strand Strand ## 5 chromosomal_region e.g. 1:100:10000:-1, 1:100000:200000:1
Attributes define the values we are interested in to retrieve. For example we want to retrieve the gene symbols or chromosomal coordinates. The
listAttributes() function displays all available attributes in the selected dataset.
attributes = listAttributes(ensembl) attributes[1:5,]
## name description page ## 1 ensembl_gene_id Gene stable ID feature_page ## 2 ensembl_gene_id_version Gene stable ID version feature_page ## 3 ensembl_transcript_id Transcript stable ID feature_page ## 4 ensembl_transcript_id_version Transcript stable ID version feature_page ## 5 ensembl_peptide_id Protein stable ID feature_page
getBM() function is the main query function in biomaRt. It has four main arguments:
attributes: is a vector of attributes that one wants to retrieve (= the output of the query).
filters: is a vector of filters that one wil use as input to the query.
values: a vector of values for the filters. In case multple filters are in use, the values argument requires a list of values where each position in the list corresponds to the position of the filters in the filters argument (see examples below).
mart: is an object of class
Mart, which is created by the
Now that we selected a BioMart database and dataset, and know about attributes, filters, and the values for filters; we can build a biomaRt query. Let’s make an easy query for the following problem: We have a list of Affymetrix identifiers from the u133plus2 platform and we want to retrieve the corresponding EntrezGene identifiers using the Ensembl mappings.
The u133plus2 platform will be the filter for this query and as values for this filter we use our list of Affymetrix identifiers. As output (attributes) for the query we want to retrieve the EntrezGene and u133plus2 identifiers so we get a mapping of these two identifiers as a result. The exact names that we will have to use to specify the attributes and filters can be retrieved with the
listFilters() function respectively. Let’s now run the query:
affyids=c("202763_at","209310_s_at","207500_at") getBM(attributes=c('affy_hg_u133_plus_2', 'entrezgene_id'), filters = 'affy_hg_u133_plus_2', values = affyids, mart = ensembl)
## affy_hg_u133_plus_2 entrezgene_id ## 1 209310_s_at 837 ## 2 207500_at 838 ## 3 202763_at 836
listFilters() will return every available option for their respective types. However, this can be unwieldy when the list of results is long, involving much scrolling to find the entry you are interested in. biomaRt also provides the functions
searchFilters() which will try to find any entries matching a specific term or pattern. For example, if we want to find the details of any datasets in our
ensembl mart that contain the term ‘hsapiens’ we could do the following:
searchDatasets(mart = ensembl, pattern = "hsapiens")
## dataset description version ## 79 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13
You can search in a simlar fashion to find available attributes and filters that you may be interested in. The example below returns the details for all attributes that contain the pattern ‘hgnc’.
searchAttributes(mart = ensembl, pattern = "hgnc")
## name description page ## 63 hgnc_id HGNC ID feature_page ## 64 hgnc_symbol HGNC symbol feature_page ## 93 hgnc_trans_name Transcript name ID feature_page
For advanced use, note that the pattern argument takes a regular expression. This means you can create more complex queries if required. Imagine, for example, that we have the string ENST00000577249.1, which we know is an Ensembl ID of some kind, but we aren’t sure what the appropriate filter term is. The example shown next uses a pattern that will find all filters that contain both the terms ‘ensembl’ and ‘id’. This allows use to reduced the list of filters to only those that might be appropriate for our example.
searchFilters(mart = ensembl, pattern = "ensembl.*id")
## name description ## 50 ensembl_gene_id Gene stable ID(s) [e.g. ENSG00000000003] ## 51 ensembl_gene_id_version Gene stable ID(s) with version [e.g. ENSG00000000003.15] ## 52 ensembl_transcript_id Transcript stable ID(s) [e.g. ENST00000000233] ## 53 ensembl_transcript_id_version Transcript stable ID(s) with version [e.g. ENST00000000233.10] ## 54 ensembl_peptide_id Protein stable ID(s) [e.g. ENSP00000000233] ## 55 ensembl_peptide_id_version Protein stable ID(s) with version [e.g. ENSP00000000233.5] ## 56 ensembl_exon_id Exon ID(s) [e.g. ENSE00000000001]
From this we can see that ENST00000577249.1 is a Transcript ID with version, and the appropriate filter value to use with it is
Many filters have a predefined list of values that are known to be in the dataset they are associated with. An common example would be the names of chromosomes when searching a dataset at Ensembl. In this online interface to BioMart these available options are displayed as a list as shown in Figure 1.
You can list this in an R session with the function
listFilterValues(), passing it a mart object and the name of the filter. For example, to list the possible chromosome names you could run the following:
listFilterValues(mart = ensembl, filter = "chromosome_name")
It is also possible to search the list of available values via
searchFilterValues(). In the examples below, the first returns all chromosome names starting with “GL”, which the second will find any phenotype descriptions that contain the string “Crohn”.
searchFilterValues(mart = ensembl, filter = "chromosome_name", pattern = "^GL") searchFilterValues(mart = ensembl, filter = "phenotype_description", pattern = "Crohn")
For large BioMart databases such as Ensembl, the number of attributes displayed by the
listAttributes() function can be very large.
In BioMart databases, attributes are put together in pages, such as sequences, features, homologs for Ensembl.
An overview of the attributes pages present in the respective BioMart dataset can be obtained with the
pages = attributePages(ensembl) pages
##  "feature_page" "structure" "homologs" "snp" "snp_somatic" "sequences"
To show us a smaller list of attributes which belong to a specific page, we can now specify this in the
listAttributes() function. The set of attributes is still quite long, so we use
head() to show only the first few items here.
## name description page ## 1 ensembl_gene_id Gene stable ID feature_page ## 2 ensembl_gene_id_version Gene stable ID version feature_page ## 3 ensembl_transcript_id Transcript stable ID feature_page ## 4 ensembl_transcript_id_version Transcript stable ID version feature_page ## 5 ensembl_peptide_id Protein stable ID feature_page ## 6 ensembl_peptide_id_version Protein stable ID version feature_page
We now get a short list of attributes all of which are related to the region where the genes are located.
To save time and computing resources biomaRt will attempt to identify when you are re-running a query you have executed before. Each time a new query is run, the results will be saved to a cache on your computer. If a query is identified as having been run previously, rather than submitting the query to the server, the results will be loaded from the cache.
You can get some information on the size and location of the cache using the function
## biomaRt cache ## - Location: /home/biocbuild/.cache/biomaRt ## - No. of files: 434 ## - Total size: 63.8 Mb
The cache can be deleted using the command
biomartCacheClear(). This will remove all cached files.
There are a small number of non-Ensembl databases that offer a BioMart interface to their data. The biomaRt package can be used to access these in a very similar fashion to Ensembl. The majority of biomaRt functions will work in the same manner, but the construction of the initial Mart object requires slightly more setup. In this section we demonstrate the setting requires to query Wormbase ParaSite and Phytozome
To demonstrate the use of the biomaRt package with non-Ensembl databases the next query is performed using the Wormbase ParaSite BioMart. In this example, we use the
listMarts() function to find the name of the available marts, given the URL of Wormbase. We use this to connect to Wormbase BioMart using the
useMart() function. Note that we use the
https address and must provide the port as
443. Queries to WormBase will fail without these options.
listMarts(host = "parasite.wormbase.org")
## biomart version ## 1 parasite_mart ParaSite Mart
wormbase <- useMart(biomart = "parasite_mart", host = "https://parasite.wormbase.org", port = 443)
We can then use functions described earlier in this vignette to find and select the gene dataset, and print the first 6 available attributes and filters. Then we use a list of gene names as filter and retrieve associated transcript IDs and the transcript biotype.
## dataset description version ## 1 wbps_gene All Species (WBPS14) 14
wormbase <- useDataset(mart = wormbase, dataset = "wbps_gene") head(listFilters(wormbase))
## name description ## 1 species_id_1010 Genome ## 2 nematode_clade_1010 Nematode Clade ## 3 chromosome_name Chromosome name ## 4 start Start ## 5 end End ## 6 strand Strand
## name description page ## 1 species_id_key Internal Name feature_page ## 2 production_name_1010 Genome project feature_page ## 3 display_name_1010 Genome name feature_page ## 4 taxonomy_id_1010 Taxonomy ID feature_page ## 5 assembly_accession_1010 Assembly accession feature_page ## 6 nematode_clade_1010 Nematode clade feature_page
getBM(attributes = c("external_gene_id", "wbps_transcript_id", "transcript_biotype"), filters = "gene_name", values = c("unc-26","his-33"), mart = wormbase)
## external_gene_id wbps_transcript_id transcript_biotype ## 1 his-33 F17E9.13.1 protein_coding ## 2 unc-26 JC8.10a.1 protein_coding ## 3 unc-26 JC8.10b.1 protein_coding ## 4 unc-26 JC8.10c.1 protein_coding ## 5 unc-26 JC8.10c.2 protein_coding ## 6 unc-26 JC8.10d.1 protein_coding
The Phytozome BioMart can be accessed with the following settings. Note that we use the
https address. Queries to Phytozome will fail without specifying this.
listMarts(host = "https://phytozome.jgi.doe.gov")
## biomart version ## 1 phytozome_mart V12 Genomes and Families ## 2 phytozome_diversity__mart V12 Genome Diversity ## 3 phytozome_mart_archive Genome Archive
phytozome <- useMart(biomart = "phytozome_mart", host = "https://phytozome.jgi.doe.gov") listDatasets(phytozome)
## dataset description version ## 1 phytozome Phytozome 12 Genomes ## 2 phytozome_clusters Phytozome 12 Families
wormbase <- useDataset(mart = phytozome, dataset = "phytozome")
Once this is set up the usual biomaRt functions can be used to interogate the database options and run queries.
getBM(attributes = c("organism_name", "gene_name1"), filters = "gene_name_filter", values = "82092", mart = phytozome)
## Error in martCheck(mart): No dataset selected, please select a dataset first. You can see the available datasets by using the listDatasets function see ?listDatasets for more information. Then you should create the Mart object by using the useMart function. See ?useMart for more information
Version 13 of Phyotozome can be found at https://phytozome-next.jgi.doe.gov/ and if you wish to query that version the URL used to create the Mart object must reflect that.
phytozome_v13 <- useMart(biomart = "phytozome_mart", dataset = "phytozome", host = "https://phytozome-next.jgi.doe.gov")
The biomaRt package can be used with a local install of a public BioMart database or a locally developed BioMart database and web service.
In order for biomaRt to recognize the database as a BioMart, make sure that the local database you create has a name conform with
database_mart_version where database is the name of the database and version is a version number. No more underscores than the ones showed should be present in this name. A possible name is for example
## Minimum requirements for local database installation
More information on installing a local copy of a BioMart database or develop your own BioMart database and webservice can be found on http://www.biomart.org
Once the local database is installed you can use biomaRt on this database by:
listMarts(host="www.myLocalHost.org", path="/myPathToWebservice/martservice") mart=useMart("nameOfMyMart",dataset="nameOfMyDataset",host="www.myLocalHost.org", path="/myPathToWebservice/martservice")
For more information on how to install a public BioMart database see: http://www.biomart.org/install.html and follow link databases.
In order to provide a more consistent interface to all annotations in
keys() have been implemented to wrap
some of the existing functionality above. These methods can be called
in the same manner that they are used in other parts of the project
except that instead of taking a
AnnotationDb derived class
they take instead a
Mart derived class as their 1st argument.
Otherwise usage should be essentially the same. You still use
columns() to discover things that can be extracted from a
keytypes() to discover which things can be
used as keys with
mart <- useMart(dataset="hsapiens_gene_ensembl",biomart='ensembl') head(keytypes(mart), n=3)
##  "affy_hc_g110" "affy_hg_focus" "affy_hg_u133_plus_2"
##  "3_utr_end" "3_utr_end" "3_utr_start"
And you still can use
keys() to extract potential keys, for a
particular key type.
k = keys(mart, keytype="chromosome_name") head(k, n=3)
##  "1" "2" "3"
keys(), you can even take advantage of the extra
arguments that are available for others keys methods.
k = keys(mart, keytype="chromosome_name", pattern="LRG") head(k, n=3)
keys() method will not work with all key
types because they are not all supported.
But you can still use
select() here to extract columns of data
that match a particular set of keys (this is basically a wrapper for
affy=c("202763_at","209310_s_at","207500_at") select(mart, keys=affy, columns=c('affy_hg_u133_plus_2','entrezgene_id'), keytype='affy_hg_u133_plus_2')
## affy_hg_u133_plus_2 entrezgene_id ## 1 209310_s_at 837 ## 2 207500_at 838 ## 3 202763_at 836
So why would we want to do this when we already have functions like
getBM()? For two reasons: 1) for people who are familiar
with select and it’s helper methods, they can now proceed to use biomaRt making the same kinds of calls that are already familiar to them and 2) because the select method is implemented in many places elsewhere, the fact that these methods are shared allows for more convenient programmatic access of all these resources. An example of a package that takes advantage of this is the OrganismDbi package. Where several packages can be accessed as if they were one resource.
Note: if the function
useMart() runs into proxy problems you should set your proxy first before calling any biomaRt functions.
You can do this using the Sys.putenv command:
Sys.setenv("http_proxy" = "http://my.proxy.org:9999")
Some users have reported that the workaround above does not work, in this case an alternative proxy solution below can be tried:
options(RCurlOptions = list(proxy="uscache.kcc.com:80",proxyuserpwd="------:-------"))
## R version 4.0.2 (2020-06-22) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: Ubuntu 18.04.4 LTS ## ## Matrix products: default ## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so ## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so ## ## locale: ##  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 ##  LC_COLLATE=C LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 ##  LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C ##  LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ##  stats graphics grDevices utils datasets methods base ## ## other attached packages: ##  biomaRt_2.45.2 BiocStyle_2.17.0 ## ## loaded via a namespace (and not attached): ##  progress_1.2.2 tidyselect_1.1.0 xfun_0.15 purrr_0.3.4 ##  vctrs_0.3.1 generics_0.0.2 htmltools_0.5.0 stats4_4.0.2 ##  BiocFileCache_1.13.0 yaml_2.2.1 blob_1.2.1 XML_3.99-0.4 ##  rlang_0.4.6 pillar_1.4.4 glue_1.4.1 DBI_1.1.0 ##  rappdirs_0.3.1 BiocGenerics_0.35.4 bit64_0.9-7 dbplyr_1.4.4 ##  lifecycle_0.2.0 stringr_1.4.0 codetools_0.2-16 memoise_1.1.0 ##  evaluate_0.14 Biobase_2.49.0 knitr_1.29 IRanges_2.23.10 ##  parallel_4.0.2 curl_4.3 AnnotationDbi_1.51.1 highr_0.8 ##  Rcpp_220.127.116.11 openssl_1.4.2 BiocManager_1.30.10 S4Vectors_0.27.12 ##  bit_1.1-15.2 hms_0.5.3 askpass_1.1 digest_0.6.25 ##  stringi_1.4.6 bookdown_0.20 dplyr_1.0.0 tools_4.0.2 ##  magrittr_1.5 RSQLite_2.2.0 tibble_3.0.1 crayon_1.3.4 ##  pkgconfig_2.0.3 ellipsis_0.3.1 prettyunits_1.1.1 assertthat_0.2.1 ##  rmarkdown_2.3 httr_1.4.1 R6_2.4.1 compiler_4.0.2