How to make a data resource into a database package for Bioconductor

Why on earth would you want to do this?

Many kinds of data can be expressed as a simple table. So why would you not just use a data.frame to hold all your data?

Data as a single table quickly becomes hard to represent efficiently if the following things happen:

Missing data
Many to one relationships
Many to many relationships

Any of these things will cause your table to become very sparse and inneficient at representing the contents. So when that happens, it makes sense to break the data down into multiple tables. Relational databases like SQLite offer both an efficient means for storing data of this kind as well as a simple language for easily and quickly extracting it. SQL also allows you to recombine the results into single tables as needed.

Querying data using RSQLite and DBI

1st we need a database file. Lets steal one from an installed package

dbFile <- system.file("extdata",
    "TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite",
    package="TxDb.Hsapiens.UCSC.hg19.knownGene")
dbFile

## [1] "/Library/Frameworks/R.framework.devel/Versions/3.0/Resources/library/TxDb.Hsapiens.UCSC.hg19.knownGene/extdata/TxDb.Hsapiens.UCSC.hg19.knownGene.sqlite"

Now let's get a connection

library(RSQLite)

## Loading required package: DBI

txcon <- dbConnect(SQLite(), dbname=dbFile)

There is a handy helper function that lists tables:

dbListTables(txcon)

## [1] "cds"        "chrominfo"  "exon"       "gene"       "metadata"   "splicing"   "transcript"

Aand another one that lists fields for a table

dbListFields(txcon, name="metadata")

## [1] "name"  "value"

And now: let's actually look at some useful SQL queries..

First: let's look at a basic select statement

dbGetQuery(txcon, "SELECT * FROM metadata")

##                                        name                                        value
## 1                                   Db type                                 TranscriptDb
## 2                        Supporting package                              GenomicFeatures
## 3                               Data source                                         UCSC
## 4                                    Genome                                         hg19
## 5                                  Organism                                 Homo sapiens
## 6                                UCSC Table                                    knownGene
## 7                              Resource URL                      http://genome.ucsc.edu/
## 8                           Type of Gene ID                               Entrez Gene ID
## 9                              Full dataset                                          yes
## 10                         miRBase build ID                                       GRCh37
## 11                          transcript_nrow                                        82960
## 12                                exon_nrow                                       289969
## 13                                 cds_nrow                                       237533
## 14                            Db created by    GenomicFeatures package from Bioconductor
## 15                            Creation time 2013-09-19 00:15:15 -0700 (Thu, 19 Sep 2013)
## 16 GenomicFeatures version at creation time                                      1.13.40
## 17         RSQLite version at creation time                                       0.11.4
## 18                          DBSCHEMAVERSION                                          1.0

Now just get the 1st column

dbGetQuery(txcon, "SELECT name FROM metadata")

##                                        name
## 1                                   Db type
## 2                        Supporting package
## 3                               Data source
## 4                                    Genome
## 5                                  Organism
## 6                                UCSC Table
## 7                              Resource URL
## 8                           Type of Gene ID
## 9                              Full dataset
## 10                         miRBase build ID
## 11                          transcript_nrow
## 12                                exon_nrow
## 13                                 cds_nrow
## 14                            Db created by
## 15                            Creation time
## 16 GenomicFeatures version at creation time
## 17         RSQLite version at creation time
## 18                          DBSCHEMAVERSION

So using that, we can explore the chrominfo table like this

dbListFields(txcon, name="chrominfo")

## [1] "_chrom_id"   "chrom"       "length"      "is_circular"

dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo LIMIT 4")

##   chrom    length
## 1  chr1 249250621
## 2  chr2 243199373
## 3  chr3 198022430
## 4  chr4 191154276

And we can restrict our results to only get those chromosomes that are longer than 100000000 using a “WHERE” clause

dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo WHERE length > 100000000")

##    chrom    length
## 1   chr1 249250621
## 2   chr2 243199373
## 3   chr3 198022430
## 4   chr4 191154276
## 5   chr5 180915260
## 6   chr6 171115067
## 7   chr7 159138663
## 8   chr8 146364022
## 9   chr9 141213431
## 10 chr10 135534747
## 11 chr11 135006516
## 12 chr12 133851895
## 13 chr13 115169878
## 14 chr14 107349540
## 15 chr15 102531392
## 16  chrX 155270560

Or we can restrict it to where the chromsome name is 'chrX'

dbGetQuery(txcon, "SELECT chrom,length FROM chrominfo WHERE chrom='chrX'")

##   chrom    length
## 1  chrX 155270560

So far so good. But what if we want to get data from a couple tables at once?

lets look at a couple of different tables. 1st the gene table

dbGetQuery(txcon, "SELECT * FROM gene LIMIT 4")

##   gene_id _tx_id
## 1       1  70455
## 2       1  70456
## 3      10  31944
## 4     100  72132

And now the transcript table

dbGetQuery(txcon, "SELECT * FROM transcript LIMIT 4")

##   _tx_id    tx_name tx_chrom tx_strand tx_start tx_end
## 1      1 uc001aaa.3     chr1         +    11874  14409
## 2      2 uc010nxq.1     chr1         +    11874  14409
## 3      3 uc010nxr.1     chr1         +    11874  14409
## 4      4 uc001aal.1     chr1         +    69091  70008

Now for these two tables, we can do a simple inner join like this

sql <- "SELECT gene_id, tx_name 
        FROM gene,transcript 
        WHERE gene._tx_id=transcript._tx_id 
        LIMIT 4"
dbGetQuery(txcon, sql)

##   gene_id    tx_name
## 1       1 uc002qsd.4
## 2       1 uc002qsf.2
## 3      10 uc003wyw.1
## 4     100 uc002xmj.3

This same join can also be written in another fashion using the AS keyword to generate an alias.

sql <- "SELECT gene_id, tx_name, tx_strand
        FROM gene AS g, transcript AS t 
        WHERE g._tx_id=t._tx_id 
        LIMIT 4"
dbGetQuery(txcon, sql)

##   gene_id    tx_name tx_strand
## 1       1 uc002qsd.4         -
## 2       1 uc002qsf.2         -
## 3      10 uc003wyw.1         +
## 4     100 uc002xmj.3         -

At 1st it may seem like the alias is just a way to make queries shorter, but it is actually useful for subqueries. A subquery is when you need to make a query inside of another query. So for example you may want to combine a couple of queries like this:

sql <- "SELECT * FROM gene AS g,
       (SELECT * FROM transcript WHERE tx_strand='+') AS t
       WHERE  g._tx_id=t._tx_id
       LIMIT 4"
dbGetQuery(txcon, sql)

##     gene_id _tx_id _tx_id    tx_name tx_chrom tx_strand  tx_start    tx_end
## 1        10  31944  31944 uc003wyw.1     chr8         +  18248755  18258723
## 2 100008586  75890  75890 uc004doa.3     chrX         +  49217763  49233491
## 3 100009676  14200  14200 uc003dvg.3     chr3         + 101395274 101398057
## 4     10002  54546  54546 uc002ath.1    chr15         +  72102894  72107270

In the above example, the query “SELECT * FROM transcript WHERE tx_strand='+' is run and then the results of that are joined with the contents of the genes table by the outer query.”

If this all seems like it can get complicated in a hurry that's because it can. But fortunately, there exists another way for joining multiple tables that involves a lot less typing. When you have several tables that all contain the same foreign key, you can join them with the “USING” keyword like this:

sql <- "SELECT * FROM (SELECT * FROM gene, splicing USING(_tx_id)), transcript  USING (_tx_id) LIMIT 4"
dbGetQuery(txcon, sql)

##   gene_id _tx_id exon_rank _exon_id _cds_id    tx_name tx_chrom tx_strand tx_start   tx_end
## 1       1  70455         1   250818  206062 uc002qsd.4    chr19         - 58858172 58864865
## 2       1  70455         2   250817  206061 uc002qsd.4    chr19         - 58858172 58864865
## 3       1  70455         3   250816  206060 uc002qsd.4    chr19         - 58858172 58864865
## 4       1  70455         4   250815  206059 uc002qsd.4    chr19         - 58858172 58864865

Another query for joining these three tables would look like this:

sql <- "SELECT * FROM gene AS g, transcript AS t, splicing AS s WHERE g._tx_id=s._tx_id AND s._tx_id=t._tx_id LIMIT 4"
dbGetQuery(txcon, sql)

##   gene_id _tx_id _tx_id    tx_name tx_chrom tx_strand tx_start   tx_end _tx_id exon_rank _exon_id _cds_id
## 1       1  70455  70455 uc002qsd.4    chr19         - 58858172 58864865  70455         1   250818  206062
## 2       1  70455  70455 uc002qsd.4    chr19         - 58858172 58864865  70455         2   250817  206061
## 3       1  70455  70455 uc002qsd.4    chr19         - 58858172 58864865  70455         3   250816  206060
## 4       1  70455  70455 uc002qsd.4    chr19         - 58858172 58864865  70455         4   250815  206059

Of course there are other many other keywords that you can learn about as well, but this should be enough to get you started.

Exercises for Querying Databases from R.

Exercise 1: Use what you have learned to extract out the names and chromosomes from the transcript table.

Exercise 2: Now look at the splicing and exon tables. Notice how the splicing table has the same field “_exon_id” as the exon table. Use that field to join the contents of the two tables.

Exercise 3: Now look at the transcript table again. Notice that it too has a key that can be used to join to the splicing table. Now use the ALIAS feature described above along with using () to group a subequery so that you can merge the exon, splicing AND transcript table into a single result.

[ Back to top ]

Creating tables

1st lets look at some of the data we want to database.

gfl <- system.file("extdata","gene_names.txt", package="UnderstandingRBioc")
gene_names <- read.table(gfl, header=TRUE)
head(gene_names)

##   gene_id symbol                          name
## 1 3979178   ND4L NADH dehydrogenase subunit 4L
## 2 3979179    ND4  NADH dehydrogenase subunit 4
## 3 3979180    ND5  NADH dehydrogenase subunit 5
## 4 3979181    ND6  NADH dehydrogenase subunit 6
## 5 3979182   CYTB                  cytochrome b
## 6 3979183    ND1  NADH dehydrogenase subunit 1


cfl <- system.file("extdata","chroms.txt", package="UnderstandingRBioc")
chroms <- read.table(cfl, header=TRUE)
head(chroms)

##   gene_id chromosome
## 1 3979178         MT
## 2 3979179         MT
## 3 3979180         MT
## 4 3979181         MT
## 5 3979182         MT
## 6 3979183         MT


pfl <- system.file("extdata","pmids.txt", package="UnderstandingRBioc")
pmids <- read.table(pfl, header=TRUE)
head(pmids)

##   gene_id     pmid
## 1 3979178 17654009
## 2 3979179 17654009
## 3 3979180 17654009
## 4 3979181 17654009
## 5 3979182 17654009
## 6 3979183 17654009

Notice that for each of these data.frames, there is a common column that we want to use as a foreign key…

Next we need a new database. One that we can write into. The following code will spawn a DB in memory.

library(RSQLite)
dbFile <- sprintf("%s.sqlite", tempfile())
dbFile

## [1] "/var/folders/mq/fq9btv5n4wjclr32gn155fn80000gp/T//RtmpKZVyvm/file2722ebc6c88.sqlite"

con <- dbConnect(SQLite(), dbname=dbFile)

And once you create a table as below: your DB will exist on the disk as well.

Create table statements allow us to create a table.

sql <- "CREATE TABLE gene_info (
            gene_id TEXT,
            gene_symbol TEXT,
            gene_name TEXT
        )"
dbGetQuery(con, sql)

## NULL

You can easily use the helpers from before to verify that table is there

dbListTables(con)

## [1] "gene_info"

And look at its fields as well

dbListFields(con, name="gene_info")

## [1] "gene_id"     "gene_symbol" "gene_name"

Exercise for Creating Tables.

Exercise 4: Now that you know how to make a table, make one for the chrom and pmid data.frames above. Be sure to include all the fields that you want to populate.

[ Back to top ]

Populating tables

Lets start by paying close attention to the names of our data.frame. Notice how each column is named.

head(gene_names)

##   gene_id symbol                          name
## 1 3979178   ND4L NADH dehydrogenase subunit 4L
## 2 3979179    ND4  NADH dehydrogenase subunit 4
## 3 3979180    ND5  NADH dehydrogenase subunit 5
## 4 3979181    ND6  NADH dehydrogenase subunit 6
## 5 3979182   CYTB                  cytochrome b
## 6 3979183    ND1  NADH dehydrogenase subunit 1

Now we are going to insert that entire data.frame into the table we created above. But we are going to carefully name the columns of the data.frame that we want to insert by using their names like this:

sql <- "INSERT INTO gene_info
        VALUES ($gene_id, $symbol, $name)"
dbBeginTransaction(con)

## [1] TRUE

dbGetPreparedQuery(con, sql, bind.data = gene_names)

## NULL

dbCommit(con)

## [1] TRUE

Exercises for Populating Tables.

Exercise 5: Now take the next step and put the data from the pmid and chrom data.frames into your database. If you have not already done so, you should also make the gene_info table described above. Once you have done this. take a minute to make sure that the data you have inserted is present by writing some queries to extract it.

[ Back to top ]

Now we are going to talk about how we can formally make our database into an actual R object that people can use with select() etc.

Creating an AnnotationDb object

You always have to put the “Db type” and the “Supporting package” into a metadata table for your package. These fields are required by the loadDb supporting method so that it can spawn an AnnotationDb based object for you from your database. The 1st of these tells loadDb what kind of object it is supposed to make and the 2nd tells it where the methods for that object will be found. In this case we know what the “Db type” will be since it will be the new object type we want to define. But it is not yet clear what the “Supporting package” will be since we haven't made a package to hold our new methods yet. So just put AnnotationDbi in as a placeholder for now.

dbGetQuery(con, "CREATE Table metadata (name TEXT, value TEXT)")

## NULL

dbGetQuery(con, "INSERT INTO metadata VALUES ('Db type','myHamsterDb')")

## NULL

dbGetQuery(con, "INSERT INTO metadata VALUES ('Supporting package',
        'AnnotationDbi')")

## NULL

Test for success:

dbGetQuery(con, "SELECT * FROM metadata")

##                 name         value
## 1            Db type   myHamsterDb
## 2 Supporting package AnnotationDbi

Then declare the class and then use loadDb to make an instance of your object.

library(AnnotationDbi)

## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply,
##     parLapply, parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, append, as.data.frame, as.vector, cbind, colnames,
##     duplicated, eval, get, intersect, lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     rank, rbind, rep.int, rownames, sapply, setdiff, sort, table, tapply, union, unique, unlist
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

setRefClass("myHamsterDb", contains="AnnotationDb")
myHamster.db <- loadDb(dbFile)

Exercise for Metadata.

Exercise 6: All databases that intend to eventually use the select method need to also have a metadata table with a name and value field. Run the code above to make sure you have one too. But then look at the original transcriptDb (txcon) database that we started with and examine the kind of data that is found in it's metadata table. What other kinds of data would be good to include here?

[ Back to top ]

Writing accessors

So once you have an object, you will want to make convenient accessors. There are four that most annotation packages support: 'collumns', 'keys', 'keytypes' and 'select'. The columns method normally just returns potential fields that can concievably be returned from the AnnotationDb object when using select. Here is how we can implement that method. First of all we should define a function that knows how to access the connection to our AnnotationDb object and then extract the fields from it.

.cols <- function(x)
{
    con <- AnnotationDbi:::dbConn(x)
    list <- dbListTables(con)
    unwanted <- c("metadata")
    list <- list[!list %in% unwanted]
    fields <- as.vector(sapply(list, dbListFields, con=con))
    toupper(fields)
}

Then simply wrap this into a method

setMethod("columns", "myHamsterDb", function(x){.cols(x)})

## [1] "columns"

Then we can call it like this

columns(myHamster.db)

## [1] "GENE_ID"     "GENE_SYMBOL" "GENE_NAME"

Exercise for Writing Accessors.

Exercise 7: The keytypes method often returns the same results as the columns method. Assuming that this is the case this time create a keytypes method.

[ Back to top ]

ANSWERS to exercises

Exercise 1:

dbGetQuery(txcon, "SELECT tx_name,tx_chrom FROM transcript LIMIT 4")

##      tx_name tx_chrom
## 1 uc001aaa.3     chr1
## 2 uc010nxq.1     chr1
## 3 uc010nxr.1     chr1
## 4 uc001aal.1     chr1

Exercise 2:

dbGetQuery(txcon, 
  "SELECT * FROM splicing, exon WHERE splicing._exon_id=exon._exon_id LIMIT 4")

##   _tx_id exon_rank _exon_id _cds_id _exon_id exon_name exon_chrom exon_strand exon_start exon_end
## 1      1         1        1      NA        1      <NA>       chr1           +      11874    12227
## 2      1         2        3      NA        3      <NA>       chr1           +      12613    12721
## 3      1         3        5      NA        5      <NA>       chr1           +      13221    14409
## 4      3         1        1      NA        1      <NA>       chr1           +      11874    12227

Exercise 3:

dbGetQuery(txcon,
   "SELECT * FROM transcript AS t,
   (SELECT * FROM splicing,exon WHERE splicing._exon_id=exon._exon_id) AS e
   WHERE e._tx_id=t._tx_id LIMIT 4")

##   _tx_id    tx_name tx_chrom tx_strand tx_start tx_end _tx_id exon_rank _exon_id _cds_id _exon_id:1 exon_name exon_chrom
## 1      1 uc001aaa.3     chr1         +    11874  14409      1         1        1      NA          1      <NA>       chr1
## 2      1 uc001aaa.3     chr1         +    11874  14409      1         2        3      NA          3      <NA>       chr1
## 3      1 uc001aaa.3     chr1         +    11874  14409      1         3        5      NA          5      <NA>       chr1
## 4      3 uc010nxr.1     chr1         +    11874  14409      3         1        1      NA          1      <NA>       chr1
##   exon_strand exon_start exon_end
## 1           +      11874    12227
## 2           +      12613    12721
## 3           +      13221    14409
## 4           +      11874    12227

Exercise 4:

sql <- "CREATE TABLE chrom (
            gene_id TEXT,
            chromosome TEXT
        )"
dbGetQuery(con, sql)

## NULL

sql <- "CREATE TABLE pmid (
            gene_id TEXT,
            pmid TEXT
        )"
dbGetQuery(con, sql)

## NULL

Exercise 5:

sql <- "INSERT INTO chrom
        VALUES ($gene_id, $chromosome)"
dbBeginTransaction(con)

## [1] TRUE

dbGetPreparedQuery(con, sql, bind.data = chroms)

## NULL

dbCommit(con)

## [1] TRUE

sql <- "INSERT INTO pmid
        VALUES ($gene_id, $pmid)"
dbBeginTransaction(con)

## [1] TRUE

dbGetPreparedQuery(con, sql, bind.data = pmids)

## NULL

dbCommit(con)

## [1] TRUE

dbGetQuery(con, "SELECT * FROM chrom LIMIT 4")

##   gene_id chromosome
## 1 3979178         MT
## 2 3979179         MT
## 3 3979180         MT
## 4 3979181         MT

dbGetQuery(con, "SELECT * FROM pmid LIMIT 4")

##   gene_id     pmid
## 1 3979178 17654009
## 2 3979179 17654009
## 3 3979180 17654009
## 4 3979181 17654009

Exercise 6:

This is the code that will create a metadata table for you:

dbGetQuery(con, "CREATE Table metadata (name TEXT, value TEXT)")
dbGetQuery(con, "INSERT INTO metadata VALUES ('Db type','myHamsterDb')")
dbGetQuery(con, "INSERT INTO metadata VALUES ('Supporting package',
        'AnnotationDbi')")

And looking at the other metadata table we see all sorts of interesting stuff:

dbGetQuery(txcon, "SELECT * FROM metadata")

##                                        name                                        value
## 1                                   Db type                                 TranscriptDb
## 2                        Supporting package                              GenomicFeatures
## 3                               Data source                                         UCSC
## 4                                    Genome                                         hg19
## 5                                  Organism                                 Homo sapiens
## 6                                UCSC Table                                    knownGene
## 7                              Resource URL                      http://genome.ucsc.edu/
## 8                           Type of Gene ID                               Entrez Gene ID
## 9                              Full dataset                                          yes
## 10                         miRBase build ID                                       GRCh37
## 11                          transcript_nrow                                        82960
## 12                                exon_nrow                                       289969
## 13                                 cds_nrow                                       237533
## 14                            Db created by    GenomicFeatures package from Bioconductor
## 15                            Creation time 2013-09-19 00:15:15 -0700 (Thu, 19 Sep 2013)
## 16 GenomicFeatures version at creation time                                      1.13.40
## 17         RSQLite version at creation time                                       0.11.4
## 18                          DBSCHEMAVERSION                                          1.0

Exercise 7:

Since the keytypes method will be the same as columns, we can just use the same underlying function here.

setMethod("keytypes", "myHamsterDb", function(x){.cols(x)})

## [1] "keytypes"

Then we can call it like this

keytypes(myHamster.db)

## [1] "C(\"GENE_ID\", \"CHROMOSOME\")"                 "C(\"GENE_ID\", \"GENE_SYMBOL\", \"GENE_NAME\")"
## [3] "C(\"GENE_ID\", \"PMID\")"